stoggi closed this issue 6 years ago
After thinking about it, incrementing `_last_event_timestamp` is not a good solution. For example, if there are 15 events in the same millisecond but the first batch only contained the first 5, then the second batch should return the remaining 10 events with that same timestamp, and incrementing the timestamp would skip them.
@stoggi so what's the desired outcome here, do you think? From what I've seen it's very rare to have events at the exact same ms, but I could be wrong.
@ryandeivert I'm not sure exactly. We could store the last n event ids, and de-duplicate them.
It's not really clear what their API does with the timestamps. According to the docs, they only specify seconds:
https://developers.google.com/admin-sdk/reports/v1/reference/activities/list

> `items[].id.time` (datetime): Time of occurrence of the activity. This is in UNIX epoch time in seconds.
It must support fractional seconds, but to what precision? RFC 3339 (https://www.ietf.org/rfc/rfc3339.txt) also doesn't specify the fractional precision for a format like `2010-10-28T10:26:35.000Z`.
I had a go at incrementing the timestamp, and ended up with something like this:

```python
if not self._next_page_token:
    # Increment the last event time by one millisecond to exclude it from the next poll
    last_event_time = datetime.strptime(activities[0]['id']['time'], self.date_formatter())
    last_event_time += timedelta(milliseconds=1)
    self._last_timestamp = last_event_time.strftime(self.date_formatter())
    LOGGER.debug('Caching last timestamp: %s', self._last_timestamp)
```
I also had to modify `date_formatter()` to add `.%f`:
```python
@classmethod
def date_formatter(cls):
    """Return a format string for a date, ie: 2010-10-28T10:26:35.000Z"""
    return '%Y-%m-%dT%H:%M:%S.%fZ'
```
Python actually uses microseconds for `%f`, so you end up with `2010-10-28T10:26:35.000000Z`:
https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
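A quick standalone round-trip shows both points: `strptime` with `%f` happily parses a three-digit fraction, the increment works, but `strftime` renders the fraction back as six digits of microseconds (this is just a sketch, not the app code):

```python
from datetime import datetime, timedelta

# Format string matching the modified date_formatter()
FMT = '%Y-%m-%dT%H:%M:%S.%fZ'

# Parse a GSuite-style timestamp, bump it by one millisecond, and re-format
last_event_time = datetime.strptime('2010-10-28T10:26:35.000Z', FMT)
last_event_time += timedelta(milliseconds=1)

# %f always emits six digits on output, so we get microsecond precision back
print(last_event_time.strftime(FMT))  # 2010-10-28T10:26:35.001000Z
```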
@stoggi are you having success with the incrementing approach you've listed? I have been able to verify the issue but not test the fix you've outlined. I have this 'in progress' now but haven't really moved on it. Would you like to contribute the fix? If so, I can reassign.
Yep sure. I'm happy to contribute a PR for this.
The fix does exclude the last event from the next API query, although I don't know if I am missing any events.
Perhaps the simplest way is to record the unique ids (`items[].id.uniqueQualifier`) of all events with the same timestamp as the last event, then filter them out in the next run.
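A rough sketch of that idea (the helper names `filter_duplicates` and `cache_state` are hypothetical, not part of the current app; it assumes the API returns activities newest-first, as the existing code's use of `activities[0]` implies):

```python
def cache_state(activities):
    """Record the timestamp of the newest event plus the uniqueQualifier of
    every event sharing that timestamp (assumes newest-first ordering)."""
    last_timestamp = activities[0]['id']['time']
    last_event_ids = {
        event['id']['uniqueQualifier']
        for event in activities
        if event['id']['time'] == last_timestamp
    }
    return last_timestamp, last_event_ids


def filter_duplicates(activities, last_timestamp, last_event_ids):
    """Drop events already seen in the previous poll: anything whose
    timestamp and uniqueQualifier were cached by the last run."""
    return [
        event for event in activities
        if not (event['id']['time'] == last_timestamp
                and event['id']['uniqueQualifier'] in last_event_ids)
    ]
```

The increment approach risks dropping unseen events in the same millisecond; this keeps `startTime` inclusive and instead discards only the exact events already processed.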
Awesome, thank you!! I like the idea of sorta de-duping, but the current implementation doesn't easily allow for storing arbitrary state info. I'm open to ideas to support this as well, if you're up for it! The current state object will always be:
```json
{
    "current_state": "...",
    "last_timestamp": "..."
}
```
and is saved as json, performed here:
Theoretically we could add a third top-level key of `custom_state` or something similar to dump a custom state object to... thoughts?
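One way that could look when serialized (the `custom_state` key and its contents are hypothetical, just illustrating the idea):

```python
import json

# Hypothetical extended state object: the two existing keys plus a
# free-form custom_state dict that each app can use however it likes
state = {
    'current_state': '...',
    'last_timestamp': '2010-10-28T10:26:35.000Z',
    'custom_state': {
        # e.g. unique ids of events sharing the last timestamp
        'last_event_ids': ['-3618646175234222946'],
    },
}

# Round-trips through JSON like the current state object does
restored = json.loads(json.dumps(state))
print(restored['custom_state']['last_event_ids'])
```

Since the state is already persisted as JSON, any JSON-serializable value would survive the round trip unchanged.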
Background
The `startTime` that is passed to the activities API is set to the timestamp of the last event:
https://github.com/airbnb/streamalert/blob/a7eaa30516856e17303244fc99cb386be697e69f/stream_alert/apps/_apps/gsuite.py#L127-L132
But the response from the next call to the activities API includes that last event; the `startTime` must be inclusive:
https://developers.google.com/admin-sdk/reports/v1/reference/activities/list
Description
The `startTime` passed to the Google Reports API includes the last event already processed in the previous Lambda execution.
Steps to Reproduce
Observed in the CloudWatch event logs for the GSuiteReports app:
Notice the last timestamp is on the 3rd of August, even though the log is from the 5th of August, and there was 1 log sent to the rule_processing Lambda. I observed this same log at every scheduled invocation where there were no new events.
Desired Change
We could increment `_last_event_timestamp` by one millisecond, or exclude the last seen report id. What would happen if multiple events occurred in the same millisecond?