fullstorydev / hauser

Service for moving your Fullstory export files to a data warehouse
MIT License

Convert to UTC before saving sync point to column with data type timestamp without timezone #69

Closed butanian closed 4 years ago

butanian commented 4 years ago

This PR converts the bundle end time to UTC before persisting it to the sync table on Redshift. We have to do this because the type of the column on Redshift is `timestamp` (without time zone).

High-level flow in Hauser:

1. Fetch the latest `bundleEndTime` from the sync table.
2. Use the `bundleEndTime` as the `start` param when calling the export/list endpoint. This returns all the bundles starting from the bundle that contains the provided start time.
3. Iterate over all the bundles returned from export/list.

The Problem: Since the column on Redshift has no time zone, when we read it back we interpret the value in the host's local time zone. For customers in the US, this shifts the time back by a few hours. When that shifted time was sent as the `start` param to the export/list endpoint, it returned the last available bundle over and over again.

By converting to UTC before persisting, we interpret the `bundleEndTime` column in the sync table accurately and correctly determine that there are no more bundles to process, instead of repeatedly querying the last bundle until a new one becomes available.

I should also note that no duplicate records show up in the export. We have a mechanism in place that protects us from cases where the export table is updated but the sync table fails to update, and that mechanism has been preventing duplicate writes in this situation as well.

EDIT (2019-09-06): I may have spoken too soon about duplicate records. If someone's bundle duration is greater than the size of the time zone shift, they will get duplicate records, because the RemoveOrphanedRecords fn will only remove a subset of the previous bundle before the entire bundle is re-processed and inserted again.

jameremo commented 4 years ago

Is the problem on read or write of timestamp values?

I'm not a Redshift expert, so for reference/shared context I've just finished studying this page and a few of the other pages to which it links: https://docs.aws.amazon.com/redshift/latest/dg/r_Datetime_types.html

Here is my understanding after reading the above and rereading the code:

  1. We're using TIMESTAMP instead of TIMESTAMPTZ as the Redshift type for all golang time.Time fields in the sync table. The Redshift convention for TIMESTAMP columns is that all column values are in UTC.
  2. The postgres SQL driver, following this convention, is doing the timezone conversion correctly when we read data from Redshift.
  3. When we write dates into Redshift, we format them as string literals, using the host system's timezone. Thus we have a string formatted for the local timezone that is interpreted as UTC by Redshift, and here is where the problem occurs.

Is that right? And if so, should we not also do the UTC conversion on the bundle processed time in the SaveSyncPoints function? And are we encountering the same problem with the actual data that we're writing, or does the COPY command along with the timezones in the formatted CSV values keep us from hitting this problem?

butanian commented 4 years ago

@jameremo your assessment of the situation is spot on! Couldn't have phrased it any better!

I didn't update the bundle's processed time, since Hauser doesn't rely on it. I thought having the timestamp match the user's time zone made sense if the column is only ever inspected by real humans. Having said that, it is weird to have different time zones for timestamps in the same table. I'm not married to the idea and would happily make the change.

butanian commented 4 years ago

Also, I'm not sure how the COPY command processes datetimes before writing to a timestamp-without-time-zone column. Will investigate, and address in a separate PR if something needs to be done.

butanian commented 4 years ago

@jameremo converted processed time to UTC as well. PTAL