robert-carroll opened 4 years ago
Can this be accomplished via an existing embulk filter plugin? I'm not sure what plugins anyone is using, but I've got a couple I've used for importing some pretty messy data. I'm about 90% sure I've got one timestamp field that's partially parsed from character data. I'll take a whack at it. I'm not sure that's really a viable option for everyone, but I can document it as an option at least -- assuming it actually works like I remember.
I've played around with this a little bit and I'm not sure I really understand what the problem is.
My current config for the `requests` data is (MySQL-centric, irrelevant bits removed):
```yaml
in:
  parser:
    columns:
      - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%L'}
out:
  default_timezone: 'America/Chicago'
  column_options:
    timestamp: {type: 'DATETIME NULL', value_type: timestamp, timestamp_format: '%Y-%m-%d %H:%M:%S.%L'}
```
This gets me the timestamp in my local time. I haven't cross-checked against daylight saving time, and I do have some other plugins installed. I don't think anything is mutating the workflow, but I haven't tried running a vanilla instance either.
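As a sanity check on what that config is doing, the UTC-to-America/Chicago shift can be reproduced with Python's `zoneinfo`; the timestamp value here is made up and is only an illustration of the offset, not part of any config:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Hypothetical UTC value of the kind Embulk would parse from the flat file
utc_ts = datetime(2019, 12, 10, 3, 30, 0, tzinfo=timezone.utc)

# What out.default_timezone: 'America/Chicago' effectively does on write
local_ts = utc_ts.astimezone(ZoneInfo("America/Chicago"))
print(local_ts)  # 2019-12-09 21:30:00-06:00
```

Note that the conversion can move the row onto a different calendar day, which is exactly where the date-only columns fall apart.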
The problem only exists in the `timestamp_day`, `timestamp_month`, and `timestamp_year` columns, because there's no time component against which to evaluate a timezone difference. Therefore, if you run a `SELECT ... WHERE timestamp_day = '2019-12-13'`, the results will be off by the offset between local time and UTC.
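A minimal sketch of that mismatch (hypothetical values, purely illustrative): a row logged shortly after midnight UTC belongs to the previous day in local time, but the date-only column has no way to reflect that.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Chicago")

# A request logged at 2019-12-13 02:00 UTC falls on 2019-12-12 local time,
# but the date-only timestamp_day column still reads 2019-12-13.
utc_ts = datetime(2019, 12, 13, 2, 0, tzinfo=timezone.utc)
print(utc_ts.date())                  # 2019-12-13 <- what timestamp_day stores
print(utc_ts.astimezone(tz).date())   # 2019-12-12 <- the actual local day
```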
I have some time today, I'll play with this one, https://github.com/sonots/embulk-filter-timestamp_format
That's actually one of the plugins I have (for a different problem). It's worked pretty well for handling the edge cases I've encountered.
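For reference, an untested sketch of what such a filter stanza might look like, based on that plugin's README; the column names, formats, and timezones here are assumptions and would need to match the actual schema:

```yaml
filters:
  - type: timestamp_format
    default_from_timezone: "UTC"
    default_to_timezone: "America/Chicago"
    columns:
      # names/formats are assumptions -- adjust to the real columns
      - {name: timestamp_day, type: string, from_format: ["%Y-%m-%d"], to_format: "%Y-%m-%d"}
```

Whether shifting a bare date string this way produces a meaningful value is exactly the open question in this thread, so treat this as a starting point, not a fix.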
Here's a more detailed representation of the problem.
`timestamp_day`

```sql
SELECT timestamp_day
     , MIN(timestamp) AS [first]
     , MAX(timestamp) AS [last]
     , COUNT(*) AS total
FROM requests
GROUP BY timestamp_day
ORDER BY [last] DESC
```
| timestamp_day | first | last | total |
|---|---|---|---|
| 2019-12-09 | 2019-12-09 16:00:00.0090000 | 2019-12-10 15:59:59.9070000 | 3467752 |
Resulting in 3,467,752 rows returned for 12/9
`CONVERT(date, timestamp)`

```sql
SELECT CONVERT(date, timestamp)
     , MIN(timestamp) AS [first]
     , MAX(timestamp) AS [last]
     , COUNT(*) AS total
FROM requests
GROUP BY CONVERT(date, timestamp)
ORDER BY [last] DESC
```
While timestamp_day indicates 12/9, with UTC to America/Los_Angeles conversion, many of those rows are for 12/10.
| CONVERT(date, timestamp) | first | last | total |
|---|---|---|---|
| 2019-12-10 | 2019-12-10 00:00:00.2560000 | 2019-12-10 15:59:59.9070000 | 2674451 |
| 2019-12-09 | 2019-12-09 16:00:00.0090000 | 2019-12-09 23:59:59.8540000 | 793301 |
| | | | 3467752 |
Resulting in 793,301 rows returned for 12/9 from this one extract, but this is still not complete.
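As a quick arithmetic check, the two local dates carved out of the single `timestamp_day` bucket add back up to that bucket's original total:

```python
# Row counts from the CONVERT(date, timestamp) breakdown above
split = {"2019-12-10": 2674451, "2019-12-09": 793301}

# Together they account for the full timestamp_day = '2019-12-09' bucket
print(sum(split.values()))  # 3467752
```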
| CONVERT(date, timestamp) | first | last | total |
|---|---|---|---|
| 2019-12-10 | 2019-12-10 00:00:00.2560000 | 2019-12-10 15:59:59.9070000 | 2674451 |
| 2019-12-09 | 2019-12-09 00:00:00.0150000 | 2019-12-09 23:59:59.8540000 | 3377973 |
| 2019-12-08 | 2019-12-08 16:00:00.0140000 | 2019-12-08 23:59:59.9990000 | 613354 |
With a complete dataset, there are actually 3,377,973 rows returned for 12/9
This field can't be converted from UTC to a local default timezone because it carries no time component. The field is stored as VARCHAR, which makes some queries using it problematic. You can store the field using the Embulk output SQL type 'date', but no conversion will take place.
The best solution would be to load the table into staging and convert the value locally in production, or just use the `timestamp` column to derive a new value.
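A sketch of the second option, deriving the local day from the full timestamp rather than trusting `timestamp_day`; the helper name, sample value, and timezone here are assumptions for illustration:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_day(utc_ts: datetime, tz: str = "America/Los_Angeles") -> str:
    """Hypothetical helper: derive the local calendar day from the UTC
    timestamp instead of using the shipped timestamp_day value."""
    return (utc_ts.replace(tzinfo=timezone.utc)
                  .astimezone(ZoneInfo(tz))
                  .date()
                  .isoformat())

# Midnight UTC on 12/10 is still 12/9 in Los Angeles (UTC-8 in December)
print(local_day(datetime(2019, 12, 10, 0, 0, 15)))  # 2019-12-09
```

In SQL the same derivation would live in the staging-to-production load, grouping on a date computed from `timestamp` as in the `CONVERT(date, timestamp)` queries above.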
Example: https://github.com/ccsd/canvas-data-embulk-configs/issues/4#issuecomment-565608140