ccsd / canvas-data-embulk-configs

YAML configs for importing Canvas Data with Embulk
0 stars 1 forks source link

requests.timestamp_day cannot be converted to default timezone #4

Open robert-carroll opened 4 years ago

robert-carroll commented 4 years ago

This field can't be converted to a local default timezone from UTC without a timestamp. The field is stored in VARCHAR which makes some queries using this field problematic. You can store the field using Embulk Output SQL type 'date', but there will be no conversion.

Best solution would be to load the table to staging and convert that value in production locally... or just use timestamp and convert to create a new value from timestamp.

Example: https://github.com/ccsd/canvas-data-embulk-configs/issues/4#issuecomment-565608140

lafent commented 4 years ago

Can this be accomplished via an existing embulk filter plugin? I'm not sure what plugins anyone is using, but I've got a couple I've used for importing some pretty messy data. I'm about 90% sure I've got one timestamp field that's partially parsed from character data. I'll take a whack at it. I'm not sure that's really a viable option for everyone, but I can document it as an option at least -- assuming it actually works like I remember.

lafent commented 4 years ago

I've played around with this a little bit and I'm not sure I really understand what the problem is.

My current config for the requests data is (MySQL-centric, irrelevant bits removed)

in:
    parser:
        columns:
        - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%L'}
out:
    default_timezone: 'America/Chicago'
    column_options:
        timestamp: {type: 'DATETIME NULL', value_type: timestamp, timestamp_format: '%Y-%m-%d %H:%M:%S.%L'}

This gets me the timestamp in my local time. I haven't cross-checked vs. daylight savings and I do have some other plugins installed. I don't think anything is mutating the workflow, but I haven't tried to run a vanilla instance either.

robert-carroll commented 4 years ago

The problem only exists in the timestamp_day, timestamp_month, and timestamp_year columns, as there's no time to evaluate a timezone difference. Therefore, if you run a SELECT ... WHERE timestamp_day = '2019-12-13', the results would be off by the difference of Local to UTC.

robert-carroll commented 4 years ago

I have some time today, I'll play with this one, https://github.com/sonots/embulk-filter-timestamp_format

lafent commented 4 years ago

That's actually one of the plugins I have (for a different problem). It's worked pretty well for handle the edge cases I've encountered.

On Fri, Dec 13, 2019 at 2:22 PM Robert Carroll notifications@github.com wrote:

I have some time today, I'll play with this one, https://github.com/sonots/embulk-filter-timestamp_format

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ccsd/canvas-data-embulk-configs/issues/4?email_source=notifications&email_token=AAGE35CEEWVYEMSSMK3EFATQYPVGRA5CNFSM4JK5WEHKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEG3EX7Y#issuecomment-565595135, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGE35H4INTU6MR4Q7E7LVTQYPVGRANCNFSM4JK5WEHA .

robert-carroll commented 4 years ago

Here's a better representation of the problem, for detail, not for explanation.

Importing a single days request files, 1 sequence from cd:cli fetch command

Using timestamp_day

SELECT
    timestamp_day
    , MIN(timestamp) AS [first]
    , MAX(timestamp) AS [last]
    , COUNT(*) total
FROM requests
GROUP BY timestamp_day
ORDER BY [last] DESC
timestamp_day first last total
2019-12-09 2019-12-09 16:00:00.0090000 2019-12-10 15:59:59.9070000 3467752

Resulting in 3,467,752 rows returned for 12/9

Using CONVERT(date, timestamp)

SELECT
    CONVERT(date, timestamp)
    , MIN(timestamp) AS [first]
    , MAX(timestamp) AS [last]
    , COUNT(*) total
FROM requests
GROUP BY CONVERT(date, timestamp)
ORDER BY [last] DESC

While timestamp_day indicates 12/9, with UTC to America/Los_Angeles conversion, many of those rows are for 12/10.

CONVERT(date, timestamp) first last total
2019-12-10 2019-12-10 00:00:00.2560000 2019-12-10 15:59:59.9070000 2674451
2019-12-09 2019-12-09 16:00:00.0090000 2019-12-09 23:59:59.8540000 793301
3467752

Resulting in 2,674,451 rows returned for 12/9, but this is still not complete.

Driving it home, importing 2 sequences of requests*.gz

CONVERT(date, timestamp) first last total
2019-12-10 2019-12-10 00:00:00.2560000 2019-12-10 15:59:59.9070000 2674451
2019-12-09 2019-12-09 00:00:00.0150000 2019-12-09 23:59:59.8540000 3377973
2019-12-08 2019-12-08 16:00:00.0140000 2019-12-08 23:59:59.9990000 613354

With a complete dataset, there are actually 3,377,973 rows returned for 12/9

robert-carroll commented 4 years ago

https://github.com/treasure-data/embulk-filter-add_time/issues/11