ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0
37.68k stars 6.91k forks source link

improvements for timezone support #1225

Closed silviucpp closed 7 years ago

silviucpp commented 7 years ago

Hello,

Would be nice if all datetime functions which now supports as second parameter a string representing the timezone will support also an integer that specify the timezone offset in seconds.

so Instead specifying toRelativeMonthNum(date, 'Europe/Bucharest') I can specify toRelativeMonthNum(date, 10800) . This will make more easy working with date times in ISO 8601 format where you receive dates like: 2017-07-01T00:00:00+03:00

What do you think ?

silviucpp commented 7 years ago

Beside my previous questions there is any challenge to support fractional offsets like Asia/Kolkata which is UTC+05:30 ? Or just something that was missed ?

alexey-milovidov commented 7 years ago

Would be nice if all datetime functions which now supports as second parameter a string representing the timezone will support also an integer that specify the timezone offset in seconds.

This is reasonable, and it seems relatively easy to implement.

Beside my previous questions there is any challenge to support fractional offsets like Asia/Kolkata which is UTC+05:30 ? Or just something that was missed ?

There is some challenge. Currently the code in DateLUT use the assumption that UTC offset is whole number of hours. If we drop this assumption, the code will become somewhat slower.

But the proper solution is possible: we need to provide several different function specializations in DateLUT - one for timezones with whole offsets and one with support of fractional timezones.

Then we need to dispatch for proper specialization, but we need to take care to do dispatch outside of inner loops.

silviucpp commented 7 years ago

Hy Alexey can this become a priority for you ? I checked a bit your code and I found that:

I'm ok to contribute to the code if I have some clear guides from your side ! Silviu

alexey-milovidov commented 7 years ago

for each timezone at this point you create an instance of that timezone (as a singleton of DateLUT) which means if we add seconds as second param we will have lot of singletons created for each offset (timezone names are limited - number of seconds somehow are not) you don't consider this as a problem ?

I think, we can use "shortcut" implementation for integer timezone parameter: we add offset in seconds to timestamp, then do all transformations in UTC DateLUT, then subtract offsets back if needed. This is possible because when you specify concrete UTC offset instead of time zone, you don't have to deal with daylight saving time and historical timezone changes.

I'm ok to contribute to the code if I have some clear guides from your side!

This will be strongly appreciated!

filimonov commented 7 years ago

for each timezone at this point you create an instance of that timezone (as a singleton of DateLUT) which means if we add seconds as second param we will have lot of singletons created for each offset (timezone names are limited - number of seconds somehow are not) you don't consider this as a problem ?

I think, we can use "shortcut" implementation for integer timezone parameter: we add offset in seconds to timestamp, then do all transformations in UTC DateLUT, then subtract offsets back if needed. This is possible because when you specify concrete UTC offset instead of time zone, you don't have to deal with daylight saving time and historical timezone changes.

Actually the same can be done without any changes in code:

:) select timezone();

SELECT timezone()

┌─timezone()────┐
│ Europe/Berlin │
└───────────────┘

1 rows in set. Elapsed: 0.001 sec. 

:) select toString( now() );

SELECT toString(now())

┌─toString(now())─────┐
│ 2017-09-25 17:08:08 │
└─────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

:) select toString( now() + 7200, 'UTC' );

SELECT toString(now() + 7200, 'UTC')

┌─toString(plus(now(), 7200), \'UTC\')─┐
│ 2017-09-25 17:08:15                  │
└──────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 
silviucpp commented 7 years ago

I think my observations or mostly for toRelative[*]Num functions where your example is no longer working.

filimonov commented 7 years ago

Sorry, i don't get it. Can you give some example?

silviucpp commented 7 years ago

Basically your example just convert a date into another timezone. This is simple to achieve. But in case you want to group your results let's say daily you have to use toRelativeDayNum(date). In this case if you also want to take the timezone into account you can use toRelativeMonthNum(date, timezone) (A day into a region is not the same as the one in another region: for example in Romania they day ends at 21:00 UTC).

Problem is:

  1. clickhouse is not working with fractional offsets like Asia/Kolkata
  2. toRelativeMonthNum only supports timezone as a second param. If you have ISO8601 dates the timezone its specified as an offset so will be easy to have toRelativeMonthNum to accept instead of a named timezone, a second param representing the offset in seconds.

Silviu

filimonov commented 7 years ago

Basically your example just convert a date into another timezone.

Yes, and by introducing the fixed seconds offset you're getting exactly the same effect. But without extra knowlegde about DST, historical changes of local laws connected to local time.

DateTime in clickhouse, AFAIK, is just a regular unix timestamp. I.e. number of seconds since begining of Unix Epoch (1st January of 1970 in UTC). You can make any arithmetic operatopns with that number of seconds before conversion and you will get exactly the same result as you expect.

For example: 1000000000 seconds since 1970-01-01T00:00:01+00:00 is 2001-09-09T01:46:40+00:00; So in UTC is was 9th of September 1:46:40am. In Moscow timestamp 1000000000 is 09.09.2001 05:46:40. In New-York timestamp 1000000000 means 08.09.2001 21:46:40.

You can use that tz data with named timezones to make that conversion. OR: you can just make simple arithmetic operation with timestamp and always convert it to UTC.

So 1000000000 + 4*3600 (4 hours - offset of Europe/Moscow) = 1000014400 If you convert it into human readable date (UTC timezone) you will get: 2001-09-09T05:46:40+00:00

And 1000000000 - 4*3600 (-4 hours - offset of America/New York) = 999985600 Convert it to the UTC time and you will get 2001-09-08T21:46:40+00:00

P.S. Completely agree that support of fractional offsets like Asia/Kolkata (or even funnier Asia/Kathmandu) is needed. What's how i find that issue :)

silviucpp commented 7 years ago

Ok,

Let's suppose the following use case. All data are stored in a table in UTC. Then later you want to do a count grouped daily between two dates.

What you do is :

select count(*) as count, toRelativeDayNum(date) - toRelativeDayNum(X) from table when date >= X and date =< Y

In this moment your result will be grouped daily but UTC based:

1970-01-02T00:00:00 UTC 1970-01-03T00:00:00 UTC 1970-01-04T00:00:00 UTC ..

Let's suppose you want the same result but in Europe/Bucharest TZ (each day ends at 1970-01-0XT21:00:00 UTC) then you need to use :

select count(*) as count, toRelativeDayNum(date, "Europe/Bucharest") - toRelativeDayNum(X, "Europe/Bucharest") from table when date >= X and date =< Y

In this moment your result will be grouped daily but based on the new TZ you specified:

1970-01-02T21:00:00 UTC 1970-01-03T21:00:00 UTC 1970-01-03T21:00:00 UTC

toRelativeDayNum works nice but it's not so flexible as time knows about only named timezones. Supporting offset will bring more flexibility

filimonov commented 7 years ago
:) select toRelativeDayNum(toDateTime(1000000000), 'America/New_York');

SELECT toRelativeDayNum(toDateTime(1000000000), 'America/New_York')

┌─toRelativeDayNum(toDateTime(1000000000), \'America/New_York\')─┐
│                                                          11573 │
└────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.004 sec.

:) select toRelativeDayNum(toDateTime(1000000000)-4*3600, 'UTC');

SELECT toRelativeDayNum(toDateTime(1000000000) - (4 * 3600), 'UTC')

┌─toRelativeDayNum(minus(toDateTime(1000000000), multiply(4, 3600)), \'UTC\')─┐
│                                                                       11573 │
└─────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec.

:) select toRelativeDayNum(toDateTime(1000000000), 'Europe/Moscow');

SELECT toRelativeDayNum(toDateTime(1000000000), 'Europe/Moscow')

┌─toRelativeDayNum(toDateTime(1000000000), \'Europe/Moscow\')─┐
│                                                       11574 │
└─────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec.

:) select toRelativeDayNum(toDateTime(1000000000) + 4*3600, 'UTC');

SELECT toRelativeDayNum(toDateTime(1000000000) + (4 * 3600), 'UTC')

┌─toRelativeDayNum(plus(toDateTime(1000000000), multiply(4, 3600)), \'UTC\')─┐
│                                                                      11574 │
└────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec.
filimonov commented 7 years ago

In your sample : try this (if your default tz is UTC)

select count(*) as count, toRelativeDayNum(date+3*3600) - toRelativeDayNum(X+3*3600) from table
when date >= X and date =< Y
silviucpp commented 7 years ago

@filimonov yes very interesting your suggestion :) this way you can bypass also the issue with fractional tz that's missing. do I miss something ?

filimonov commented 7 years ago

It will work correctly (also for fractional offsets) if you know exactly the offset, which is not a constant value in time. :/

So for example for central europenian time (CET) you should use +2 hours offset in the winter, and +1 in the summer, and you should know exactly when to switch. If you also have a historical data you also need to know when how it was in past, etc. That's why all that's named timezones were introduced, and tzdata database is maintained.

silviucpp commented 7 years ago

Ohh right..

filimonov commented 7 years ago

Anyway - i just wanted to say that there is a quite simple (may be a bit tricky) workaround. And fixed timezone offset is connected with certain risks (user should be aware that offset can change). So not sure if that feature is really useful.

About fractional timezones - I've created separate ticket: #1298

silviucpp commented 7 years ago

This seems fixed in last master in commit 4236bcbfa2365faf569b12cde64e6bf62d17517a