ClickHouse / ClickHouse

ClickHouse® is a real-time analytics DBMS
https://clickhouse.com
Apache License 2.0

About with expression and with fill expression #7752

Open zhang2014 opened 4 years ago

zhang2014 commented 4 years ago

Use case

WITH INTERVAL 1 hour AS granule, 
toStartOfInterval(toDateTime('2019-01-01 00:00:00'), granule) AS begin_time, 
toStartOfInterval(toDateTime('2019-07-01 00:00:00'), granule) AS end_time
SELECT avg(value_int32) AS value, toStartOfInterval(action_time, granule) AS action_time
FROM iot_device_data
WHERE user_id = 'usr-kLVVBDqd' AND device_id = 'iotd-0572a3b9-7862-4ff4-ac82-1e388e7c25f9'
  AND action_time > begin_time AND action_time < end_time
GROUP BY action_time ORDER BY action_time WITH FILL FROM begin_time TO end_time STEP granule

Additional context

stas-sl commented 4 years ago

As a workaround, it is possible to use WITH FILL STEP dateDiff('second', now(), now() + INTERVAL 1 HOUR) to convert the interval to seconds (not sure if there is a better way), but of course it would be more convenient to use time interval expressions directly.
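
Spelled out against the query from the issue description, that workaround might look roughly like this (an untested sketch; the table and column names are the ones from the original use case, and the WITH-aliased bounds are assumed to be accepted by WITH FILL FROM/TO):

WITH
    toDateTime('2019-01-01 00:00:00') AS begin_time,
    toDateTime('2019-07-01 00:00:00') AS end_time
SELECT avg(value_int32) AS value, toStartOfInterval(action_time, INTERVAL 1 HOUR) AS action_time
FROM iot_device_data
WHERE user_id = 'usr-kLVVBDqd' AND device_id = 'iotd-0572a3b9-7862-4ff4-ac82-1e388e7c25f9'
  AND action_time > begin_time AND action_time < end_time
GROUP BY action_time
ORDER BY action_time WITH FILL FROM begin_time TO end_time
    STEP dateDiff('second', begin_time, begin_time + INTERVAL 1 HOUR)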

PHameete commented 3 years ago

This would be a very useful feature for filling gaps in time series data. My use case looks a lot like @zhang2014's:

WITH 
  INTERVAL 5 minute as granule, 
  toDateTime('2021-02-19 09:25:00') as from_date, 
  toDateTime('2021-02-19 09:50:00') as to_date
SELECT subject_id, toStartOfInterval(start_date, granule) as date, sum(count) as value
FROM series_aggregation_five_mins
WHERE project_id = '1ec4fdbb-6783-4dba-b1ab-c7ac4f0859b8'
  AND metric_id = 'd2e5123e-6a32-4e4f-98fc-3f2518236b97'
  AND start_date >= from_date
  AND start_date < to_date
GROUP BY subject_id, date
ORDER BY subject_id ASC, date ASC WITH FILL FROM from_date TO to_date STEP granule

Note that @stas-sl's workaround can work, but not when you start using STEP sizes of 1 day or more: because of DST changes a day is not always a fixed number of seconds.
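
For example, in a timezone like America/Los_Angeles, the day on which clocks fall back is 25 hours long, so a fixed step of 86400 seconds drifts by an hour:

SELECT dateDiff('second',
                toDateTime('2021-11-07 00:00:00', 'America/Los_Angeles'),
                toDateTime('2021-11-08 00:00:00', 'America/Los_Angeles')) AS day_length

This returns 90000 rather than 86400.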

An additional issue I have is when filling with a double GROUP BY, as in my use case above. The subject_id is filled with zero UUIDs. Instead, I would like the 'current' subject_id to be carried over to the filled rows.

Current response:

┌───────────────────────────subject_id─┬────────────────date─┬─value─┐
│ d5e8bab7-3a1b-4003-8381-07a825428559 │ 2021-02-19 09:30:00 │    89 │
│ 00000000-0000-0000-0000-000000000000 │ 2021-02-19 09:35:00 │     0 │
│ d5e8bab7-3a1b-4003-8381-07a825428559 │ 2021-02-19 09:40:00 │    38 │
└──────────────────────────────────────┴─────────────────────┴───────┘

Desired response:

┌───────────────────────────subject_id─┬────────────────date─┬─value─┐
│ d5e8bab7-3a1b-4003-8381-07a825428559 │ 2021-02-19 09:30:00 │    89 │
│ d5e8bab7-3a1b-4003-8381-07a825428559 │ 2021-02-19 09:35:00 │     0 │
│ d5e8bab7-3a1b-4003-8381-07a825428559 │ 2021-02-19 09:40:00 │    38 │
└──────────────────────────────────────┴─────────────────────┴───────┘

xoelop commented 3 years ago

@PHameete you can probably do that with some array functions; arrayFill is what you need. Here's a practical example: https://blog.tinybird.co/2021/05/18/tips-7-filling-null-values-on-clickhouse/
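
A rough sketch of that approach (untested; it wraps the WITH FILL query from the earlier comment as a subquery, uses the seconds-based STEP workaround, and assumes groupArray sees the rows in the subquery's sort order so that arrayFill can carry the last non-zero subject_id forward over the filled rows):

SELECT
    tupleElement(t, 1) AS subject_id,
    tupleElement(t, 2) AS date,
    tupleElement(t, 3) AS value
FROM
(
    SELECT arrayJoin(arrayZip(
        -- replace the zero UUIDs inserted by WITH FILL with the previous non-zero value
        arrayFill(s -> s != toUUID('00000000-0000-0000-0000-000000000000'), groupArray(subject_id)),
        groupArray(date),
        groupArray(value)
    )) AS t
    FROM
    (
        SELECT subject_id, toStartOfInterval(start_date, INTERVAL 5 minute) AS date, sum(count) AS value
        FROM series_aggregation_five_mins
        WHERE project_id = '1ec4fdbb-6783-4dba-b1ab-c7ac4f0859b8'
          AND metric_id = 'd2e5123e-6a32-4e4f-98fc-3f2518236b97'
          AND start_date >= toDateTime('2021-02-19 09:25:00')
          AND start_date < toDateTime('2021-02-19 09:50:00')
        GROUP BY subject_id, date
        ORDER BY subject_id, date WITH FILL
            FROM toDateTime('2021-02-19 09:25:00')
            TO toDateTime('2021-02-19 09:50:00')
            STEP 300
    )
)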

PHameete commented 3 years ago

Cheers for the pointer, I'll give that a try! My current solution is to generate zeroes beforehand for the cases where this is relevant (sums and counts) and then use UNION ALL to merge with the query without fills.

The original remark still holds: using time intervals as the step size would be very valuable, particularly for days and larger intervals, since the workaround above does not work for those due to DST.

den-crane commented 2 years ago

DST, yeah, the step is the problem: one day has 23 hours :(, another has 25.

select toDateTime('2021-11-01 00:00:00', 'America/Los_Angeles') + interval number day as d from numbers(10)
order by d
with fill from toDateTime('2021-11-01 00:00:00', 'America/Los_Angeles') to toDateTime('2021-11-12 00:00:00', 'America/Los_Angeles') step 86400;

┌───────────────────d─┐
│ 2021-11-01 00:00:00 │
│ 2021-11-02 00:00:00 │
│ 2021-11-03 00:00:00 │
│ 2021-11-04 00:00:00 │
│ 2021-11-05 00:00:00 │
│ 2021-11-06 00:00:00 │
│ 2021-11-07 00:00:00 │
│ 2021-11-07 23:00:00 │
│ 2021-11-08 00:00:00 │
│ 2021-11-08 23:00:00 │
│ 2021-11-09 00:00:00 │
│ 2021-11-09 23:00:00 │
│ 2021-11-10 00:00:00 │
└─────────────────────┘
┌───────────────────d─┐
│ 2021-11-10 23:00:00 │
│ 2021-11-11 23:00:00 │
└─────────────────────┘
PHameete commented 2 years ago

Hi @den-crane, thanks for the reply. I think supporting day or larger periods as the step for fills would be a very useful feature because it saves a bit of effort and query complexity.

How I do this now is as follows. For example, if I want 60 buckets of 3 days each, with the first bucket starting at '2019-01-01 00:00:00':

SELECT arrayJoin(arrayMap(x -> toTimeZone(date_add(DAY, x * 3, toDateTime('2019-01-01 00:00:00')), 'UTC'), range(60))) as bucket_date, 0 as value

This works across DST changes as well because the date_add function handles them properly.
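
For example, adding a day across the America/Los_Angeles fall-back transition keeps the wall-clock time, even though 25 real hours pass:

SELECT date_add(DAY, 1, toDateTime('2021-11-06 12:00:00', 'America/Los_Angeles')) AS next_day

which gives 2021-11-07 12:00:00.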

The result of this zero-generating SELECT is then UNION ALL-ed with the result of my normal query and aggregated (for example, summed if I am getting counts with zero fills).
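
Put together, that pattern looks roughly like this (a sketch; the events table, the event_time column and the count() aggregation are hypothetical stand-ins for the normal query, and the bucketing expression in the second branch has to produce the same boundaries as the generated zero rows):

SELECT bucket_date, sum(value) AS value
FROM
(
    -- zero rows: one per 3-day bucket, generated as above
    SELECT
        arrayJoin(arrayMap(x -> toTimeZone(date_add(DAY, x * 3, toDateTime('2019-01-01 00:00:00')), 'UTC'), range(60))) AS bucket_date,
        0 AS value

    UNION ALL

    -- the normal aggregation; in practice the real query goes here
    SELECT
        toStartOfInterval(event_time, INTERVAL 3 day) AS bucket_date,
        count() AS value
    FROM events
    GROUP BY bucket_date
)
GROUP BY bucket_date
ORDER BY bucket_date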