apache / superset

Apache Superset is a Data Visualization and Data Exploration Platform
https://superset.apache.org/
Apache License 2.0
62.8k stars 13.87k forks source link

No result when range is less than a day with Clickhouse #19154

Closed vincentbernat closed 2 months ago

vincentbernat commented 2 years ago

Often, Clickhouse tables have both a Date and a DateTime column. The first one is to help data partitioning. The Clickhouse driver in Superset is correctly including both of them when specifying time ranges. However, when the requested length is less than a day, it generates an SQL request which cannot have a result as the start day is inclusive and the end day is exclusive.

How to reproduce the bug

  1. Here is a schema example:
CREATE TABLE IF NOT EXISTS flows
(
    Date Date,
    TimeReceived DateTime,
    TimeFlowStart DateTime,

    SrcAS UInt32,
    DstAS UInt32,
) ENGINE = MergeTree()
    PARTITION BY Date
    ORDER BY TimeReceived;
  1. Select TimeReceived as a time column and a time range of 6 hours.

Expected results

I should get some data.

Actual results

No data available.

Screenshots

image

Here is the generated SQL request:

superset_1              | 2022-03-15 14:27:07,476:DEBUG:clickhouse_driver.connection:Query: SELECT toStartOfMinute(toDateTime("TimeReceived")) AS __timestamp,
superset_1              |        "SrcAS" AS "SrcAS",
superset_1              |        SUM(Bytes*SamplingRate) AS "Bytes"
superset_1              | FROM "default".flows
superset_1              | WHERE "Date" >= toDate('2022-03-15')
superset_1              |   AND "Date" < toDate('2022-03-15')
superset_1              |   AND "TimeReceived" >= toDateTime('2022-03-15 08:27:07')
superset_1              |   AND "TimeReceived" < toDateTime('2022-03-15 14:27:07')
superset_1              | GROUP BY "SrcAS",
superset_1              |          toStartOfMinute(toDateTime("TimeReceived"))
superset_1              | ORDER BY "Bytes" DESC
superset_1              | LIMIT 10000

Environment

Checklist

Make sure to follow these steps before submitting your issue - thank you!

rusackas commented 1 year ago

@shreesham another clickhouse issue that could use some validation in current Superset if we're able.

pasqualtroncone commented 1 year ago

@rusackas looks like the same thing I just posted here: https://github.com/apache/superset/issues/19404#issuecomment-1700171750.

Look at the query @vincentbernat posted

SELECT toStartOfMinute(toDateTime("TimeReceived")) AS __timestamp,
       "SrcAS" AS "SrcAS",
       SUM(Bytes*SamplingRate) AS "Bytes"
FROM "default".flows
WHERE "Date" >= toDate('2022-03-15')
  AND "Date" < toDate('2022-03-15')
  AND "TimeReceived" >= toDateTime('2022-03-15 08:27:07')
  AND "TimeReceived" < toDateTime('2022-03-15 14:27:07')
GROUP BY "SrcAS",
         toStartOfMinute(toDateTime("TimeReceived"))
ORDER BY "Bytes" DESC
LIMIT 10000

It has a double filter by dates which is likely to result in no matches being returned.

I bet Datefield is the default datetime column on the Dataset and TimeReceived is the TIME COLUMN of the chart. @vincentbernat can you confirm ?

vincentbernat commented 1 year ago

Yes. In the meantime, I was told it was possible to partition with an expression, so no need to have a Date column. This should also fix the issue here. Nonetheless, people keeping a separate column for that will still run into this issue.

rusackas commented 8 months ago

If there's a workaround, is there still a real "bug" here, or is there a proposal of how to resolve this? Maybe there should be an addition to the documentation on the Clickhouse page if this is indeed a Clickhouse issue?

rusackas commented 2 months ago

Sounds like there's no activity here, so I'll close this as stale. Holler if it ought to be reopened, though anyone is able to improve docs or open PRs to dial things in with or without the issue being open :)