ClickHouse / clickhouse-connect

Python driver/sqlalchemy/superset connectors
Apache License 2.0

Support SQLAlchemy AggregateFunction type for Superset #104

Open · Valenzione opened this issue 1 year ago

Valenzione commented 1 year ago

Hi there!

We use ClickHouse for our BI (Superset). Sometimes we have data marts that contain AggregateFunction(xxx, yyy) columns. This is useful when we want to compute aggregations at different levels (hourly, daily, etc.) without storing the raw data.

I'd like such columns to show up in Superset datasets. I won't use them as dimensions or filters, but I will create a metric like uniqMerge(aggregation_column) and be able to compute aggregations at any level I wish without storing raw data.

Currently we can't import such datasets into Superset because of a clickhousedb error: AggregateFunction(uniq, UUID) deserialization not supported

My setup: clickhouse-connect==0.5.3 and superset==2.0.1

Part of table DDL (one of many):

CREATE TABLE table_with_aggregates
(
    hour        DateTime,
    platform    LowCardinality(String),
    count       SimpleAggregateFunction(sum, UInt64),
    users       AggregateFunction(uniq, UUID)
)
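
For context, states like these are normally written with the matching -State combinator (often through a materialized view) and only finalized later with -Merge. Below is a minimal sketch of the write side using clickhouse-connect; the raw_events source table, its event_time and user_id columns, and the local connection settings are illustrative assumptions, not part of our actual setup.

import clickhouse_connect

# Assumed local server; adjust host/credentials as needed.
client = clickhouse_connect.get_client(host='localhost')

# Hypothetical raw_events source; -State stores the partial aggregate
# instead of a finished value, so the raw rows don't have to be kept.
client.command(
    """
    INSERT INTO table_with_aggregates
    SELECT
        toStartOfHour(event_time) AS hour,
        platform,
        count() AS count,
        uniqState(user_id) AS users
    FROM raw_events
    GROUP BY hour, platform
    """
)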
genzgd commented 1 year ago

I don't fully understand what you are trying to do. There is nothing in Superset or Python that can work with ClickHouse aggregate function state data, so deserializing that data in Python or Superset would just create an internal C++ structure that can't be displayed or processed (at least not without a LOT of extra work somewhere). For example, the AggregateFunction data structure for uniq is a HyperLogLogWithSmallSetOptimization C++ template. It can contain either a "Small" set of keys or a "Large" HyperLogLog counter, which has its own internal structure.

It shouldn't be a surprise that Superset cannot work with that data, as it has no code for HyperLogLog data and there's no standardized format for it. If you want to work with aggregates at different levels, you will need to use multiple ClickHouse queries with the appropriate *Merge combinators.

Finally, as a practical problem, there is no specification for the binary format of the many ClickHouse AggregateFunctions, so to deserialize them each representation would have to be reverse-engineered from the C++ code.

Valenzione commented 1 year ago

In Superset you can formulate a custom SQL metric with any SQL that ClickHouse supports.

If you want to work with aggregates at different levels you will need to use multiple ClickHouse queries with the appropriate *Merge combinators.

I do want to do exactly that. I want to write a SQL metric in Superset with a -Merge combinator, so that Superset (not me) can formulate multiple queries using that combinator.

For example:

I have a table:

     hour             DateTime,
     platform         LowCardinality(String),
     count_state      SimpleAggregateFunction(sum, UInt64),
     users_state      AggregateFunction(uniq, UUID),

I want to create a dataset in Superset like this:

 hour  DateTime
 platform String
 count sum(count_state)
 users uniqMerge(users_state)

So Superset will be able to formulate many queries:

  1. uniqMerge(users_state) GROUP BY toStartOfMonth(hour)
  2. uniqMerge(users_state) GROUP BY toStartOfDay(hour)
  3. uniqMerge(users_state) GROUP BY toStartOfMonth(hour), platform
  4. etc.
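
Spelled out, those statements would look roughly like the following; this is just a sketch run through clickhouse-connect to make the SQL concrete, assuming the columns above live in the same table_with_aggregates table and that a local server is available.

import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost')  # assumed local server

# The same metric definitions, rolled up at whatever grain Superset picks.
for grain in ('toStartOfMonth(hour)', 'toStartOfDay(hour)'):
    result = client.query(
        f"""
        SELECT
            {grain} AS period,
            platform,
            sum(count_state) AS count,
            uniqMerge(users_state) AS users
        FROM table_with_aggregates
        GROUP BY period, platform
        ORDER BY period
        """
    )
    print(grain, result.result_rows[:3])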

I see what you mean by saying you can't deserialize AggregateFunction columns, but I don't see the need to actually deserialize them for the use case mentioned above. I don't need the AggregateFunction values per se; I need the columns to be visible in the dataset so I can create a SQL metric with a -Merge combinator.

Valenzione commented 1 year ago

It shouldn't be a surprise that Superset cannot work with that data, as it has no code for HyperLogLog data and there's no standardized format for it.

I want to emphasise that Superset doesn't need to work with that data at any point, but it does need to work with the column metadata in order to import the table into a dataset successfully.

Once a column is registered in a dataset, Superset can formulate queries for ClickHouse to handle. So in the end only ClickHouse actually works with the data in AggregateFunction columns; Superset is responsible only for showing those columns in a dataset and for formulating valid queries against them.

genzgd commented 1 year ago

I think I understand what you mean. Any actual query that tries to use those columns will fail, but it might be possible to assign a SQLAlchemy type to AggregateFunction that doesn't throw an exception. It seems fragile to create a type that doesn't actually work, but I'll take a deeper look when I have a chance.
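
For what it's worth, a rough sketch of what such a pass-through type could look like in plain SQLAlchemy; this is not the dialect's actual code, just an illustration of a UserDefinedType that carries the type name through metadata without ever touching the state bytes.

from sqlalchemy.types import UserDefinedType


class AggregateFunction(UserDefinedType):
    """Placeholder: lets the column appear in reflected metadata without
    attempting to deserialize the aggregate state."""

    cache_ok = True

    def __init__(self, *params):
        # e.g. AggregateFunction('uniq', 'UUID')
        self.params = params

    def get_col_spec(self, **kw):
        return f"AggregateFunction({', '.join(str(p) for p in self.params)})"

Anything that actually selects such a column would still fail on the server or come back as opaque bytes, which is exactly the fragility noted above.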

Valenzione commented 1 year ago

Thanks for reconsidering!

It seems fragile to create a type that doesn't actually work

It's indeed fragile: Superset has no real option to hide such columns to prevent users from using them. However, I don't think it's disastrous. Worst case, if somebody tries to use SUM(AggregateFunction) while building a chart, they will receive an exception from ClickHouse.

In my opinion, being able to use AggregateFunction data types and -Merge metrics in Superset is a valuable example of how ClickHouse can be advantageous for BI purposes.