Describe what's wrong
We ran an aggregation on high-cardinality keys and found data skew. We noticed that the hash function behaves differently in ClickHouse and Spark when dealing with nulls.

In Spark:

In ClickHouse:

When the hash keys contain nulls, data skew occurs easily.
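The skew mechanism can be sketched as follows. This is a minimal illustration, not Spark's or ClickHouse's actual hash code; `NULL_HASH` and `NUM_PARTITIONS` are assumed values chosen for the demo (Spark's Murmur3-based `hash()` uses seed 42, which is why nulls collapse to one value there):

```python
# Illustration only: if a hash function maps every NULL key to the same
# constant, all NULL rows land in a single partition, producing skew.

NULL_HASH = 42          # hypothetical constant for NULL keys (assumption)
NUM_PARTITIONS = 8      # assumed partition count for the demo

def partition_for(key):
    """Assign a row to a partition by hashing its key."""
    h = NULL_HASH if key is None else hash(key)
    return h % NUM_PARTITIONS

# 1000 distinct non-null keys plus 1000 NULL keys.
keys = [f"user_{i}" for i in range(1000)] + [None] * 1000

counts = [0] * NUM_PARTITIONS
for k in keys:
    counts[partition_for(k)] += 1

# The partition equal to NULL_HASH % NUM_PARTITIONS receives all 1000
# NULL rows on top of its share of non-null rows; the rest stay near
# 1000 / NUM_PARTITIONS rows each.
print(counts)
```

Running it shows one partition holding roughly nine times the rows of any other, which is the skew pattern observed in the aggregation.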
Does it reproduce on recent release?
The list of releases
Enable crash reporting
How to reproduce
Run an aggregation query on high-cardinality keys where the keys contain nulls.
Expected behavior
Error message and/or stacktrace
Additional context