Kyligence / ClickHouse

ClickHouse® is a free analytics DBMS for big data
https://clickhouse.com
Apache License 2.0
14 stars 17 forks source link

There is data skew in hash shuffle #405

Closed lgbo-ustc closed 1 year ago

lgbo-ustc commented 1 year ago

You have to provide the following information whenever possible.

Describe what's wrong

A clear and concise description of what works not as it is supposed to.

A link to reproducer in https://fiddle.clickhouse.com/.

We run a aggregation on high cardinality keys, and found data skew. We notice that the hash function's behavior is different in Clickhouse and Spark in dealing with nulls. In spark

select hash(null);
+-------------+
| hash(NULL)  |
+-------------+
| 42          |
+-------------+
1 row selected (0.089 seconds)

select hash(1,null);
+----------------+
| hash(1, NULL)  |
+----------------+
| -559580957     |
+----------------+
1 row selected (0.129 seconds)

In clickhouse

SELECT cityHash64(1, NULL)

┌─cityHash64(1, NULL)─┐
│ ᴺᵁᴸᴸ                │
└─────────────────────┘

When the hash keys have nulls, it will cause data skew easly.

Does it reproduce on recent release?

The list of releases

Enable crash reporting

If possible, change "enabled" to true in "send_crash_reports" section in config.xml:

<send_crash_reports>
        <!-- Changing <enabled> to true allows sending crash reports to -->
        <!-- the ClickHouse core developers team via Sentry https://sentry.io -->
        <enabled>false</enabled>

How to reproduce

Run a aggregation query on on high cardinality keys, and the keys has nulls.

Expected behavior

A clear and concise description of what you expected to happen.

Error message and/or stacktrace

If applicable, add screenshots to help explain your problem.

Additional context

Add any other context about the problem here.