apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.1k stars 1.15k forks

Improve performance of COUNT (distinct x) for dictionary columns #258

Open alamb opened 3 years ago

alamb commented 3 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have large amounts of low cardinality string data (for example, 200 M rows, but only 20 distinct values). DictionaryArrays are very good for such data as they are space efficient.
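As a rough back-of-the-envelope illustration (hypothetical sizes, not measurements from the issue), dictionary encoding stores a small integer key per row plus one shared copy of each distinct value, which is why it wins so decisively for low cardinality data:

```rust
// Rough comparison of plain vs dictionary-encoded storage for
// 200M rows with 20 distinct strings. The 16-byte average string
// length is an assumption for illustration only.
fn main() {
    let rows: u64 = 200_000_000;
    let distinct: u64 = 20;
    let avg_len: u64 = 16; // assumed average string length in bytes

    // Plain string column: every row stores its own bytes (ignoring offsets).
    let plain_bytes = rows * avg_len;

    // Dictionary-encoded column: a 4-byte Int32 key per row,
    // plus the distinct values stored once.
    let dict_bytes = rows * 4 + distinct * avg_len;

    println!("plain: {} MB", plain_bytes / 1_000_000); // 3200 MB
    println!("dict:  {} MB", dict_bytes / 1_000_000);  // 800 MB
    assert!(dict_bytes < plain_bytes);
}
```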

https://github.com/apache/arrow-datafusion/pull/256 adds basic query support for distinct dictionary columns, but it is not a very computationally efficient implementation. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set to deduplicate them again. That is a lot of extra hashing work.

Describe the solution you'd like

It would likely be much more efficient (especially for arrays with a small number of distinct values in their dictionary) to look at the dictionary's values directly, first checking that each entry in the dictionary was actually used.
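The proposed approach can be sketched in plain Rust. This uses a simplified stand-in for `DictionaryArray` rather than the real arrow-rs types, so the struct and function names here are hypothetical: mark which dictionary keys are referenced, then count only the marked values, instead of inserting every row into a hash set.

```rust
// Simplified stand-in for a DictionaryArray: keys[i] indexes into values.
// Illustrative only; a real implementation would use arrow-rs types.
struct DictArray {
    keys: Vec<usize>,    // one entry per row
    values: Vec<String>, // the (small) dictionary
}

// Count distinct values by marking which dictionary entries are actually
// referenced by some row, avoiding a per-row hash-set insert.
fn count_distinct(arr: &DictArray) -> usize {
    let mut used = vec![false; arr.values.len()];
    for &k in &arr.keys {
        used[k] = true; // O(1) per row, no hashing
    }
    // If the dictionary itself could contain duplicate values, a final
    // (small) deduplication pass over the used values would still be needed.
    used.iter().filter(|&&u| u).count()
}

fn main() {
    let arr = DictArray {
        keys: vec![0, 1, 0, 2, 1, 0],
        values: vec!["a".into(), "b".into(), "c".into(), "d".into()],
    };
    // "d" is in the dictionary but never referenced by any row,
    // so it must not be counted.
    assert_eq!(count_distinct(&arr), 3);
    println!("distinct = {}", count_distinct(&arr));
}
```

The hashing work becomes proportional to zero and the marking pass is a cheap indexed store per row, which is the win the issue describes for 200M rows over 20 values.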

jaylmiller commented 1 year ago

I've made a little PR for this. But I'm not sure about how to go about measuring the performance improvements... @alamb do you know of any existing benches in the codebase that would measure this?

Alternatively, I guess I could add a new bench, but I remember you saying you were wary of adding too many new benches 😅

waynexia commented 1 year ago

To my knowledge, the current TPC-H and TPC-DS benchmarks do not include this scenario. While the data may be dictionary encoded in storage format, they are expanded (IIRC) to normal arrays after being read into memory.

alamb commented 1 year ago

I agree with @waynexia that this scenario is not covered by any existing datafusion benchmarks I know of.

Clickbench has several queries that include count distinct (see for example https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1432070491) but I am not sure if the input is dictionary encoded.

> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';

> SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;

However, I think with #5166 you could now create a dictionary encoded version with a command like the following (untested, as I don't have the data downloaded -- data is here https://github.com/ClickHouse/ClickBench/tree/main#data-loading):

CREATE TABLE hits_dictionary AS
SELECT
  arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') AS "RegionID",
  "ResolutionWidth",
  "UserID"
FROM hits;

mingmwang commented 1 year ago

I have one question about the parquet stored data. When the arrow parquet reader reads data from parquet files, I remember that even if the data in the parquet files is dictionary encoded (string types with low cardinality), the arrow parquet reader will not convert it to the Arrow Dictionary type. @liukun4515 Could you please help to confirm this?

alamb commented 1 year ago

> When the arrow parquet reader reads data from parquet files, I remember that even if the data in the parquet files is dictionary encoded (string types with low cardinality), the arrow parquet reader will not convert it to the Arrow Dictionary type.

I believe @tustvold changed this a while ago so that the Dictionary is preserved (and then I think datafusion hydrates them) -- https://github.com/apache/arrow-rs/pull/1180