Open alamb opened 3 years ago
I've made a little PR for this. But I'm not sure about how to go about measuring the performance improvements... @alamb do you know of any existing benches in the codebase that would measure this?
alternatively i guess i could add a new bench but i remember you saying you were weary about adding too many new benches 😅
To my knowledge, the current TPC-H and TPC-DS benchmarks do not include this scenario. While the data may be dictionary encoded in storage format, they are expanded (IIRC) to normal arrays after being read into memory.
I agree with @waynexia that this scenario is not covered by any existing datafusion benchmarks I know of
Clickbench has several queries that include count distinct (see for example https://github.com/apache/arrow-datafusion/issues/5276#issuecomment-1432070491) but I am not sure if the input is dictionary encoded.
> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';
> SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;
However, I think with #5166 you could now create a dictionary encoded version with a command like the following (untested as I don't not to have the data downloaded -- data is here https://github.com/ClickHouse/ClickBench/tree/main#data-loading)
CREATE TABLE hits_dictionary as
select
arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') as "RegionID",
"ResolutionWidth",
"UserID",
FROM hits;
I have one question for the parquet stored data. When the arrow parquet reader read data from parquet files, I remember even in parquet the files, the data is dictionary encoded(string types with low cardinality), the arrow parquet reader will not convert it to Arrow Dictionary type. @liukun4515 Could you please help to confirm this ?
When the arrow parquet reader read data from parquet files, I remember even in parquet the files, the data is dictionary encoded(string types with low cardinality), the arrow parquet reader will not convert it to Arrow Dictionary type.
I believe @tustvold changed this a while ago so that the Dictionary is preserved (and then I think datafusion hydrates them) -- https://github.com/apache/arrow-rs/pull/1180
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I have large amounts of low cardinality string data (for example, 200 M rows, but only 20 distinct values). DictionaryArrays are very good for such data as they are space efficient.
https://github.com/apache/arrow-datafusion/pull/256 adds basic query support for distinct dictionary columns but it is not a very computationally efficient imlementation. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set to deduplicate it again. That is a lot of extra hashing work.
Describe the solution you'd like It would likely be much more efficient (especially for arrays that have a small number of distinct values in their dictionary) to look at the values from the dictionary directly, first checking that each entry in the dictionary was actually used.