Open asfimport opened 7 years ago
Ryan Blue / @rdblue: Dictionary encoding usually produces better results than delta encoding. But, the dictionary fall-back is based on what the plain encoding would do, so it is biased toward dictionary encoding. Do you think that the data would be smaller with delta rather than dictionary?
Jakub Liska: Certainly: it is a timestamp column, so the values have very high cardinality; almost every value is unique unless records occurred within the same millisecond. Delta encoding seems to be the only suitable encoding for this column.
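To illustrate why delta encoding suits such a column, here is a minimal sketch (the class and method names are hypothetical, not parquet-mr code): consecutive millisecond timestamps differ by tiny amounts, so an encoding like DELTA_BINARY_PACKED can bit-pack the deltas into a few bits each instead of storing 64 raw bits per value.

```java
import java.util.Arrays;

// Illustrative sketch: deltas between consecutive millisecond timestamps
// are tiny compared to the raw 64-bit values.
public class TimestampDeltas {
    // Returns the deltas between consecutive values; the first value is kept as-is.
    static long[] deltas(long[] timestamps) {
        long[] out = new long[timestamps.length];
        out[0] = timestamps[0];
        for (int i = 1; i < timestamps.length; i++) {
            out[i] = timestamps[i] - timestamps[i - 1];
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {1481540000000L, 1481540000003L, 1481540000003L, 1481540000010L};
        // Everything after the first element fits in a handful of bits.
        System.out.println(Arrays.toString(deltas(ts)));
    }
}
```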
Jakub Liska: I'm really confused: it looks like the boolean enableDictionary determines whether dictionary encoding is used for all columns or for none of them.
@rdblue I think the correct behavior would be to prioritize dictionary encoding only for enums:
if (type == OriginalType.ENUM && parquetProperties.isEnableDictionary())
Not for all columns. Unfortunately, the OriginalType information is not available at that point.
Jakub Liska: The good thing is that if I declare the INT64 column as OriginalType.TIMESTAMP_MILLIS, then delta encoding is used, so it all worked out for my purpose.
I just cannot find the place in the Parquet codebase where this is determined, so for a newcomer it is rather hard to figure out how to get a particular encoding used for a column.
So far I have learnt that:
- ENUM → RLE_DICTIONARY
- TIMESTAMP_MILLIS → DELTA_BINARY_PACKED
- others, like BINARY annotated as UTF8 → DELTA_BYTE_ARRAY
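The observed mapping can be summarized in a small sketch. This is not parquet-mr code; the String-based types and the `encodingFor` method are hypothetical simplifications of what the default writer factory was observed to pick:

```java
// Sketch of the observed mapping from logical (original) type to the
// encoding the default writer factory selects. Names are illustrative only.
public class ObservedEncodings {
    static String encodingFor(String originalType) {
        switch (originalType) {
            case "ENUM":             return "RLE_DICTIONARY";
            case "TIMESTAMP_MILLIS": return "DELTA_BINARY_PACKED";
            case "UTF8":             return "DELTA_BYTE_ARRAY";
            default:                 return "PLAIN"; // assumed default, for illustration
        }
    }

    public static void main(String[] args) {
        System.out.println("ENUM -> " + encodingFor("ENUM"));
    }
}
```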
Hans Brende: Hi, I made a couple of comments related to this issue here: https://github.com/apache/parquet-mr/pull/342#issuecomment-368341324
Any timeline on when this issue will be resolved? I'd be glad to fix it myself if you want.
Also, @rdblue, why not just ditch the parquet.enable.dictionary
setting altogether and make the condition:
if (type == OriginalType.ENUM) {
//use dictionary encoding
} else {
//don't use dictionary encoding
}
Ryan Blue / @rdblue: I don't recommend using the delta long encoding because I think we need to update to better encodings (specifically, the zig-zag-encoding ones in this branch).
We could definitely use a better fallback, but I don't think the solution is to turn off dictionary encoding. If dictionary encoding gives you a smaller size, you should use it. The problem is how dictionary encoding tests whether another encoding would be better: it currently tests against plain encoding and falls back to plain. It should instead test against a delta encoding and fall back to that.
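The fallback decision described above can be sketched as follows. The sizes are illustrative byte-count estimates and the class is hypothetical, not parquet-mr code; it only contrasts the current plain-only fallback with the proposed delta-aware one:

```java
// Sketch of the dictionary fallback decision. Current behavior only
// compares against plain; the proposed behavior also considers delta.
public class FallbackSketch {
    // Current behavior: the only fallback alternative considered is plain.
    static String chooseEncoding(long dictSize, long plainSize) {
        return dictSize < plainSize ? "RLE_DICTIONARY" : "PLAIN";
    }

    // Proposed behavior: also test against a delta encoding and fall back to it.
    static String improvedChoice(long dictSize, long plainSize, long deltaSize) {
        long best = Math.min(dictSize, Math.min(plainSize, deltaSize));
        if (best == dictSize) return "RLE_DICTIONARY";
        return best == deltaSize ? "DELTA_BINARY_PACKED" : "PLAIN";
    }

    public static void main(String[] args) {
        // With a high-cardinality column, the dictionary grows large and
        // the delta estimate wins under the improved choice.
        System.out.println(improvedChoice(200, 150, 60));
    }
}
```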
This kind of improvement is why we added PARQUET-601. We want to be able to test out different ways of choosing an encoding at write time. But we do not want to make it so that users must specify their own encodings because we want Parquet to select them automatically and get the choice right. PARQUET-601 is about testing out strategies that we release as the defaults.
Hans Brende: @rdblue I agree with everything you said; that sounds like a good strategy!
But until this strategy is implemented, I am still forced to resort to hackish workarounds to manually turn off dictionary encoding for specific columns.
Any idea on some sort of timeline of when your solution will be implemented? I'd be glad to help if there's anything I can do.
The current code doesn't allow using both delta encoding and dictionary encoding. If I instantiate ParquetWriter like this:
then this piece of code: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
causes DictionaryValuesWriter to be used instead of the inferred DeltaLongEncodingWriter.
The original issue is here : https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768
Reporter: Jakub Liska
Note: This issue was originally created as PARQUET-796. Please see the migration documentation for further details.