apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Efficient storage for several INT_8 and INT_16 #2034

Open asfimport opened 7 years ago

asfimport commented 7 years ago

In very large datasets, aggregating several INT_8 values into INT32 fields (or a byte array) can make a big difference. In Parquet, efficient algorithms exist for INT32, so if the logical type is INT_8 the encoded int might take up only one byte.

However, further optimizations could be made by allowing the user to specify the types more precisely. What about a BYTE_ARRAY logical type backed by the FIXED_LEN_BYTE_ARRAY physical type (or eventually INT32)?
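The aggregation the reporter describes can be sketched as packing four signed 8-bit values into one 32-bit int. This is a minimal illustration of the idea, not part of any Parquet API; the class and method names are hypothetical.

```java
// Hypothetical sketch: pack four signed INT8 values into a single INT32,
// as described in the issue. Not a Parquet API.
class Int8Packing {
    // Pack four bytes, b0 in the lowest-order byte.
    static int pack(byte b0, byte b1, byte b2, byte b3) {
        return (b0 & 0xFF)
             | (b1 & 0xFF) << 8
             | (b2 & 0xFF) << 16
             | (b3 & 0xFF) << 24;
    }

    // Recover byte i (0..3) from the packed int.
    static byte unpack(int packed, int i) {
        return (byte) (packed >> (8 * i));
    }
}
```

As Uwe points out below, this kind of packing tends to hurt Parquet's own encodings, which work best on homogeneous columns of small values.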

Reporter: Fernando Pereira

Note: This issue was originally created as PARQUET-845. Please see the migration documentation for further details.

asfimport commented 7 years ago

Uwe Korn / @xhochy: Storage-wise it should not make a difference whether you have an INT8 or an INT32 physical type. Putting four INT8s into a single INT32 would actually decrease Parquet's efficiency, as some of the encoding "tricks" aren't as effective anymore. (Usually my INT8 columns take less than a bit per row when stored in Parquet.)

Or are you maybe talking about a particular API that should return INT8s instead of INT32s?

asfimport commented 6 years ago

Fernando Pereira: I'm coming back to this issue, so hopefully we can close it either as invalid or as a feature request :) In terms of logical types, we have INT8, INT16, etc., which sound fine to me.

My question was about efficient storage, and whether Parquet already chooses efficient encoders by default. If the user doesn't specify any encoding, does parquet-cpp use the most advanced delta encodings for, e.g., INT8 logical types? Are there situations where it falls back to PLAIN encoding and uses 32 physical bits? The same goes for a field that is an array of INT8: will it use any run-length encoder? [This was the initial question, actually.]

PS: my question especially targets parquet-cpp, even though I'm interested in the "standard" behavior too. Thanks so much.

asfimport commented 6 years ago

Ryan Blue / @rdblue: Parquet will use an efficient encoding by default, and you can see the final size by looking at the file metadata with parquet-cli or the older parquet-tools. As Uwe said, most of the time Parquet will get compression that is much better than storing individual values. However, Parquet doesn't use delta encoding by default at the moment because that encoding isn't finished in terms of the Parquet spec. It will use dictionary encoding and run-length encoding, but for integer pages it currently just uses plain encoding and generic compression. This is a good balance of compression and the cost to encode/decode, so we haven't had much trouble with it.
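The dictionary encoding Ryan mentions can be sketched as follows: each distinct value goes into a dictionary once, and every row stores only a small index, which RLE/bit-packing then compresses further. This is a simplified illustration of the general technique, not Parquet's actual implementation; the names are hypothetical, and the RLE/bit-packing of the index stream is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of dictionary encoding: replace each value with
// an index into a table of distinct values. Parquet additionally
// RLE/bit-packs the indices; that step is omitted here.
class DictEncodeDemo {
    static int[] encode(int[] values, List<Integer> dictOut) {
        Map<Integer, Integer> index = new HashMap<>();
        int[] ids = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer id = index.get(values[i]);
            if (id == null) {
                id = dictOut.size();          // assign the next dictionary slot
                dictOut.add(values[i]);
                index.put(values[i], id);
            }
            ids[i] = id;
        }
        return ids;
    }
}
```

For a low-cardinality INT8 column the dictionary stays tiny and each index needs only a few bits, which is why such columns can come out at under a bit per row after RLE, as Uwe observed above.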

I'm interested in using delta encoding by default, but there is some work that needs to be done before we get there, and we will need rules to decide when to use it and when to fall back to plain. You're welcome to contribute here.

With that in mind: yes, Parquet chooses efficient encoders by default. If your question is really whether delta encoding is used, then no, Parquet will not use delta encoding by default.

asfimport commented 6 years ago

Fernando Pereira: Great, thanks for the clarification! I would be happy to contribute! Would you mind explaining in more detail the "work that needs to be done before we get there"?

asfimport commented 6 years ago

Ryan Blue / @rdblue: The main blocker for delta encoding is that we haven't finalized the spec for the set of 2.0 encodings. Current releases are backward-compatible, but we don't guarantee forward-compatibility if you use the current set of 2.0 encodings: if you upgrade to a new version in the future, you might start writing files that aren't supported by current readers. (We do guarantee that new readers will be able to read files written by older ones.) To make that forward-compatibility guarantee, we want to lock down what writers should produce.

What writers should produce for delta encoding is still undecided. The delta encoding implementation isn't based on the RLE encoding (a combination of bit packing and run-length encoding) that Parquet uses in many places, because the RLE encoding doesn't support negative integers; instead, it is a complicated custom encoding. I've proposed an alternative: zig-zag encode to support negative numbers, then use the existing RLE encoding, and layer deltas on top of that. Those encodings are in a branch: https://github.com/rdblue/parquet-mr/commit/89b4f16bdfd3817ece42049748745a3b22b83335
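The zig-zag step in this proposal maps signed integers to unsigned codes so that small magnitudes of either sign get small codes, which the existing RLE/bit-packed encoding can then handle. A sketch of the standard zig-zag transform (the same one Protocol Buffers uses); the class name is illustrative:

```java
// Zig-zag transform: maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
// so that small deltas of either sign become small unsigned codes.
class ZigZag {
    static int encode(int n) {
        return (n << 1) ^ (n >> 31);   // arithmetic shift propagates the sign bit
    }

    static int decode(int z) {
        return (z >>> 1) ^ -(z & 1);   // logical shift, then restore the sign
    }
}
```

After this transform, a run of small deltas (positive or negative) bit-packs into very few bits per value, which is what makes layering deltas on top of RLE attractive.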

I think the current blocker is for people to get time to evaluate the encodings and discuss them somewhere to decide. If you'd like to test out the encodings and push on this issue, that would be a great place to help out. Thanks!

asfimport commented 6 years ago

Ryan Blue / @rdblue: Here's my initial write-up of the encodings I'm proposing: https://lists.apache.org/thread.html/8fc11a8e1538b477162eed2a89946e49dbdcf595b5c7fbe80533432d@%3Cdev.parquet.apache.org%3E