asfimport closed this issue 3 months ago
Antoine Pitrou / @pitrou: cc @anjakefala @gszadovszky @wgtmac @martinradev
Gang Wu / @wgtmac: The experiment result looks promising!
BTW, I have two questions:
Gabor Szadovszky / @gszadovszky: Thanks a lot for working on this, @pitrou.
I agree with @wgtmac: if we support FIXED_LEN_BYTE_ARRAY(DECIMAL), why wouldn't we do so for the INT32 and INT64 representations? I think that, from the spec's point of view, we are fine extending BYTE_STREAM_SPLIT to additional types. The question is how broadly this encoding is supported. parquet-mr already supports turning it on manually for FP types. Do we want to keep it manually switchable for the writers for now? (We might need a more sophisticated approach for the switch...)
Antoine Pitrou / @pitrou:
Should we limit the extension to only FLOAT16 and DECIMAL logical types?
I think that's a reasonable choice for writers to do, but I'm not sure the spec should mandate it.
Should we extend it to support decimal of INT32 and INT64 physical types? I would expect similar gain.
Those two types can use DELTA_BINARY_PACKED, which should generally give very good results. I have no idea whether BYTE_STREAM_SPLIT + compression could be better in some cases.
Antoine Pitrou / @pitrou:
Do we want to keep it manually switchable for the writers for now? (We might need a more sophisticated approach for the switch...)
IMHO the only downside with enabling it always is compatibility with older readers. Otherwise, I would say the choice is a no-brainer.
Antoine Pitrou / @pitrou: Ok, I've run some tests on INT32 / INT64 and it turns out that there are some benefits in some (not all) cases. See updated text.
Micah Kornfield / @emkornfield: This seems like a good change to me.
Antoine Pitrou / @pitrou: I've opened a PR to parquet-format in https://github.com/apache/parquet-format/pull/229
Antoine Pitrou / @pitrou: The VOTE thread is now open at https://lists.apache.org/thread/nlsj0ftxy7y4ov1678rgy5zc7dmogg6q
@wesm @rdblue You both opined on the original BYTE_STREAM_SPLIT vote. Would you like to give your opinion on whether to extend the encoding's applicability, as proposed in the thread I linked above? (Please do not feel pressured if you have no interest in this!)
Antoine Pitrou / @pitrou: The VOTE thread has passed successfully at https://lists.apache.org/thread/4mof6ghglxzkvtxxmfc206s5g5d7f8zy
Antoine Pitrou / @pitrou: The format and testing additions are now merged, so this issue is resolved.
In PARQUET-1622 we added the BYTE_STREAM_SPLIT encoding which, while simple to implement, can significantly improve compression efficiency on FLOAT and DOUBLE columns.
In PARQUET-758 we added the FLOAT16 logical type, which annotates a 2-byte-wide FIXED_LEN_BYTE_ARRAY column to denote that it contains 16-bit IEEE binary floating-point values (colloquially called "half float").
This issue proposes to widen the types supported by the BYTE_STREAM_SPLIT encoding. By allowing the BYTE_STREAM_SPLIT encoding on any FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression efficiency on various column types including:
half-float data
fixed-width decimal data
Also, by allowing the BYTE_STREAM_SPLIT encoding on any INT32 or INT64 column, we can improve compression efficiency on further column types such as timestamps.
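To make the proposal concrete, here is a minimal pure-Python sketch of what BYTE_STREAM_SPLIT does to a column of fixed-width values: byte k of every value goes to stream k, and the streams are concatenated. The function names are made up for illustration and this is not how any Parquet implementation is actually written; it only assumes the values are already laid out back to back as raw bytes.

```python
def byte_stream_split_encode(data: bytes, width: int) -> bytes:
    """Scatter byte k of each width-byte value into stream k, then concatenate the streams."""
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    # data[k::width] picks byte k of every value, i.e. one stream.
    return b"".join(data[k::width] for k in range(width))


def byte_stream_split_decode(encoded: bytes, width: int) -> bytes:
    """Inverse of byte_stream_split_encode: interleave the streams back into values."""
    if width <= 0 or len(encoded) % width != 0:
        raise ValueError("encoded length must be a multiple of the value width")
    count = len(encoded) // width  # number of values, also the length of each stream
    out = bytearray(len(encoded))
    for k in range(width):
        out[k::width] = encoded[k * count:(k + 1) * count]
    return bytes(out)
```

With width=2 this covers a FLOAT16 column, with width=4 or 8 an INT32 or INT64 column, and with any other width a FIXED_LEN_BYTE_ARRAY column of that byte length.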
I've run compression measurements on various pieces of sample data which I detail below.
Float16 data
I've downloaded the sample datasets from https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPsingle/ , uncompressed them and converted them to half-float using NumPy. Two files had to be discarded because of overflow when converting to half-float.
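For illustration, the conversion and the overflow check could look roughly like the sketch below. Note that this is a stand-in using only the standard library `struct` module rather than NumPy, and it assumes the input is a buffer of little-endian float32 values; the function name is hypothetical. `struct` raises `OverflowError` for finite values that do not fit in a half-float, whereas a NumPy `astype` conversion would produce infinities instead, so the detection step would differ there.

```python
import struct
from typing import Optional


def float32_to_float16_bytes(raw: bytes) -> Optional[bytes]:
    """Convert little-endian float32 data to float16, or return None on overflow."""
    out = bytearray()
    for (value,) in struct.iter_unpack("<f", raw):
        try:
            # "<e" is the IEEE 754 binary16 (half-float) format.
            out += struct.pack("<e", value)
        except OverflowError:
            # A finite value is too large for half-float: discard the whole file.
            return None
    return bytes(out)
```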
I've then run three different compression algorithms (lz4, zstd, snappy), optionally preceded by a BYTE_STREAM_SPLIT encoding with 2 streams (corresponding to the byte width of the FLBA columns). Here are the results:
Explanation:
the columns "lz4", "snappy", "zstd" show the compression ratio achieved with the respective compressors (i.e. uncompressed size divided by compressed size)
the columns "bss_lz4", "bss_snappy", "bss_zstd" are similar, but with a BYTE_STREAM_SPLIT encoding applied first
the columns "bss_ratio_lz4", "bss_ratio_snappy", "bss_ratio_zstd" show the additional compression ratio achieved by prepending the BYTE_STREAM_SPLIT encoding step (i.e. PLAIN-encoded compressed size divided by BYTE_STREAM_SPLIT-encoded compressed size).
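As a sketch of how the three kinds of columns relate, the following computes them for a single codec. It uses `zlib` from the standard library purely as a stand-in compressor (the measurements here were made with lz4, snappy and zstd, not zlib), and the function name and dictionary keys are hypothetical.

```python
import zlib


def measure(data: bytes, width: int, compress=zlib.compress) -> dict:
    """Compute the plain ratio, the BYTE_STREAM_SPLIT ratio, and the additional gain."""
    if width <= 0 or len(data) % width != 0:
        raise ValueError("data length must be a multiple of the value width")
    # BYTE_STREAM_SPLIT: byte k of every value goes to stream k.
    split = b"".join(data[k::width] for k in range(width))
    plain_size = len(compress(data))
    bss_size = len(compress(split))
    return {
        "codec": len(data) / plain_size,          # uncompressed / compressed
        "bss_codec": len(data) / bss_size,        # same, with BYTE_STREAM_SPLIT first
        "bss_ratio_codec": plain_size / bss_size, # additional gain from the encoding
    }
```

By construction, the "bss_ratio" value equals the "bss" ratio divided by the plain ratio, so a value above 1 means the encoding step helped for that codec.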
(reference) Float32 data
For reference, here are the measurements for the original single-precision floating-point data.
Comments
The additional efficiency of the BYTE_STREAM_SPLIT encoding step is very significant on most files (except obs_temp.sp, which generally doesn't compress at all), with additional gains usually around 30%. The BYTE_STREAM_SPLIT encoding is, perhaps surprisingly, on average as beneficial on Float16 data as it is on Float32 data.
Decimal data from OpenStreetMap changesets
I've downloaded one of the recent OSM changesets files, changesets-231030.orc, and loaded the four decimal columns from the first stripe of that file. Those columns look like:
{code}
pyarrow.RecordBatch
min_lat: decimal128(9, 7)
max_lat: decimal128(9, 7)
min_lon: decimal128(10, 7)
max_lon: decimal128(10, 7)
min_lat: [51.5288506,51.0025063,51.5326805,51.5248871,51.5266800,51.5261841,51.5264130,51.5238914,59.9463692,59.9513092,...,50.8238277,52.1707376,44.2701598,53.1589748,43.5988333,37.7867167,45.5448822,null,50.7998334,50.5653478]
max_lat: [51.5288620,51.0047760,51.5333176,51.5289383,51.5291901,51.5300598,51.5264130,51.5238914,59.9525642,59.9561501,...,50.8480772,52.1714300,44.3790161,53.1616817,43.6001496,37.7867913,45.5532716,null,51.0188961,50.5691352]
min_lon: [-0.1465242,-1.0052705,-0.1566335,-0.1485492,-0.1418076,-0.1550623,-0.1539768,-0.1432930,10.7782278,10.7719727,...,10.6863813,13.2218676,19.8840738,8.9128186,1.4030591,-122.4212761,18.6789571,null,-4.2085209,8.6851671]
max_lon: [-0.1464925,-0.9943439,-0.1541054,-0.1413791,-0.1411505,-0.1453212,-0.1539768,-0.1432930,10.7898550,10.7994537,...,10.7393494,13.2298706,20.2262343,8.9183611,1.4159345,-122.4212503,18.6961594,null,-4.0496079,8.6879264]
{code}
Here are the compression measurements using the same methodology as above. The number of BYTE_STREAM_SPLIT streams is the respective byte width of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).
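The byte widths quoted above follow from the decimal precisions, assuming the writer picks the smallest FIXED_LEN_BYTE_ARRAY width whose two's-complement range can hold every unscaled value of that precision. A small sketch of that arithmetic (the helper name is hypothetical):

```python
def min_flba_width(precision: int) -> int:
    """Smallest byte width whose signed range holds any decimal of the given precision."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    max_unscaled = 10 ** precision - 1  # largest unscaled magnitude, e.g. 999999999
    width = 1
    # A signed integer of `width` bytes holds values up to 2**(8*width - 1) - 1.
    while 2 ** (8 * width - 1) - 1 < max_unscaled:
        width += 1
    return width
```

For the columns above, `min_flba_width(9)` gives 4 for the decimal128(9, 7) latitudes and `min_flba_width(10)` gives 5 for the decimal128(10, 7) longitudes.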