apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

[Format] Expand BYTE_STREAM_SPLIT to support FIXED_LEN_BYTE_ARRAY, INT32 and INT64 #296

Closed asfimport closed 3 months ago

asfimport commented 6 months ago

In PARQUET-1622 we added the BYTE_STREAM_SPLIT encoding which, while simple to implement, allows significant improvements in compression efficiency on FLOAT and DOUBLE columns.

In PARQUET-758 we added the FLOAT16 logical type which annotates a 2-byte-wide FIXED_LEN_BYTE_ARRAY column to denote that it contains 16-bit IEEE binary floating-point (colloquially called "half float").

This issue proposes to widen the set of types supported by the BYTE_STREAM_SPLIT encoding. By allowing the BYTE_STREAM_SPLIT encoding on any FIXED_LEN_BYTE_ARRAY column, we can automatically improve compression efficiency on various column types, including FLOAT16 and DECIMAL columns.
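
For context, the transform itself is byte-width-agnostic: byte j of every value goes into stream j, and the encoded page is the concatenation of the streams. A minimal Python sketch (illustrative only, not the actual Parquet C++ implementation), where `width` would be 4 for FLOAT, 8 for DOUBLE, or N for a FIXED_LEN_BYTE_ARRAY(N) column:

```python
def bss_encode(data: bytes, width: int) -> bytes:
    """Split a page of fixed-width values into `width` byte streams:
    stream j holds byte j of every value, and the encoded page is the
    concatenation of all streams."""
    assert len(data) % width == 0
    return b"".join(data[j::width] for j in range(width))

def bss_decode(encoded: bytes, width: int) -> bytes:
    """Inverse transform: interleave the streams back into values."""
    assert len(encoded) % width == 0
    n = len(encoded) // width  # number of values in the page
    out = bytearray(len(encoded))
    for j in range(width):
        out[j::width] = encoded[j * n:(j + 1) * n]
    return bytes(out)
```

For example, `bss_encode(b"abcdWXYZ", 4)` returns `b"aWbXcYdZ"`: after the split, bytes that occupy the same position in every value sit next to each other, which is what makes the downstream compressor's job easier.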

Here are the compression measurements using the same methodology as above. The number of BYTE_STREAM_SPLIT streams is the respective byte width of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).


| name    | uncompressed | lz4  | bss_lz4 | snappy | bss_snappy | zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
|---------|-------------:|-----:|--------:|-------:|-----------:|-----:|---------:|--------------:|-----------------:|---------------:|
| min_lat |   4996652.00 | 1.00 |    1.01 |   1.00 |       1.03 | 1.05 |     1.12 |          1.01 |             1.03 |           1.07 |
| max_lat |   4996652.00 | 1.00 |    1.01 |   1.00 |       1.03 | 1.05 |     1.13 |          1.01 |             1.03 |           1.07 |
| min_lon |   6245825.00 | 1.00 |    1.14 |   1.00 |       1.16 | 1.15 |     1.31 |          1.14 |             1.16 |           1.14 |
| max_lon |   6245825.00 | 1.00 |    1.14 |   1.00 |       1.16 | 1.15 |     1.31 |          1.14 |             1.16 |           1.14 |

![bss_osm_changesets.png](https://issues.apache.org/jira/secure/attachment/13065327/bss_osm_changesets.png)

### Comments

On this dataset, compression efficiency is generally quite poor, and the BYTE_STREAM_SPLIT encoding brings almost no additional efficiency to the table. OSM changeset entries presumably have geographical coordinates all over the place (literally!), and therefore offer few opportunities for compression.

## Decimal data from an OpenStreetMap region

I've chosen a small region of the world (Belgium) whose geographical coordinates presumably allow for better compression, since they are much more clustered. The file `belgium-latest.osm.pbf` was converted to ORC for easier handling, resulting in a 745 MB ORC file.

I then loaded the decimal columns from the first stripe in that file:

```
pyarrow.RecordBatch
lat: decimal128(9, 7)
lon: decimal128(10, 7)
----
lat: [50.4443865,50.4469017,50.4487890,50.4499558,50.4523446,50.4536530,50.4571053,50.4601436,50.4631197,50.4678563,...,51.1055899,51.1106197,51.1049620,51.1047010,51.1104755,51.0997955,51.1058101,51.1010664,51.1014336,51.1055106]
lon: [3.6857362,3.6965046,3.7074481,3.7173626,3.8126033,3.9033178,3.9193678,3.9253319,3.9292409,3.9332670,...,4.6663214,4.6699997,4.6720536,4.6655159,4.6666372,4.6680394,4.6747172,4.6684242,4.6713693,4.6644899]
```

Here are the compression measurements for these columns. As in the previous dataset, the number of BYTE_STREAM_SPLIT streams is the respective byte width of each FLBA column (i.e., 4 for latitudes and 5 for longitudes).

| name | uncompressed | lz4  | bss_lz4 | snappy | bss_snappy | zstd | bss_zstd | bss_ratio_lz4 | bss_ratio_snappy | bss_ratio_zstd |
|------|-------------:|-----:|--------:|-------:|-----------:|-----:|---------:|--------------:|-----------------:|---------------:|
| lat  |  12103680.00 | 1.00 |    1.63 |   1.00 |       1.63 | 1.18 |     1.73 |          1.63 |             1.63 |           1.47 |
| lon  |  15129600.00 | 1.00 |    1.93 |   1.00 |       1.90 | 1.27 |     2.06 |          1.93 |             1.90 |           1.62 |

![bss_osm_belgium.png](https://issues.apache.org/jira/secure/attachment/13065326/bss_osm_belgium.png)

### Comments

This dataset shows that applying the BYTE_STREAM_SPLIT encoding before compression achieves a very significant additional efficiency gain compared to compression alone.
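
An intuition for this result can be sketched with synthetic coordinates resembling the Belgian latitudes above (hypothetical values, not the actual dataset, with zlib standing in for the page compressor): scaled-decimal values from a small region share their high-order bytes, so after the split those bytes form long near-constant runs that any general-purpose compressor handles extremely well.

```python
import random
import zlib

def bss_encode(data: bytes, width: int) -> bytes:
    # BYTE_STREAM_SPLIT: stream j = byte j of every value, concatenated.
    return b"".join(data[j::width] for j in range(width))

# Synthetic latitudes in a narrow band, like DECIMAL(9, 7) values
# scaled by 10**7 and stored as 4-byte big-endian FLBA values.
random.seed(0)
unscaled = [random.randrange(505_000_000, 512_000_000) for _ in range(2000)]
page = b"".join(v.to_bytes(4, "big") for v in unscaled)

split = bss_encode(page, 4)
n = len(unscaled)
# The high-order stream collapses to a single repeated byte value...
assert len(set(split[:n])) == 1
# ...so splitting before compression shrinks the compressed page.
assert len(zlib.compress(split)) < len(zlib.compress(page))
```

Without the split, those constant high-order bytes are interleaved with the high-entropy low-order bytes, and the compressor cannot exploit them nearly as well.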

## Integer data from two OpenStreetMap data files

I also tried to evaluate the efficiency of BYTE_STREAM_SPLIT on integer columns (INT32 or INT64). Here, however, another efficient encoding is already available (DELTA_BINARY_PACKED), so the evaluation focused on comparing BYTE_STREAM_SPLIT + compression against DELTA_BINARY_PACKED alone.

The comparison was done on the same two OpenStreetMap files as above, using only the first stripe. Here are the measurement results in table format:

![bss_ints_osm_changesets.png](https://issues.apache.org/jira/secure/attachment/13065402/bss_ints_osm_changesets.png)

![bss_ints_osm_belgium.png](https://issues.apache.org/jira/secure/attachment/13065406/bss_ints_osm_belgium.png)

**Caution**: the DELTA_BINARY_PACKED length measurement did not use a real encoder implementation, but a length estimation function written in pure Python. According to quick tests, the estimation function should be accurate.
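
The actual estimation function is not reproduced here; what follows is a rough, hypothetical sketch of how such a pure-Python estimate can be written, following the DELTA_BINARY_PACKED layout (a varint header, then per block a zigzag-varint minimum delta, one bit-width byte per miniblock, and bit-packed `delta - min_delta` values):

```python
def zigzag(v: int) -> int:
    """ZigZag-map a signed 64-bit value to an unsigned one."""
    return (v << 1) ^ (v >> 63)

def varint_len(u: int) -> int:
    """Length in bytes of a ULEB128 varint."""
    return max(1, (u.bit_length() + 6) // 7)

def estimate_dbp_size(values, block=128, miniblocks=4):
    """Rough byte-size estimate for DELTA_BINARY_PACKED
    (a simplified sketch, not the estimator used above)."""
    if not values:
        return 0
    # Header: block size, miniblock count, total count, first value.
    size = (varint_len(block) + varint_len(miniblocks)
            + varint_len(len(values)) + varint_len(zigzag(values[0])))
    deltas = [b - a for a, b in zip(values, values[1:])]
    per_mini = block // miniblocks
    for i in range(0, len(deltas), block):
        blk = deltas[i:i + block]
        mn = min(blk)
        size += varint_len(zigzag(mn)) + miniblocks  # min delta + widths
        for j in range(0, len(blk), per_mini):
            mb = blk[j:j + per_mini]
            width = max(d - mn for d in mb).bit_length()
            size += (width * per_mini + 7) // 8  # bit-packed miniblock
    return size
```

On a monotonic timestamp-like column with small, regular deltas, such an estimate drops far below the raw 8 bytes per value, which is exactly the regime where DELTA_BINARY_PACKED shines.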

### Comments

The results are very heterogeneous, depending on the kind of data those integer columns represent.

Some columns achieve very good compression ratios, far above 10x, with all methods. For these columns, it does not make sense to compare compression ratios, since the column sizes will be very small in all cases; performance and interoperability should be the only concerns.

On other columns, the compression ratios are more moderate, and BYTE_STREAM_SPLIT + compression seems preferable to DELTA_BINARY_PACKED.

## Integer data from a PyPI archive file

I downloaded one of the "index" Parquet files from https://github.com/pypi-data/data/releases and read the first row group. The measurement results are as follows:

![bss_ints_pypi.png](https://issues.apache.org/jira/secure/attachment/13065404/bss_ints_pypi.png)

### Comments

On this data, BYTE_STREAM_SPLIT + compression is clearly better than DELTA_BINARY_PACKED. The timestamp column ("uploaded_on") in particular shows very strong benefits.

## Integer data from a NYC "yellow" taxi file

I downloaded one of the "yellow" taxi trip record files from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and read the first row group. The measurement results are as follows:

![bss_ints_nyctaxi.png](https://issues.apache.org/jira/secure/attachment/13065403/bss_ints_nyctaxi.png)

### Comments

These results are a bit of a mixed bag. Only BYTE_STREAM_SPLIT + zstd is consistently superior to DELTA_BINARY_PACKED. However, if one focuses on the timestamp columns, then all three general-purpose compressors provide a benefit.

## Discussion

When reading these results, it is important to keep in mind that the exact compression ratios do not necessarily matter, as long as the efficiency is high enough. A compressor that achieves 100x compression on a column is not necessarily worse than one that achieves 300x compression on the same column: both are "good enough" on this particular data. Conversely, when compression ratios are moderate (lower than 10x), they should certainly be compared.

### Efficiency

#### Efficiency on FIXED_LEN_BYTE_ARRAY data

These examples show that extending the BYTE_STREAM_SPLIT encoding to FIXED_LEN_BYTE_ARRAY columns (even regardless of their logical types) can yield very significant compression efficiency improvements on two specific kinds of FIXED_LEN_BYTE_ARRAY data: FLOAT16 data and DECIMAL data.

#### Efficiency on INT32 / INT64 data

Extending the BYTE_STREAM_SPLIT encoding to INT32 and INT64 columns can bring significant benefits over DELTA_BINARY_PACKED. However, whether it does, and by how much, depends on the kind of data that is encoded as integers. Timestamps seem to always benefit from BYTE_STREAM_SPLIT encoding. Pairing BYTE_STREAM_SPLIT with zstd also generally achieves higher efficiency than DELTA_BINARY_PACKED.

In practice, the choice between BYTE_STREAM_SPLIT + compression and DELTA_BINARY_PACKED will have to be informed by several factors, such as performance expectations and interoperability. Sophisticated writers might also implement some form of sampling to find the best encoding + compression combination for a given column.
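
Such sampling could be as simple as the following hypothetical writer heuristic, which trial-compresses a sample page in both layouts and keeps the cheaper one (zlib stands in here for whichever codec the writer is configured with; `choose_encoding` and its threshold are assumptions, not an API of any real writer):

```python
import zlib

def bss_encode(data: bytes, width: int) -> bytes:
    # BYTE_STREAM_SPLIT: stream j = byte j of every value, concatenated.
    return b"".join(data[j::width] for j in range(width))

def choose_encoding(sample: bytes, width: int) -> str:
    """Pick an encoding for a column chunk by trial-compressing a
    small sample of its pages both ways."""
    plain_size = len(zlib.compress(sample))
    split_size = len(zlib.compress(bss_encode(sample, width)))
    return "BYTE_STREAM_SPLIT" if split_size < plain_size else "PLAIN"
```

A real writer would of course also weigh encoding speed and reader compatibility, as discussed above, rather than compressed size alone.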

**Note**: all tested data above is actually INT64. However, given the mechanics of BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED, we can assume that similar results would have been obtained for INT32 data.

### Performance

Since BYTE_STREAM_SPLIT only brings benefits in combination with compression, the overall encoding + compression cost should be considered.

#### Performance on FIXED_LEN_BYTE_ARRAY data

The choice is between BYTE_STREAM_SPLIT + compression and compression alone. Even a non-SIMD-optimized version of BYTE_STREAM_SPLIT, such as the one in Parquet C++, can achieve multiple GB/s; there is little reason to pay the cost of compression but refuse to pay the much smaller cost of the BYTE_STREAM_SPLIT encoding step.

#### Performance on INT32 / INT64 data

The choice is between BYTE_STREAM_SPLIT + compression and DELTA_BINARY_PACKED alone. DELTA_BINARY_PACKED has a significant performance edge: the current Parquet C++ implementation encodes between 600 MB/s and 2 GB/s, and decodes between 3 and 6 GB/s. This is faster than any of the general-purpose compression schemes available in Parquet, even lz4.

### Implementation complexity

BYTE_STREAM_SPLIT, even in a byte-width-agnostic form, is almost trivial to implement. A simple implementation can yield good performance with a minimum of work.

For example, the non-SIMD-optimized BYTE_STREAM_SPLIT encoding and decoding routines in Parquet C++ amount to a total of ~200 lines of code, despite explicitly unrolled loops:
https://github.com/apache/arrow/blob/4e58f7ca0016c2b2d8a859a0c5965df3b15523e0/cpp/src/arrow/util/byte_stream_split_internal.h#L593-L702

**Reporter**: [Antoine Pitrou](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=apitrou) / @pitrou
**Assignee**: [Antoine Pitrou](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=apitrou) / @pitrou
#### Related issues:
- [Add BYTE_STREAM_SPLIT support for FIXED_LEN_BYTE_ARRAY, INT32 and INT64](https://github.com/apache/parquet-java/issues/2890) (is depended upon by)
#### Original Issue Attachments:
- [bss_fp16.png](https://issues.apache.org/jira/secure/attachment/13065328/bss_fp16.png)
- [bss_ints_nyctaxi.png](https://issues.apache.org/jira/secure/attachment/13065403/bss_ints_nyctaxi.png)
- [bss_ints_osm_belgium.png](https://issues.apache.org/jira/secure/attachment/13065406/bss_ints_osm_belgium.png)
- [bss_ints_osm_changesets.png](https://issues.apache.org/jira/secure/attachment/13065402/bss_ints_osm_changesets.png)
- [bss_ints_pypi.png](https://issues.apache.org/jira/secure/attachment/13065404/bss_ints_pypi.png)
- [bss_osm_belgium.png](https://issues.apache.org/jira/secure/attachment/13065326/bss_osm_belgium.png)
- [bss_osm_changesets.png](https://issues.apache.org/jira/secure/attachment/13065327/bss_osm_changesets.png)
#### PRs and other links:
- [GitHub Pull Request #229](https://github.com/apache/parquet-format/pull/229)
- [GitHub Pull Request #46](https://github.com/apache/parquet-testing/pull/46)

<sub>**Note**: *This issue was originally created as [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414). Please see the [migration documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further details.*</sub>
asfimport commented 6 months ago

Antoine Pitrou / @pitrou: cc @anjakefala   @gszadovszky @wgtmac @martinradev  

asfimport commented 6 months ago

Gang Wu / @wgtmac: The experiment result looks promising!

BTW, I have two questions:

  1. Should we limit the extension to only the FLOAT16 and DECIMAL logical types? I admit that it would be much simpler to support all FLBA logical types, but for other, non-numeric logical types the gain is harder to predict.
  2. Should we extend it to support decimals with INT32 and INT64 physical types? I would expect a similar gain.
asfimport commented 6 months ago

Gabor Szadovszky / @gszadovszky: Thanks a lot for working on this, @pitrou.

I agree with @wgtmac: if we support FIXED_LEN_BYTE_ARRAY(DECIMAL), why wouldn't we do so for the INT32 and INT64 representations? I think that, from a spec point of view, we are fine extending BYTE_STREAM_SPLIT to additional types. The question is how broadly this encoding is supported. parquet-mr already supports turning it on manually for FP types. Do we want to keep it manually switchable for writers for now? (We might need a more sophisticated approach for the switch...)

asfimport commented 6 months ago

Antoine Pitrou / @pitrou:

> Should we limit the extension to only FLOAT16 and DECIMAL logical types?

I think that's a reasonable choice for writers to make, but I'm not sure the spec should mandate it.

> Should we extend it to support decimal of INT32 and INT64 physical types? I would expect similar gain.

Those two types can use DELTA_BINARY_PACKED, which should generally give very good results. I have no idea whether BYTE_STREAM_SPLIT + compression could be better in some cases.

asfimport commented 6 months ago

Antoine Pitrou / @pitrou:

> Do we want to keep it manually switchable for the writers for now? (We might need a more sophisticated approach for the switch...)

IMHO the only downside with enabling it always is compatibility with older readers. Otherwise, I would say the choice is a no-brainer.

asfimport commented 6 months ago

Antoine Pitrou / @pitrou: Ok, I've run some tests on INT32 / INT64, and it turns out that there are benefits in some (though not all) cases. See the updated text.

asfimport commented 6 months ago

Micah Kornfield / @emkornfield: This seems like a good change to me.

asfimport commented 5 months ago

Antoine Pitrou / @pitrou: I've opened a PR to parquet-format in https://github.com/apache/parquet-format/pull/229

asfimport commented 3 months ago

Antoine Pitrou / @pitrou: The VOTE thread is now open at https://lists.apache.org/thread/nlsj0ftxy7y4ov1678rgy5zc7dmogg6q

@wesm @rdblue You both opined on the original BYTE_STREAM_SPLIT vote; would you like to give your opinion on whether to extend the encoding's applicability as proposed in the thread linked above? (Please do not feel pressured if you have no interest in this!)

asfimport commented 3 months ago

Antoine Pitrou / @pitrou: The VOTE thread passed successfully: https://lists.apache.org/thread/4mof6ghglxzkvtxxmfc206s5g5d7f8zy

asfimport commented 3 months ago

Antoine Pitrou / @pitrou: The format and testing additions are now merged, so this issue is resolved.