apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++] Support reading Hadoop-snappy File Format Directly #36608

Open shanjixi opened 1 year ago

shanjixi commented 1 year ago

Describe the enhancement requested

The Hadoop-snappy file format is widely used in big-data and ML frameworks; we use Spark or Hadoop MapReduce to preprocess data for further processing such as data warehousing and deep-learning training.

A Hadoop-snappy file is a compressed file consisting of one or more blocks. Each block consists of the uncompressed length (a big-endian 4-byte integer) followed by one or more subblocks.
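For illustration, here is a rough sketch of decoding that layout on top of Arrow's existing Snappy codec. The helper name and the exact subblock layout (taken from the snzip description: each subblock is prefixed by its own big-endian 4-byte compressed length) are assumptions, not Arrow API:

```cpp
#include <cstdint>
#include <vector>

#include "arrow/result.h"
#include "arrow/util/compression.h"

// Hypothetical helper: decode one hadoop-snappy block starting at `data` and
// append the decompressed bytes to `out`.  Assumed layout per block:
// [uncompressed length, big-endian u32] then subblocks of
// [compressed length, big-endian u32][snappy-compressed bytes].
arrow::Result<const uint8_t*> DecodeHadoopSnappyBlock(
    const uint8_t* data, const uint8_t* end, arrow::util::Codec* codec,
    std::vector<uint8_t>* out) {
  auto read_be32 = [](const uint8_t* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8) | uint32_t(p[3]);
  };
  const uint32_t uncompressed_len = read_be32(data);
  data += 4;
  const size_t old_size = out->size();
  out->resize(old_size + uncompressed_len);
  uint8_t* dst = out->data() + old_size;
  int64_t remaining = uncompressed_len;
  // Subblocks keep coming until the announced uncompressed length is produced.
  while (remaining > 0 && data + 4 <= end) {
    const uint32_t compressed_len = read_be32(data);
    data += 4;
    ARROW_ASSIGN_OR_RAISE(int64_t n,
                          codec->Decompress(compressed_len, data, remaining, dst));
    data += compressed_len;
    dst += n;
    remaining -= n;
  }
  return data;  // start of the next block (or the end of the file)
}
```

The codec itself can be created with `arrow::util::Codec::Create(arrow::Compression::SNAPPY)` and the helper called in a loop until the end of the file is reached.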

Arrow can read a file that is compressed as a whole with the Snappy codec, but it fails to read the Hadoop-snappy file format.

If Arrow could read a Hadoop-snappy file consisting of a series of chunks, we could use the output from Spark directly and save HDFS storage cost. Reading Hadoop-snappy files directly would greatly extend Arrow's usage scenarios.

Within bytedance.com, almost 40% of files are compressed with Snappy and stored in HDFS; we use Arrow to read Hadoop-snappy and Hadoop-zstd files, which saves computing resources greatly.

Component(s)

C++

mapleFU commented 1 year ago

Would you mind providing some details about "hadoop-snappy"? Do you mean https://github.com/kubo/snzip#hadoop-snappy-format ? By the way, Arrow already has Snappy codec support and HDFS datasource support.

shanjixi commented 1 year ago

> Would you mind providing some details about "hadoop-snappy"? Do you mean https://github.com/kubo/snzip#hadoop-snappy-format ? By the way, Arrow already has Snappy codec support and HDFS datasource support.

Yes, I mean enabling Arrow to read the hadoop-snappy file format directly, with the same behavior as kubo/snzip.

westonpace commented 1 year ago

Yes, I think we will need a link or more information. It is not clear to me whether this is a new compression codec (some kind of block-based Snappy), a new file format, or some change in the HDFS filesystem implementation.

shanjixi commented 1 year ago

> some change in the hdfs filesystem implementation.

In Hadoop Common it's implemented as BlockDecompressorStream with SnappyCodec. Here is the link for hadoop-snappy: https://code.google.com/archive/p/hadoop-snappy/

mapleFU commented 1 year ago

Seems it's just SnappyCodec? Can you just try:

  1. https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/compression.h
  2. or https://arrow.apache.org/docs/python/generated/pyarrow.compress.html
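For reference, whole-buffer Snappy (de)compression with the C++ Codec API looks roughly like this (a minimal sketch; the `RoundTrip` helper is only for illustration):

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/compression.h"

// Compress a buffer with Snappy and decompress it again, in one shot.
arrow::Status RoundTrip(const std::string& input) {
  ARROW_ASSIGN_OR_RAISE(std::unique_ptr<arrow::util::Codec> codec,
                        arrow::util::Codec::Create(arrow::Compression::SNAPPY));
  const auto* raw = reinterpret_cast<const uint8_t*>(input.data());
  std::vector<uint8_t> compressed(codec->MaxCompressedLen(input.size(), raw));
  ARROW_ASSIGN_OR_RAISE(int64_t compressed_len,
                        codec->Compress(input.size(), raw, compressed.size(),
                                        compressed.data()));
  std::vector<uint8_t> decompressed(input.size());
  ARROW_ASSIGN_OR_RAISE(int64_t n,
                        codec->Decompress(compressed_len, compressed.data(),
                                          decompressed.size(), decompressed.data()));
  return n == static_cast<int64_t>(input.size())
             ? arrow::Status::OK()
             : arrow::Status::Invalid("unexpected decompressed size");
}
```

Note this only covers a buffer that is Snappy-compressed as a whole; it does not understand any framing around individual blocks.
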
shanjixi commented 1 year ago

> Seems it's just SnappyCodec? Can you just try:
>
>   1. https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/compression.h
>   2. or https://arrow.apache.org/docs/python/generated/pyarrow.compress.html

This is not block-based compress/decompress support. If we read a hadoop-snappy file, we need to decompress the blocks one by one, which is not convenient enough for higher-level components.

If we read blocks from one Hadoop-snappy file, we can certainly decompress each block with the help of "arrow/util/compression.h"; however, we still need to write extra code to provide a streaming view over the series of decompressed Snappy blocks for higher-level components.

westonpace commented 1 year ago

Are you reading CSV files? If not, then I am not sure that a streaming view will be very helpful. We generally rely on random access to metadata. This allows us to implement things like column skipping (projection pushdown). This is why formats like the arrow format and the parquet format support buffer-compression (instead of whole-file compression). It allows the metadata to stay uncompressed.

When you talk about blocks are you talking about the snappy framing format? https://github.com/google/snappy/blob/main/framing_format.txt

shanjixi commented 1 year ago

> When you talk about blocks are you talking about the snappy framing format? https://github.com/google/snappy/blob/main/framing_format.txt

Not CSV files. We are reading a kind of TFRecord file (which has many columns, like CSV, for machine learning) that is compressed into Hadoop-snappy format, e.g. training_instance_0001.tfrecord.snappy.

There are three kinds of Snappy-related compressed file formats (not counting Parquet's internal use of Snappy), and this issue is about kind 3, the "hadoop-snappy file":

1. snappy whole file
2. snappy framing (Google's format, rarely used)
3. hadoop-snappy file (which is widely used in the big-data ecosystem, e.g. Hadoop MR, Spark, Flink Batch, ...)

Format 2 is different from format 3; we can treat format 3 as a general block-based compressed format (the codec for the blocks can be replaced by zstd, gzip and so on).

pitrou commented 1 year ago

@shanjixi I'm curious, what does Arrow bring to the table if the file format (TFR) is not supported by Arrow?

pitrou commented 1 year ago

(I'm horrified that there is even a Hadoop-ZSTD)

shanjixi commented 1 year ago

> supported

A TFRecord file is just a file that contains multiple rows (each row contains multiple Features, like columns). In practice, we use a converter to transform the (compressed) TFRecord file into an arrow::Table. Then we apply the Arrow filter/take/list_flatten functions before sending the data to the TensorFlow workers.
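For context, the post-conversion filtering step looks roughly like this (a minimal sketch; the table and boolean mask stand in for the real data, and `SelectRows` is only an illustrative name):

```cpp
#include <memory>

#include "arrow/api.h"
#include "arrow/compute/api.h"

// Keep only the rows of `table` where `mask` (a boolean array) is true,
// using Arrow's "filter" compute kernel.
arrow::Result<std::shared_ptr<arrow::Table>> SelectRows(
    const std::shared_ptr<arrow::Table>& table,
    const std::shared_ptr<arrow::Array>& mask) {
  ARROW_ASSIGN_OR_RAISE(arrow::Datum filtered,
                        arrow::compute::Filter(table, mask));
  return filtered.table();
}
```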

Arrow is used to read the data from HDFS and do the further processing, avoiding memory-copy cost between these two steps.

What's more, within our company both Hadoop-zstd and Hadoop-snappy files can already be read with Arrow directly (we call these zstd_decompress_inputstream and snappy_decompress_inputstream).

pitrou commented 1 year ago

Ok, see https://github.com/apache/arrow/blob/main/cpp/src/arrow/io/compressed.h

You could pretty easily add a HadoopCompressedInputStream there.
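Such a stream could mirror the shape of the existing `arrow::io::CompressedInputStream::Make(codec, raw)` factory in compressed.h, which handles whole-stream compression. The declaration below is only a hypothetical sketch of what that counterpart might look like, not existing Arrow API:

```cpp
#include <cstdint>
#include <memory>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/compression.h"

// Hypothetical: an InputStream that understands the hadoop block framing
// ([uncompressed length][(compressed length)(compressed data)]*) and exposes
// the decompressed bytes, analogous to arrow::io::CompressedInputStream.
class HadoopCompressedInputStream : public arrow::io::InputStream {
 public:
  static arrow::Result<std::shared_ptr<HadoopCompressedInputStream>> Make(
      arrow::util::Codec* codec, std::shared_ptr<arrow::io::InputStream> raw);

  // arrow::io::InputStream interface: each Read() would pull as many blocks
  // from `raw` as needed and hand back decompressed bytes.
  arrow::Status Close() override;
  bool closed() const override;
  arrow::Result<int64_t> Tell() const override;
  arrow::Result<int64_t> Read(int64_t nbytes, void* out) override;
  arrow::Result<std::shared_ptr<arrow::Buffer>> Read(int64_t nbytes) override;
};
```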