apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Process parquet rowgroups without Arrow conversion #35638

Open alippai opened 1 year ago

alippai commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

I'd like to read a Parquet file and create a new Parquet file based on it, with an Arrow table appended as a new row group. Can I read the Parquet file row group by row group, decide to drop or keep each one, and assemble a new Parquet file without doing the (de)serialization to Arrow?
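
For reference, a minimal sketch of what is possible today with pyarrow. It goes through Arrow tables, which is exactly the (de)serialization step the question asks to avoid; file names and the toy schema are hypothetical:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table to append; assumed to match old.parquet's schema.
new_table = pa.table({"date": [20230518], "value": [1.0]})

src = pq.ParquetFile("old.parquet")  # hypothetical input file
with pq.ParquetWriter("new.parquet", src.schema_arrow) as writer:
    for i in range(src.num_row_groups):
        # Each round trip decodes the row group to Arrow and re-encodes
        # it -- exactly the (de)serialization this issue asks to avoid.
        writer.write_table(src.read_row_group(i))
    writer.write_table(new_table)  # appended as new row group(s)
```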

Component(s)

C++, Parquet, Python

mapleFU commented 1 year ago

It seems you want an "append" operation, and want to avoid read -> convert to Arrow -> write back?

I guess the current Parquet code cannot support this :-(

alippai commented 1 year ago

Yes, something like that. My use case is writing data to a small parquet file daily, changing the last 3 days. I don't have exact numbers to support this extra API yet, but wanted to ask first.

I can imagine that keeping/dropping row groups based on stats or appending new row groups is not a common case - feel free to close the issue.

alippai commented 1 year ago

Speaking of this... is it good practice to use row groups instead of hive partitions, or is that considered an anti-pattern for parquet? Would it be a good addition to pyarrow dataset to optionally ensure that each parquet row group contains only one partition?

westonpace commented 1 year ago

> My use case is writing data to a small parquet file daily, changing the last 3 days. I don't have exact numbers to support this extra API yet, but wanted to ask first.
>
> I can imagine that keeping/dropping row groups based on stats or appending new row groups is not a common case - feel free to close the issue.

I would say that it is a very common thing for users to want to do. However, parquet is often not the correct layer of abstraction for this capability. For example, table formats like Iceberg, Delta Lake, and Hudi have all come up with ways to handle this.

Appending data to existing parquet files has been asked for several times. I've seen arguments that it is simply not possible without rewriting the file (because thrift uses a lot of absolute file offsets, and those offsets, in the portions of the file you are not changing, would become invalid), but I have not investigated it thoroughly enough myself.

> Speaking of this... is it good practice to use row groups instead of hive partitions, or is that considered an anti-pattern for parquet?

There are pros and cons to both. Row groups can be more flexible than hive partitions (e.g. each row group contains statistics for ALL columns, not just some, and row group filters can include things like bloom filters). However, hive partitions support append operations (you can always add more files to the month=July folder, but you can't add more data to an existing row group).
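
For illustration, row-group statistics are already exposed through pyarrow's metadata, so a reader can prune row groups before decoding anything. A minimal sketch, assuming a hypothetical file whose first column holds a sortable date:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # hypothetical file
keep = []
for i in range(pf.metadata.num_row_groups):
    # Column 0 is assumed to hold a sortable date value.
    stats = pf.metadata.row_group(i).column(0).statistics
    # Keep row groups that may contain dates >= 20230517; if statistics
    # are missing, the group must be kept to be safe.
    if stats is None or not stats.has_min_max or stats.max >= 20230517:
        keep.append(i)
table = pf.read_row_groups(keep)  # decode only the surviving groups
```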

westonpace commented 1 year ago

> Would it be a good addition to pyarrow dataset to optionally ensure that each parquet row group contains only one partition?

I'm not sure I understand what you are suggesting.

alippai commented 1 year ago

Setting partitionby='rowgroups' on write_table so it'd write the following (a workaround sketch follows the two listings):

Rowgroup 1:
  ...
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
Rowgroup 2:
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  ...

Instead of the current behavior (based on the row count limit):

Rowgroup 1:
  ...
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
Rowgroup 2:
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  ...
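
A possible workaround sketch in the meantime: pyarrow's ParquetWriter.write_table does not merge data across calls, so writing one partition slice per call keeps any row group from spanning two partition values (column names and data are illustrative):

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pa.table({
    "date": [20230517] * 3 + [20230518] * 3,
    "value": [1, 2, 3, 4, 5, 6],
})

with pq.ParquetWriter("out.parquet", table.schema) as writer:
    # One write_table call per partition value; data is never merged
    # across calls, so no row group spans two dates.
    for date in pc.unique(table["date"]).to_pylist():
        writer.write_table(table.filter(pc.equal(table["date"], date)))
```
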
alippai commented 1 year ago

@westonpace reading the parquet thrift docs, the naive approach would be to keep only the buffers and statistics and recreate everything else. I didn't know parquet worked like this, thanks for the insight!

My goal is slightly different from deltalake and the others (and I'm also not a fan of JVM-based setups for this kind of workload). My idea was to rely less on the traditional FS and use the internal structure of parquet more, for the very reason you mentioned (filters, statistics). Architecturally, Skyhook would be closer to this, or "simply" storing all the metadata + statistics in TiKV or another kv store.
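
To make the "keep the buffers and statistics" idea concrete: pyarrow already exposes the thrift-level offsets needed to locate a column chunk's raw bytes, and those absolute offsets are also what a byte-splicing rewrite would have to fix up. A minimal sketch, with a hypothetical file name:

```python
import pyarrow.parquet as pq

md = pq.ParquetFile("data.parquet").metadata  # hypothetical file
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        cc = md.row_group(rg).column(col)
        # Absolute file offsets -- the values that would have to be
        # fixed up when splicing this chunk into a new file.
        start = (cc.dictionary_page_offset if cc.has_dictionary_page
                 else cc.data_page_offset)
        print(cc.path_in_schema, start, cc.total_compressed_size,
              cc.statistics)
```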

mapleFU commented 1 year ago

@alippai I guess it "can" be a better solution, because splitting partitions into different row groups lets the reader prune unnecessary row groups. But I don't know whether the current implementation supports it.

westonpace commented 1 year ago

Ok, I think I understand better now. I misread this request originally and didn't fully realize that you want to create a new parquet file. I thought you were trying to modify the existing parquet file.

Yes, this makes sense. No, I'm not sure the capability is really there, but some of it might be.

The parquet library always decodes its data, as best I can tell. There are some underlying structures like the PageReader which might not. However, there is nothing at the level of "read this row group and append it to another file without decoding".

alippai commented 1 year ago

If I'm right, @tustvold created similar low-level interfaces. I'm still looking for the exact MR, but maybe he can share what level of abstraction worked well in the Rust implementation.

tustvold commented 1 year ago

https://github.com/apache/arrow-rs/pull/4269 is the PR. Not sure how transferable it is to C++, it is somewhat coupled with the way the write path works, but the basic idea is to allow appending an entire column chunk to a row group.

https://github.com/apache/arrow-rs/pull/4274 contains an example of how to use this to efficiently concatenate files

alippai commented 1 year ago

@tustvold @mapleFU @westonpace (and many others): the speed at which you are adding new parquet features is amazing. Maybe we should start adding a matrix for arrow, arrow-rs, arrow2 (rs), parquet-mr, duckdb to https://arrow.apache.org/docs/status.html so we know which statistics and bloom filters are read and written, and which operations are available.

Would you be supportive, or is it not the right time now? I can start the MR.

tustvold commented 1 year ago

I think adding documentation of parquet support within the various arrow projects makes sense to me. https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ might serve as some inspiration for further features beyond the obvious support for encoding X or data type Y.

I'm less sure that we should endeavor to maintain up-to-date feature support information for readers outside the arrow umbrella, e.g. parquet-mr, duckdb, arrow2, etc...

alippai commented 1 year ago

Indeed, I didn't realize that's not covered by the current docs. I also favor less work and more consistency.

westonpace commented 1 year ago

+1 to adding this table somewhere (also, yes, big thanks to @mapleFU and @wgtmac for the recent work). A good first pass would be for each implementation to document what it supports locally (e.g. arrow-c++ could add to https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features and arrow-rs could add somewhere in https://docs.rs/parquet/latest/parquet/arrow/index.html)

If we are going to combine them in a table somewhere then maybe we could add to somewhere on https://parquet.apache.org/docs/overview/

That would allow other parquet implementations to contribute their feature lists if they chose, and it might be more appropriate than https://arrow.apache.org/docs/status.html

Although I have no write privileges over there :shrug:, so if we want something more local that would probably be ok.

wgtmac commented 1 year ago

I did similar work in the parquet-mr repo to merge row groups from different parquet files into a single parquet file without decompression or decoding (with some supported transformations like re-compression, encryption, or dropping columns).

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java

Is that what you'd like to have in parquet-cpp? @alippai

alippai commented 1 year ago

@wgtmac In this issue I was looking for a simpler function: appending a new RowGroup (or copying a row group) without merging, or deleting/replacing row groups without materializing the whole file as an Arrow Table.

Overall I think a public row-group-level API (what you have in parquet-mr) and a page-level API (what @tustvold created for Rust) make sense (without decoding, decompression, statistics or bloom filter recalculation, etc.).

wgtmac commented 1 year ago

> @wgtmac In this issue I was looking for a simpler function: appending a new RowGroup (or copying a row group) without merging, or deleting/replacing row groups without materializing the whole file as an Arrow Table.
>
> Overall I think a public row-group-level API (what you have in parquet-mr) and a page-level API (what @tustvold created for Rust) make sense (without decoding, decompression, statistics or bloom filter recalculation, etc.).

Yes, I understand your use case. Appending to or modifying a parquet file would require the file system to support mutation or append operations, which is not a typical use case. So merging several parquet files directly at the row-group level seems more generic and can be an alternative solution in your case.
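
A sketch of such a merge as it can be done with pyarrow today, going through Arrow because a no-decode path is not exposed; the input files are hypothetical and assumed to share a schema:

```python
import pyarrow.parquet as pq

inputs = ["day1.parquet", "day2.parquet", "day3.parquet"]  # hypothetical
schema = pq.ParquetFile(inputs[0]).schema_arrow

with pq.ParquetWriter("merged.parquet", schema) as writer:
    for path in inputs:
        pf = pq.ParquetFile(path)
        for i in range(pf.num_row_groups):
            # Row-group boundaries are preserved, but each group is
            # still decoded to Arrow and re-encoded on the way through.
            writer.write_table(pf.read_row_group(i))
```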

vinothchandar commented 11 months ago

@wgtmac For the rewriting, is there any advantage to using Arrow over parquet-mr? IIUC, you decode the pages there lazily and write them back (with or without modifications). Maybe for vectorized transformation of an entire page, e.g. x = x + 1 on column x?

wgtmac commented 11 months ago

> @wgtmac For the rewriting, is there any advantage to using Arrow over parquet-mr? IIUC, you decode the pages there lazily and write them back (with or without modifications). Maybe for vectorized transformation of an entire page, e.g. x = x + 1 on column x?

I don't think there is a significant difference between Arrow and parquet-mr if pages do not need any modification. When re-compression and/or re-encoding is applied, it would be more performant to go with Arrow.

vinothchandar commented 11 months ago

Thanks. @westonpace Any guidance/pointers for someone wanting to take this forward? Does it make sense to add this to Arrow?

wgtmac commented 11 months ago

> Thanks. @westonpace Any guidance/pointers for someone wanting to take this forward? Does it make sense to add this to Arrow?

Just curious: is there any plan to add a similar optimization to Apache Hudi? Our old friends at Uber have done a great job: https://www.uber.com/en-HK/blog/fast-copy-on-write-within-apache-parquet/. @vinothchandar

westonpace commented 11 months ago

> Thanks. @westonpace Any guidance/pointers for someone wanting to take this forward? Does it make sense to add this to Arrow?

I am not familiar enough with the code in parquet-c++ to give much advice going forward (@wgtmac and @mapleFU may have opinions). I think it makes sense as a parquet-c++ feature, but probably not as an arrow feature (as you wouldn't need any arrow arrays).

vinothchandar commented 11 months ago

@wgtmac We have an implementation using parquet-mr in the community. I am trying to consolidate all these efforts - ours, parquet-mr's - and understand the plans in Arrow, as we'd like to embrace Arrow (in place of Avro in Hudi 1.0). We can jam more on the Hudi Slack if the parquet-mr piece interests you. cc @yihua

Thanks @westonpace. I'll wait to hear more opinions.

wgtmac commented 11 months ago

Sure, that sounds interesting! Let's discuss that further, @vinothchandar.