apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

[C++] Add an API to allow writing RowGroups based on their size rather than num_rows #2208

Closed asfimport closed 6 years ago

asfimport commented 6 years ago

The current API allows writing RowGroups with a specified number of rows, but it does not allow writing RowGroups of a specified size. To write RowGroups of a specified size, we need to write rows in chunks and check total_bytes_written after each chunk is written. This is currently impossible because the call to NextColumn() closes the current column writer.

Reporter: Anatoli Shein
Assignee: Deepak Majeti / @majetideepak


Note: This issue was originally created as PARQUET-1372. Please see the migration documentation for further details.

asfimport commented 6 years ago

Renato Javier Marroquín Mogrovejo / @renato2099: This is a very useful feature indeed, [~anatoli.shein]! Just one quick clarification question: are you planning to implement fixed row group sizes, i.e., all of them being, for example, 8MB? If so, how are you planning to deal with record boundaries? By cutting records off, or by making each row group approximately the configured size? Thanks!

asfimport commented 6 years ago

Anatoli Shein: @renato2099, thanks! So the current plan is to make each row group approximately the given size without going over, while each row group should also have at least one record.
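For illustration, the policy described above (approximately the given size, never going over, at least one record per row group) could be sketched as below. This is only a standalone sketch of the splitting rule, not the Parquet API: `PlanRowGroups`, `record_bytes`, and `target_bytes` are hypothetical names, and the per-record sizes are assumed to be known up front, whereas a real writer would have to check the bytes written after each chunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the Parquet C++ API.
// Given the (assumed known) encoded size of each record, returns how many
// records go into each row group so that:
//   - no row group exceeds target_bytes, unless a single record alone does;
//   - every row group contains at least one record;
//   - records are never split across row groups.
std::vector<std::size_t> PlanRowGroups(
    const std::vector<std::int64_t>& record_bytes,
    std::int64_t target_bytes) {
  std::vector<std::size_t> rows_per_group;
  std::size_t rows_in_group = 0;
  std::int64_t bytes_in_group = 0;

  for (std::int64_t size : record_bytes) {
    // Close the current group if adding this record would go over the
    // target. An empty group always accepts the record, which guarantees
    // at least one record per row group.
    if (rows_in_group > 0 && bytes_in_group + size > target_bytes) {
      rows_per_group.push_back(rows_in_group);
      rows_in_group = 0;
      bytes_in_group = 0;
    }
    ++rows_in_group;
    bytes_in_group += size;
  }

  // Flush the last, possibly partially filled, group.
  if (rows_in_group > 0) {
    rows_per_group.push_back(rows_in_group);
  }
  return rows_per_group;
}
```

With a target of 8 bytes and record sizes 4, 4, 4, 10, 1, this yields row groups of 2, 1, 1, and 1 records: the 10-byte record exceeds the target on its own, so it gets a row group to itself rather than being cut off.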

asfimport commented 6 years ago

Uwe Korn / @xhochy: Issue resolved by pull request 484 https://github.com/apache/parquet-cpp/pull/484