apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

API to get memory usage for parquet ArrowWriter #5851

Closed: alamb closed this issue 2 months ago

alamb commented 3 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When writing parquet files, depending on the writer settings and the data being written, we have observed the ArrowWriter consuming large amounts of memory (tens of GB); see https://github.com/apache/arrow-rs/issues/5828

The memory usage of parquet writers also often comes up in the context of proposals for new parquet formats

There is already a discussion about how to limit memory when writing here https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#memory-limiting

However, there is currently no way to measure the actual current memory use (which we could use to abort the write, for example).

Describe the solution you'd like

I would like some way to see the current memory usage of the internal buffering in the parquet writer.

Describe alternatives you've considered

I propose adding a function to ArrowWriter modeled on Array::get_array_memory_size

impl ArrowWriter {
  /// Returns an estimate of how much memory the
  /// writer is currently using in its internal buffers.
  fn memory_size(&self) -> usize { ... }
  ...
}

Additional context

Here is one ticket that describes one non-trivial source of memory usage: https://github.com/apache/arrow-rs/issues/5828, so the indices should be included.
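As a rough illustration of how such a measurement could be used to abort a write, here is a minimal, self-contained sketch. `ToyWriter` and `write_with_budget` are hypothetical names invented for this example; they are not part of the parquet crate, and the accounting only mimics the idea of the proposed `memory_size`.

```rust
/// Hypothetical stand-in for a buffering writer. This is NOT the real
/// parquet `ArrowWriter`; it only illustrates the proposed accounting.
struct ToyWriter {
    /// Values held in memory until the next flush.
    buffered: Vec<Vec<u8>>,
    /// Total bytes handed to the (pretend) output so far.
    flushed_bytes: usize,
}

impl ToyWriter {
    fn new() -> Self {
        Self { buffered: Vec::new(), flushed_bytes: 0 }
    }

    /// Buffer one value without writing it out.
    fn write(&mut self, value: &[u8]) {
        self.buffered.push(value.to_vec());
    }

    /// Estimate of the memory held by internal buffers, in bytes.
    /// Counts allocated capacity, not just the bytes in use.
    fn memory_size(&self) -> usize {
        let outer = self.buffered.capacity() * std::mem::size_of::<Vec<u8>>();
        let inner: usize = self.buffered.iter().map(|v| v.capacity()).sum();
        outer + inner
    }

    /// Pretend to write the buffered values and release their memory.
    fn flush(&mut self) {
        self.flushed_bytes += self.buffered.iter().map(|v| v.len()).sum::<usize>();
        self.buffered = Vec::new();
    }
}

/// Buffer every value, aborting as soon as the writer reports more
/// memory than `budget` bytes. Flushes once if the budget is never exceeded.
fn write_with_budget(
    writer: &mut ToyWriter,
    values: &[&[u8]],
    budget: usize,
) -> Result<(), String> {
    for v in values {
        writer.write(v);
        let used = writer.memory_size();
        if used > budget {
            return Err(format!(
                "aborting write: {used} bytes buffered exceeds budget of {budget}"
            ));
        }
    }
    writer.flush();
    Ok(())
}
```

A caller could equally flush early instead of aborting when the budget is exceeded; which policy is appropriate depends on the application.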

alamb commented 3 months ago

BTW, there is already https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#method.in_progress_size, but that only accounts for the actual parquet data in progress, not any internal buffering structures in the writer itself.
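To make the distinction concrete, here is a small sketch of the two kinds of accounting. `ToyColumnBuffer` and its fields are hypothetical and exist only for illustration; the method names echo the ones discussed in this issue, but the bodies are assumptions, not the parquet crate's implementation.

```rust
/// Hypothetical column buffer; the field names are illustrative only.
struct ToyColumnBuffer {
    /// Encoded bytes waiting to be flushed.
    encoded: Vec<u8>,
    /// Auxiliary structure (for example, a dictionary) kept alongside the data.
    dictionary: Vec<u64>,
}

impl ToyColumnBuffer {
    /// Pre-allocate room for `entries` dictionary slots.
    fn with_dictionary_capacity(entries: usize) -> Self {
        Self {
            encoded: Vec::new(),
            dictionary: Vec::with_capacity(entries),
        }
    }

    /// Append already-encoded bytes to the in-progress data.
    fn push(&mut self, bytes: &[u8]) {
        self.encoded.extend_from_slice(bytes);
    }

    /// Size of the encoded data only, ignoring auxiliary structures.
    fn in_progress_size(&self) -> usize {
        self.encoded.len()
    }

    /// Allocated memory, including auxiliary structures and spare capacity.
    fn memory_size(&self) -> usize {
        self.encoded.capacity()
            + self.dictionary.capacity() * std::mem::size_of::<u64>()
    }
}
```

In this toy, a buffer holding only a few encoded bytes can still report a large `memory_size` because of the pre-allocated dictionary, which is the gap the issue describes.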

Rachelint commented 3 months ago

can I take it?

alamb commented 3 months ago

That would be amazing -- thank you @Rachelint

alamb commented 3 months ago

I'll be on the lookout for a PR -- please ping me when you are ready for feedback

Rachelint commented 3 months ago

> I'll be on the lookout for a PR -- please ping me when you are ready for feedback

ok!

alamb commented 3 months ago

@wiedld made a PR for this feature: https://github.com/apache/arrow-rs/pull/5967 as well

Rachelint commented 3 months ago

> @wiedld made a PR for this feature: #5967 as well

OK. I had planned to write it over the weekend, so there is no code yet.

alamb commented 3 months ago

>> @wiedld made a PR for this feature: #5967 as well

> OK. I had planned to write it over the weekend, so there is no code yet.

Thanks @Rachelint -- maybe you could help review https://github.com/apache/arrow-rs/pull/5967 if you are still interested

wiedld commented 3 months ago

Draft PR is up and undergoing code review. Please assign to me.