eitsupi opened this issue 2 years ago

Having found the following description in the documentation, I tried scanning a dataset larger than memory and writing it out to another dataset:

https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data

However, both Python and R crashed on Windows due to lack of memory. Am I missing something? Is there a recommended way to convert one dataset to another without running out of memory?
Hi @eitsupi,
Depending on your memory restrictions, you may need to control the batch size (how many rows are loaded at once) on the scanner:
import pyarrow.dataset as ds
input_dataset = ds.dataset("input")
scanner = input_dataset.scanner(batch_size=100_000)  # default is 1_000_000
ds.write_dataset(scanner.to_reader(), "output", format="parquet")
Does that help in your use case?
At the moment we generally use too much memory when scanning Parquet. This is because the scanner's readahead is unfortunately based on the row group size and not the batch size. Using smaller row groups in your source files will help. #12228 changes the readahead to be based on the batch size, but it's been on my back burner for a bit. I'm still optimistic I will get to it for the 8.0.0 release.
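For example, here is a minimal sketch of rewriting one source file with smaller row groups (assuming pyarrow.parquet, a hypothetical file path, and a file small enough to read into memory at once):

import pyarrow.parquet as pq

table = pq.read_table("input/part-0.parquet")  # hypothetical source file
# row_group_size caps the number of rows per row group in the rewritten file,
# which bounds how much the scanner's readahead pulls in at a time
pq.write_table(table, "input_small_groups/part-0.parquet", row_group_size=100_000)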
Thank you both. I tried lowering the batch size to 1000 in Python, but it still consumed over 3GB of memory and crashed.
I will wait for the 8.0.0 release to try this again.
I tried pyarrow 8.0.0 and unfortunately it still crashes.
I'm finding the same thing in pyarrow 8.0.0 while converting from CSV to Parquet. I've tried various batch sizes on the scanner and various min/max rows per group on the writer. Running in a container, the memory usage climbs to the maximum and the process eventually crashes.
from pathlib import Path

import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds

tsv_directory_path = Path("/dir/with/tsv")
read_schema = pa.schema([...])

# Scan the TSV files as a dataset with an explicit schema
input_tsv_dataset = ds.dataset(
    tsv_directory_path,
    schema=read_schema,
    format=ds.CsvFileFormat(
        parse_options=csv.ParseOptions(delimiter="\t", quote_char=False)
    ),
)
scanner = input_tsv_dataset.scanner(batch_size=100)
# Write out a partitioned Parquet dataset with small files and row groups
ds.write_dataset(
    scanner,
    "output_directory.parquet",
    format="parquet",
    max_rows_per_file=10000,
    max_rows_per_group=10000,
    min_rows_per_group=10,
)
Using csv.open_csv and pq.ParquetWriter to write batches works fine, but it results in a single large file.
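For reference, here is a minimal sketch of that single-file approach (assuming pyarrow 8.x; paths and parse options are placeholders):

import pyarrow.csv as csv
import pyarrow.parquet as pq

# open_csv returns a streaming reader that yields record batches,
# so the whole input file is never materialized in memory at once
reader = csv.open_csv(
    "input.tsv",
    parse_options=csv.ParseOptions(delimiter="\t", quote_char=False),
)
with pq.ParquetWriter("output.parquet", reader.schema) as writer:
    for batch in reader:
        writer.write_batch(batch)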
I am noticing the same issue with pyarrow 8.0.0: memory usage steadily increases to over 10GB while reading batches from a 15GB Parquet file, even with a batch size of 1. The rows in this dataset vary a fair bit in size, but not enough to require that much RAM.
For what it's worth, I've found that passing use_threads=False as an argument to scanner prevents the memory footprint from growing as large (not growing past ~3GB in this case, though still fluctuating by a fair bit), after noticing that this implicitly disables both batch and fragment readahead. The performance penalty isn't particularly large, especially with bigger batch sizes, so this may be a temporary workaround for those wishing to keep memory usage low.
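A minimal sketch of that workaround (the dataset path is hypothetical):

import pyarrow.dataset as ds

dataset = ds.dataset("input")  # hypothetical input dataset
# use_threads=False implicitly disables batch and fragment readahead,
# trading some throughput for a smaller, steadier memory footprint
scanner = dataset.scanner(batch_size=100_000, use_threads=False)
ds.write_dataset(scanner.to_reader(), "output", format="parquet")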