dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

memory usage keeps increasing when writing #286

Open robertdavidwest opened 6 years ago

robertdavidwest commented 6 years ago

Hey there,

I'm writing a very large pandas dataframe to Parquet on S3 with fastparquet. I'm finding that as it writes, memory usage keeps increasing dramatically. Instinctively something feels off here. The dataframe in memory is around 100GB. I'm about 12 hours into the write, with 344,610 partitions written to disk so far, and memory usage is now up to 336GB.

import pandas as pd
from fastparquet import write
from s3fs import S3FileSystem

s3 = S3FileSystem(s3_additional_kwargs={'ServerSideEncryption': 'AES256'})

outpath = 's3://...'      # target S3 prefix
df = pd.DataFrame(...)    # ~100GB dataframe built elsewhere

write(outpath,
      data=df,
      file_scheme='hive',
      object_encoding='bytes',
      partition_on=['A', 'B', 'C'],
      open_with=s3.open)

I'll see if I can recreate this issue with a random test dataframe. Python 2.7, pandas 0.22.0, fastparquet 0.1.3, s3fs 0.1.2.
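
For reference, a minimal sketch of such a reproduction (the column names, cardinalities, and row count are assumptions, and it writes locally rather than to S3 to take s3fs out of the picture):

import numpy as np
import pandas as pd
from fastparquet import write

# Synthetic frame with three partition columns whose combined cardinality
# is high, so partition_on=['A', 'B', 'C'] yields many tiny partitions.
n = 1000000
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'A': rng.randint(0, 100, n).astype(str),
    'B': rng.randint(0, 100, n).astype(str),
    'C': rng.randint(0, 50, n).astype(str),
    'value': rng.randn(n),
})

# Write to local disk; if memory still grows per partition, s3fs is not the cause.
write('test_output', df,
      file_scheme='hive',
      partition_on=['A', 'B', 'C'])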

martindurant commented 6 years ago

I would suggest that your partition size is much too small: ~3MB, where I would aim for >200MB. For each partition that is written, the metadata is stored in memory until all have been accumulated, and then the special metadata file is written. I don't know how big the metadata for each chunk will be in your case (it depends on the number of columns and other factors), but its memory use can be substantial.
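
As a rough pre-flight check (a sketch, assuming the partition columns from the snippet above), you can estimate how many hive partitions will be created and their average in-memory size before committing to a long write:

# Number of distinct ('A', 'B', 'C') combinations = number of hive partitions.
n_partitions = len(df.groupby(['A', 'B', 'C']).size())

# Average in-memory size per partition; aim for hundreds of MB, not single-digit MB.
total_bytes = df.memory_usage(deep=True).sum()
avg_partition_mb = float(total_bytes) / n_partitions / 2 ** 20
print("%d partitions, ~%.1f MB each" % (n_partitions, avg_partition_mb))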

In addition, your experience suggests that some of the data itself is also being held onto during writing, and the garbage collector is not successfully releasing the memory. It is not surprising that each chunk of data uses a few times the original memory as it is converted from pandas columns to bytes and written, but that should be freed as the code moves on to the next chunk. If you can memory-profile your code, it would be very useful to find out how this is happening.
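
One way to do that kind of profiling (a sketch, assuming the memory_profiler package is available) is to wrap the write call and watch per-line memory increments, or to sample the whole run with mprof:

from memory_profiler import profile
from fastparquet import write

@profile  # prints line-by-line memory usage and increments when the function runs
def write_partitioned(df, outpath, open_with):
    write(outpath, df,
          file_scheme='hive',
          object_encoding='bytes',
          partition_on=['A', 'B', 'C'],
          open_with=open_with)

# Alternatively, sample the whole script over time and plot it:
#   mprof run my_write_script.py
#   mprof plot

A trace that climbs steadily with each partition written would point at metadata (or data) being retained, rather than a one-off spike during conversion.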

robertdavidwest commented 6 years ago

Thank you, that's a helpful rule of thumb for the partition size. Perhaps I'll just try partitioning on two of the three columns.
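
For reference, a minimal sketch of that change, reusing the names from the original snippet: dropping 'C' from partition_on so each hive partition (and its footer metadata) becomes much larger.

write(outpath,
      data=df,
      file_scheme='hive',
      object_encoding='bytes',
      partition_on=['A', 'B'],
      open_with=s3.open)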