Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Getting 'DaftError::ArrowError External format error: Operation would exceed memory use threshold' #1551

Closed: dioptre closed this issue 7 months ago

dioptre commented 7 months ago

Describe the bug: Can't convert a Daft DataFrame to Arrow; it also fails on collect().

To Reproduce: Read 2x 7.5MB Parquet files.
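
A minimal sketch of the failure (the file paths are placeholders for the two files above):

```python
import daft

# Read the two ~7.5MB Parquet files (placeholder paths)
df = daft.read_parquet(["file1.parquet", "file2.parquet"])

# Both of these raise:
# DaftError::ArrowError External format error:
#   Operation would exceed memory use threshold
df.collect()
table = df.to_arrow()
```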

jaychia commented 7 months ago

Hi @dioptre! Thanks for making an issue

I looked a little into the error message, and it seems that we currently allocate a maximum of 4MB per page when decoding Parquet pages, in order to bound memory usage. It is likely that your files contain some Parquet pages that exceed this limit, which is quite unusual: most writers are relatively well behaved and try to right-size the number of rows in a page so that each page is approximately 1MB.
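
If you want to check whether a file contains oversized pages, one rough approach is to inspect the column chunk sizes recorded in the Parquet footer with pyarrow. This is only a sketch: it reports per-column-chunk sizes, which upper-bound the page sizes, and the path is a placeholder.

```python
import pyarrow.parquet as pq

# Scan the Parquet footer for unusually large column chunks.
# A chunk far above a few MB suggests oversized pages inside it.
meta = pq.ParquetFile("file1.parquet").metadata  # placeholder path
for rg in range(meta.num_row_groups):
    row_group = meta.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        size_mb = chunk.total_uncompressed_size / 1e6
        print(f"row group {rg}, {chunk.path_in_schema}: {size_mb:.1f} MB uncompressed")
```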

dioptre commented 7 months ago

Thanks @jaychia, I sent a link to the files to your personal Gmail.

The version is 0.1.20.

They were written by a third party; the data includes a MongoDB backup.

It's very unlikely that a single cell could exceed 4MB, but we'd like to support this.

https://drive.google.com/drive/u/2/folders/1cpc11YZKX7s-He3DK7ikjJiWxKJ_IPEf

dioptre commented 7 months ago

Also, I can confirm that I tried 3 other loading libraries, and all of them worked except Daft.

For comparison, import pyarrow.parquet as pq was able to read the files.
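
The pyarrow comparison was along these lines (a sketch; the path is a placeholder):

```python
import pyarrow.parquet as pq

# Reading the same file directly with pyarrow succeeds,
# which narrowed the problem down to Daft's reader.
table = pq.read_table("file1.parquet")  # placeholder path
print(table.num_rows)
```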

jaychia commented 7 months ago

> Also, I can confirm that I tried 3 other loading libraries, and all of them worked except Daft.
>
> For comparison, import pyarrow.parquet as pq was able to read the files.

Thanks! I just confirmed that one of the files has a huge 15MB page. This was likely the cause of the issue.

I'll go ahead and bump our max page size limit to something much higher. Look out for a release later today!

dioptre commented 7 months ago

Thanks!

dioptre commented 7 months ago

I think this probably relates to my last question: I would love to cap the size of Parquet files, not the size of cells. It would be great to be able to limit a Parquet file to an arbitrary size (1GB, etc.) and leave the page size bounded only by available memory / the configured memory limit.
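
For the write side, if the files were produced with pyarrow, the knobs that exist today control page and row group sizes rather than total file size. A sketch under that assumption (the parameter names are pyarrow's, and the table/path are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table standing in for real data.
table = pa.table({"id": [1, 2, 3]})

# data_page_size caps the approximate bytes per data page;
# row_group_size caps the rows per row group. Neither caps the
# total file size, so a 1GB-per-file limit still has to be
# implemented by splitting the table across multiple files.
pq.write_table(
    table,
    "output.parquet",            # placeholder path
    data_page_size=1024 * 1024,  # target ~1MB pages
    row_group_size=128 * 1024,   # rows per row group
)
```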