Closed dioptre closed 7 months ago
Hi @dioptre! Thanks for making an issue
I looked a little into the error message, and it seems that we currently allocate a maximum of 4MB per page when decoding Parquet pages, to bound memory usage. It is likely that your files contain some Parquet pages that exceed this, which is quite unusual: most writers are relatively well behaved and try to right-size the number of rows in a page so that each page is approximately 1MB.
Thanks @jaychia I sent a link to the files to your personal gmail.
0.1.20 is the version
They were written by a third party, the data includes some mongo database backup.
It's very unlikely that a single cell could exceed 4MB, but we'd like to support this.
https://drive.google.com/drive/u/2/folders/1cpc11YZKX7s-He3DK7ikjJiWxKJ_IPEf
Also, I can confirm I tried 3 other loading libraries and all worked except Daft. Reading the same files with `import pyarrow.parquet as pq` worked fine as a comparison.
Thanks! I just managed to confirm that one of the files had a huuuuge 15MB page. This was likely the cause of the issue.
I'll go ahead and bump our page max size limit to be much higher. Look out for a release later today!
Thanks!
I think this probably relates to my last question: I would love to cap the size of Parquet files, not the size of cells. It would be great to arbitrarily limit the Parquet file size to 1GB etc., and leave the page size to whatever fits in available memory (or a configured max memory).
**Describe the bug**
Can't convert Daft to Arrow; it also fails on `collect`.

**To Reproduce**
Using 2x 7.5MB Parquet files