hendrikmakait opened 9 months ago
Query 18 most likely dies because our source dataset is weird: some files take 50 MB in memory while others take 380 MB. The latter is relatively big for our small machines (8 GB of RAM). This is made worse by our strategy of combining multiple partitions when we drop columns; we end up merging a few large partitions, which makes them even bigger.
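For reference, one way to confirm the skew is to measure each partition's in-memory footprint directly. A minimal sketch, assuming a hypothetical dataset path and dask.dataframe's `memory_usage_per_partition`:

```python
import dask.dataframe as dd

# Hypothetical path: point this at the actual source dataset.
df = dd.read_parquet("s3://bucket/tpch/lineitem/")

# One value per partition, in bytes; deep=True also counts
# object-dtype payloads such as strings.
sizes = df.memory_usage_per_partition(deep=True).compute()
print((sizes / 2**20).describe())  # summary statistics in MiB
```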
I don't know exactly how we want to proceed, but the varying partition sizes are probably not a good fit for what we want to do here.
Edit: This is not compression-related; the size difference is also reflected in the compressed file sizes.
Varying partition sizes are very realistic, and we shouldn't micro-optimize our code to only run on extremely homogeneous datasets.
I agree, but this is hard to change with the current read_parquet.
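One possible workaround downstream of read_parquet might be to rebalance partitions after reading. A sketch assuming dask.dataframe's `repartition(partition_size=...)`, which estimates in-memory sizes and therefore adds some upfront work:

```python
import dask.dataframe as dd

# Hypothetical path; adjust to the actual source dataset.
df = dd.read_parquet("s3://bucket/tpch/lineitem/")

# Rebalance to roughly uniform in-memory partition sizes.
# This estimates per-partition memory usage, so it is not free,
# but it avoids a few oversized partitions dominating small workers.
df = df.repartition(partition_size="100MB")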
See https://github.com/coiled/benchmarks/issues/1376 for queries 17 and 18.
At scale 1000, all of these queries have workers getting restarted after running out of memory. We should investigate the cause and see if we're missing optimizations, have chosen a poor join order, or whether there are any other issues with these queries.
- query_9
- query_11
- query_13
- query_17
- query_18
- query_19
Note that query_21 is excluded from this list due to #1362.
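For investigating the out-of-memory restarts, distributed's `MemorySampler` can record cluster-wide memory usage over a query run, which helps tell a steady climb (poor join order, missing optimization) from a sudden spike (one oversized partition). A minimal sketch, where `run_query_9` is a hypothetical stand-in for the benchmark query:

```python
from distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # or connect to the cluster used for the benchmarks

ms = MemorySampler()
with ms.sample("query_9"):
    run_query_9(client)  # hypothetical: run the query under test

ms.plot(align=True)  # cluster-wide memory over time, per sample label
```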