coiled / benchmarks

BSD 3-Clause "New" or "Revised" License

[TPC-H] Workers get restarted after running out of memory during multiple queries at scale 1000 #1367

Open hendrikmakait opened 9 months ago

hendrikmakait commented 9 months ago

At scale 1000, all of these queries have workers getting restarted after running out of memory. We should investigate the cause and see if we're missing optimizations, have chosen a poor join order, or whether there are any other issues with these queries.

phofl commented 9 months ago

Query 18 most likely dies because our source dataset is weird. We have files that take 50 MB in memory and files that take 380 MB in memory. The latter is relatively large for our small machines (8 GB of RAM). This gets worse through our strategy of combining multiple partitions when we drop columns: we end up combining a few large ones, which makes them even bigger.

I don't know how exactly we want to proceed, but the varying partition sizes are probably not very good for what we want to do here.

Edit: This is not compression-related; the same size difference shows up in the compressed file sizes.
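To make the comment above concrete, here is some back-of-the-envelope arithmetic. The 50 MB / 380 MB partition sizes and the 8 GB worker RAM come from the comment; the combine factor of 4 is an illustrative assumption, not a number measured from the benchmark:

```python
# Rough arithmetic showing why combining large partitions is risky on
# small (8 GB) workers. Sizes are in-memory sizes from the comment above;
# the combine factor is an illustrative assumption.
worker_ram_mb = 8 * 1024          # 8 GB workers
small_mb, large_mb = 50, 380      # observed per-file in-memory sizes

def combined_size(part_mb, factor):
    """In-memory size after concatenating `factor` equally sized partitions."""
    return part_mb * factor

# Dropping columns triggers combining several partitions into one.
merged = combined_size(large_mb, 4)   # e.g. four large partitions merged
print(merged)                          # 1520 MB in a single partition
print(merged / worker_ram_mb)          # ~0.19 of worker RAM for ONE partition
```

A single 1.5 GB partition on an 8 GB worker leaves little headroom once several such partitions (plus join intermediates) are in memory at the same time.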

fjetter commented 9 months ago

Varying partition sizes are very realistic, and we shouldn't micro-optimize our code to only run on extremely homogeneous datasets.

phofl commented 9 months ago

I agree, but this is hard to change with the current read_parquet.
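One direction a size-aware combining strategy could take is greedy packing of files into partitions under a memory cap. This is only a sketch of the idea, not the actual dask/read_parquet logic; the cap value and function name are made up for illustration:

```python
def pack_files(sizes_mb, cap_mb=128):
    """Greedily group input files into partitions whose combined in-memory
    size stays under cap_mb (hypothetical sketch, not dask's real logic)."""
    groups, current, total = [], [], 0
    for s in sizes_mb:
        # Start a new partition if adding this file would exceed the cap.
        if current and total + s > cap_mb:
            groups.append(current)
            current, total = [], 0
        current.append(s)
        total += s
    if current:
        groups.append(current)
    return groups

print(pack_files([50, 50, 380, 50, 380], cap_mb=200))
# → [[50, 50], [380], [50], [380]]
```

An oversized file (380 MB) still ends up as its own too-large partition, which is why capping alone doesn't fully solve the problem without also splitting large files (e.g. at row-group boundaries).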

phofl commented 9 months ago

See https://github.com/coiled/benchmarks/issues/1376 for queries 17 and 18.