[TPC-H] DuckDB failures in benchmarks cascade

hendrikmakait commented 7 months ago

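While running TPC-H benchmarks at scale 1000 with DuckDB, I've noticed that failures cascade: once one query fails, the subsequent tests fail as well. In the logs below (newest entry first), query 9 fails with an out-of-memory error, and queries 10 through 22 then all fail with `RuntimeError('Resource temporarily unavailable')`.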

Cluster: https://cloud.coiled.io/clusters/383513/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=

2024-02-13 18:22:04.1830
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-72060fd5fcc65f1622e4f50f91a914b9
Function:  _run
args:      (<function test_query_22.<locals>._ at 0x7ed6c7f0a8e0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:03.9410
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-f550c43b84d01df05082bff5dd59362b
Function:  _run
args:      (<function test_query_21.<locals>._ at 0x7ed6c7f0af20>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:03.6170
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-d10da49c53a3d87187dab382106bacb1
Function:  _run
args:      (<function test_query_20.<locals>._ at 0x7ed6c7f09b20>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:03.3370
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-0e5ce454b9399d64e15ffc8a5a04884f
Function:  _run
args:      (<function test_query_19.<locals>._ at 0x7ed6c7f0a520>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:03.0680
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-1ad7b09ea882f349b71a4b294d416889
Function:  _run
args:      (<function test_query_18.<locals>._ at 0x7ed6c7f087c0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:02.8610
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-e4dd67a2e1fbcaeded8bca75ddc7b7f3
Function:  _run
args:      (<function test_query_17.<locals>._ at 0x7ed6c7f08360>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:02.6580
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-dee16bb9bcba3ff47ee2bcfc5cee2367
Function:  _run
args:      (<function test_query_16.<locals>._ at 0x7ed6c7f09760>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:02.1710
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-8bc2ddf6669b0df0a8a678174fa260f5
Function:  _run
args:      (<function test_query_15.<locals>._ at 0x7ed6c7f089a0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:01.9480
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-c9c35f8f901588a9e1fe9df1dca2b327
Function:  _run
args:      (<function test_query_14.<locals>._ at 0x7ed6c7f085e0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:01.6860
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-ad42bd4b470945c495a100dcbae396e1
Function:  _run
args:      (<function test_query_13.<locals>._ at 0x7ed6c7f08c20>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:01.2670
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-65a7eb84d3ef35863cef094d2e444715
Function:  _run
args:      (<function test_query_12.<locals>._ at 0x7ed6c7f09300>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:01.0050
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-e7f3e060b3c4abdeb296d73088c19a94
Function:  _run
args:      (<function test_query_11.<locals>._ at 0x7ed6c7f08860>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:00.7000
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-b17b863a8e3619cf23ceb84c5e902fed
Function:  _run
args:      (<function test_query_10.<locals>._ at 0x7ed6c7f084a0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
2024-02-13 18:22:00.4760
scheduler

distributed.worker - WARNING - Compute Failed
Key:       _run-0888cd3a7aa736bfe5349da01af0f172
Function:  _run
args:      (<function test_query_9.<locals>._ at 0x7f1302055bc0>)
kwargs:    {}
Exception: "OutOfMemoryException('Out of Memory Error: Failed to allocate block of 262144 bytes')"
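To make the cascade easier to see, a log excerpt like the one above can be reduced to one line per failed query. This is only a minimal sketch: the helper name `summarize_failures` and the regular expression are illustrative assumptions based on the record layout shown in this issue, not part of the benchmark suite, and sorting by query number assumes the tests ran in numerical order.

```python
import re

# Assumed record layout: an "args:" line naming test_query_<N>, then a
# "kwargs:" line, then an "Exception:" line, as in the log excerpt above.
RECORD = re.compile(
    r'args:\s+\(<function test_query_(\d+)\..*?\n'
    r'kwargs:.*?\n'
    r'Exception:\s+"(\w+)\('
)

# Two records copied from the log above (query 10 and query 9).
SAMPLE_LOG = '''distributed.worker - WARNING - Compute Failed
Key:       _run-b17b863a8e3619cf23ceb84c5e902fed
Function:  _run
args:      (<function test_query_10.<locals>._ at 0x7ed6c7f084a0>)
kwargs:    {}
Exception: "RuntimeError('Resource temporarily unavailable')"
distributed.worker - WARNING - Compute Failed
Key:       _run-0888cd3a7aa736bfe5349da01af0f172
Function:  _run
args:      (<function test_query_9.<locals>._ at 0x7f1302055bc0>)
kwargs:    {}
Exception: "OutOfMemoryException('Out of Memory Error: Failed to allocate block of 262144 bytes')"
'''


def summarize_failures(log_text):
    """Return (query_number, exception_type) pairs, sorted by query number."""
    return sorted((int(q), exc) for q, exc in RECORD.findall(log_text))


if __name__ == "__main__":
    for query, exc in summarize_failures(SAMPLE_LOG):
        print(f"query {query}: {exc}")
```

Run on the full excerpt, the first entry would be the out-of-memory failure in query 9, followed by the `RuntimeError` entries for the later queries.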

hendrikmakait commented 7 months ago

FWIW, the OOM error itself could possibly be solved by rebuilding DuckDB without jemalloc (https://github.com/duckdb/duckdb/issues/8135).

phofl commented 7 months ago

I don't think that we want to rebuild duckdb ourselves?

phofl commented 7 months ago

We might want to send them a reproducer, though, if it reproduces consistently for us.

hendrikmakait commented 7 months ago

> I don't think that we want to rebuild duckdb ourselves?

I'm not saying we should; this was more of a way for me to log a possibly related issue (and workaround). I agree that we may want to send them a reproducer if this persists.

Another possibly related issue: https://github.com/duckdb/duckdb/issues/3391

hendrikmakait commented 7 months ago

Looking at previous runs, this is not perfectly reproducible, but we can usually reproduce it at some point during a scale 1000 run:

Clusters:
https://cloud.coiled.io/clusters/380463/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
https://cloud.coiled.io/clusters/380432/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
https://cloud.coiled.io/clusters/379551/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
https://cloud.coiled.io/clusters/377502/information?viewedAccount=%22dask-benchmarks%22&tab=Logs&filterPattern=
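For context on what avoiding a cascade could look like in general: one generic way to keep a failure in one step from affecting later steps is to run each step in a fresh process, so that any exhausted memory or threads are released when that process exits. The sketch below is only an illustration of that idea using the standard library. The helper `run_isolated` is hypothetical, the failing snippet merely simulates an error, and nothing here is claimed to be how the benchmarks run DuckDB or what the eventual fix does.

```python
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run a Python snippet in a fresh interpreter.

    Returns (ok, last_stderr_line). Hypothetical helper for illustration.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    stderr_lines = proc.stderr.strip().splitlines()
    last_line = stderr_lines[-1] if stderr_lines else ""
    return proc.returncode == 0, last_line


if __name__ == "__main__":
    # Stand-ins for two consecutive benchmark steps: the first simulates a
    # failure, the second is a step that should still succeed afterwards.
    steps = {
        "step_a": "raise MemoryError('simulated failure')",
        "step_b": "print('ok')",
    }
    for name, code in steps.items():
        ok, err = run_isolated(code)
        print(name, "passed" if ok else f"failed: {err}")
```

The trade-off is startup cost per step and the loss of any state shared between steps, which may or may not be acceptable for a benchmark run.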

hendrikmakait commented 7 months ago

Fixed by #1400

coiled / benchmarks #1387