I was running query 2 on a small version of the TPC-H dataset (scale 0.01) and got this error:
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/engine.py", line 1744, in orchestrate_task_run
result = await call.aresult()
^^^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 291, in aresult
return await asyncio.wrap_future(self.future)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 315, in _run_sync
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/james/projects/coiled/etl-tpch/pipeline/reduce.py", line 111, in save_query
result.to_parquet(outdir, compression="snappy", name_function=name)
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_collection.py", line 2887, in to_parquet
return to_parquet(self, path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 252, in to_parquet
if not df.known_divisions:
^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_collection.py", line 495, in known_divisions
return self.expr.known_divisions
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 391, in known_divisions
return len(self.divisions) > 0 and self.divisions[0] is not None
^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/functools.py", line 1001, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 383, in divisions
return tuple(self._divisions())
^^^^^^^^^^^^^^^^^
File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 2183, in _divisions
self.frame.divisions[self.operand("npartitions") + 1],
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: tuple index out of range
My guess is this filtering step
https://github.com/coiled/benchmarks/blob/7a070d98de12918bb0520368d042824e6ee9b428/tests/tpch/dask_queries.py#L67-L70
resulted in an empty DataFrame that's not being handled well by dask-expr. I've yet to confirm that and will debug more later; just opening this issue for visibility.
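If that hypothesis is right, something along these lines might be enough to reproduce it (an untested sketch; the column names, values, and output path are made up rather than taken from the actual query):

```python
import pandas as pd
import dask.dataframe as dd

# Untested sketch: filter a small frame down to zero rows, then write it to
# parquet the same way save_query does. Data and paths here are invented.
pdf = pd.DataFrame({"p_size": [1, 2, 3], "p_type": ["BRASS", "STEEL", "TIN"]})
df = dd.from_pandas(pdf, npartitions=2)

empty = df[df.p_size > 100]  # matches no rows -> empty DataFrame
empty.to_parquet("out/", compression="snappy")
```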
cc @phofl