coiled / benchmarks

BSD 3-Clause "New" or "Revised" License
32 stars 17 forks source link

Error running query 2 on scale 0.01 version of the dataset #1393

Closed jrbourbeau closed 9 months ago

jrbourbeau commented 9 months ago

I was running query 2 on a small version of the TPC-H dataset (scale 0.01) and got this error:

  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/engine.py", line 1744, in orchestrate_task_run
    result = await call.aresult()
             ^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 291, in aresult
    return await asyncio.wrap_future(self.future)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 315, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/projects/coiled/etl-tpch/pipeline/reduce.py", line 111, in save_query
    result.to_parquet(outdir, compression="snappy", name_function=name)
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_collection.py", line 2887, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 252, in to_parquet
    if not df.known_divisions:
           ^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_collection.py", line 495, in known_divisions
    return self.expr.known_divisions
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 391, in known_divisions
    return len(self.divisions) > 0 and self.divisions[0] is not None
               ^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 383, in divisions
    return tuple(self._divisions())
                 ^^^^^^^^^^^^^^^^^
  File "/Users/james/mambaforge/envs/etl-tpch2/lib/python3.11/site-packages/dask_expr/_expr.py", line 2183, in _divisions
    self.frame.divisions[self.operand("npartitions") + 1],
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: tuple index out of range

My guess is this filtering step

https://github.com/coiled/benchmarks/blob/7a070d98de12918bb0520368d042824e6ee9b428/tests/tpch/dask_queries.py#L67-L70

resulted in an empty DataFrame that's not being handled well by dask-expr. Though I've yet to confirm that and will debug more later. Just opening an issue for visibility

cc @phofl