Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

[BUG] Physical plan breaking 50% of the time #1843

Closed dioptre closed 4 months ago

dioptre commented 4 months ago

Describe the bug: The physical plan breaks: thread '<unnamed>' panicked at 'no entry found for key', src/daft-plan/src/physical_plan.rs:386:28

File "/Users/andrewgrosser/Library/Caches/pypoetry/virtualenvs/ml-Yqfv2jYI-py3.10/lib/python3.10/site-packages/daft/plan_scheduler/physical_plan_scheduler.py", line 22, in to_partition_tasks
 return physical_plan.materialize(self._scheduler.to_partition_tasks(psets, is_ray_runner))
pyo3_runtime.PanicException: no entry found for key

To Reproduce: Select multiple Parquet files from S3 and read them with daft.read_parquet using the native downloader.

Expected behavior: The download works.


Additional context: version 0.2.12

Gets: 0, Heads: 0, Lists: 2, BytesRead: 0, AvgGetSize: 0
ScanWithTask [Stage:1]:   0%|          | 0/1 [00:00<?, ?it/s]
thread '<unnamed>' panicked at 'no entry found for key', src/daft-plan/src/physical_plan.rs:386:28

I'm blocked from using Daft due to this unreliability. Please help!

samster25 commented 4 months ago

Hi @dioptre! Thanks for raising this issue! Could you share a code snippet that causes this error to happen?

Based on the line number, it looks like it may be an in-memory scan?

dioptre commented 4 months ago

I run:

daft.read_parquet(
    [parquets in s3 array], use_native_downloader=True
).to_arrow()

samster25 commented 4 months ago

Could you also run:

df = daft.read_parquet(
    [parquets in s3 array],
    use_native_downloader=True
)
df.explain()

dioptre commented 4 months ago

We get a segmentation fault, so that won't be possible.
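When a call crashes the interpreter outright, one option is to run it in a child process so the parent survives and can report what happened. The sketch below is only an illustration: `run_isolated` and `describe` are hypothetical helper names, not part of Daft, and the snippet string passed in would be whatever code triggers the crash. It relies on CPython's `-X faulthandler` option, which dumps the Python traceback to stderr on a fatal signal.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 300.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Hypothetical helper: a crash in the child (for example a segfault)
    does not take down the calling process.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,  # keep stdout/stderr for the bug report
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarize how the child ended; on POSIX a negative code means a signal."""
    if result.returncode < 0:
        return f"killed by signal {-result.returncode}"
    return f"exited with code {result.returncode}"
```

The captured `result.stderr` would then contain any traceback that faulthandler managed to write before the process died, which can be pasted into the issue.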

dioptre commented 4 months ago

Please note that we are getting successes and then failures on the same files!
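To put a number on intermittent behaviour like this, a small harness can repeat the same call and tally the outcomes. This is a minimal sketch, not anything from Daft itself: `tally_outcomes` is a hypothetical helper, and it assumes the failure surfaces as a Python exception rather than a hard crash. It catches `BaseException` because, to my understanding, PyO3's `PanicException` derives from `BaseException`, so a plain `except Exception` would miss it.

```python
from collections import Counter
from typing import Callable


def tally_outcomes(action: Callable[[], object], attempts: int = 20) -> Counter:
    """Call `action` repeatedly and count successes and failures.

    Hypothetical helper. Failures are keyed by exception type and message
    so that distinct errors are counted separately.
    """
    outcomes: Counter = Counter()
    for _ in range(attempts):
        try:
            action()
        except KeyboardInterrupt:
            raise  # let the user abort the loop
        except BaseException as exc:  # assumption: Rust panics arrive as BaseException
            outcomes[f"{type(exc).__name__}: {exc}"] += 1
        else:
            outcomes["ok"] += 1
    return outcomes


# Usage sketch (`paths` is a placeholder for the list of S3 Parquet URLs):
# tally_outcomes(
#     lambda: daft.read_parquet(paths, use_native_downloader=True).to_arrow()
# )
```

Reporting the resulting counts alongside the file list makes it easier to tell a per-file problem from a nondeterministic one.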