Thanks @Sanjay-M! We're taking a look at this now :)
Could you also supply the plan that is printed with `df.explain(True)`?
@jaychia It can build the logical plan but throws an error while building the physical plan. I need the IT team to bring up the servers, so it will take another 11 hours before I can post the error log.
Got it, thanks! Other information that would be helpful for debugging: it would also be super helpful if you could share the output of `your_delta_table.get_add_actions()` using the Python Delta Lake library. I'm particularly interested in the data in there under the columns `partition_values`, `min`, and `max`!
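For reference, a minimal sketch of pulling those columns with the `deltalake` Python package (the table URI here is an assumption, inferred from the error message later in the thread; S3 credentials are assumed to be configured in the environment):

```python
from deltalake import DeltaTable

# Assumed table URI; substitute the real one.
dt = DeltaTable("s3://s3-bucket/data/abc/tbl/zstd")

# get_add_actions() returns a pyarrow RecordBatch of the Delta add actions.
adds = dt.get_add_actions()
print(adds.schema)  # shows the partition_values / min / max struct columns
print(adds.to_pandas()[["partition_values", "min", "max"]])
```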
Yes, the delta table is partitioned.
```
partition_values: struct<sub_tbl: string, system: string, device_type: string, manufacturer: string> not null
  child 0, sub_tbl: string
  child 1, system: string
  child 2, device_type: string
  child 3, manufacturer: string
min: struct<sub_tbl: null, system: null, device_type: null, manufacturer: null, dt: timestamp[us, tz=UTC], ts: timestamp[us, tz=UTC]> not null
max: struct<sub_tbl: null, system: null, device_type: null, manufacturer: null, dt: timestamp[us, tz=UTC], ts: timestamp[us, tz=UTC]> not null
```
I thought the error could be due to the partition column values being null, so I tried to replace them with "NA" (along the lines of the sketch below).
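A hedged sketch of what that replacement might look like in Daft, assuming `Expression.fill_null` is available in the Daft build in use (column names taken from the schema above):

```python
# Sketch only: fill null values in each partition column with the string "NA".
for col in ["sub_tbl", "system", "device_type", "manufacturer"]:
    df = df.with_column(col, df[col].fill_null("NA"))
```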
Then I ran:

```python
dfd = df.where(df["sub_tbl"] == "abc").select("sub_tbl", "system")
dfd.explain(True)
```
```
== Physical Plan ==
* Project: col(sub_tbl), col(system)
|   Clustering spec = { Num partitions = 17 }
|
* TabularScan:
|   Num Scan Tasks = 17
|   Estimated Scan Bytes = 27959464
|   Clustering spec = { Num partitions = 17 }
```
When I called `to_pandas()` after replacing the NULL values, I got the error below:
```
ScanWithTask-Project [Stage:1]: 0%| | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1590, in to_pandas
    self.collect()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1466, in collect
    self._materialize_results()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1448, in _materialize_results
    self._result_cache = context.runner().run(self._builder)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 135, in run
    results = list(self.run_iter(builder))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 187, in run_iter
    yield from results_gen
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 279, in _physical_plan_to_partitions
    materialized_results = done_future.result()
                           ^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 325, in build_partitions
    partitions = instruction.run(partitions)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/execution/execution_step.py", line 438, in run
    return self._project(inputs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/execution/execution_step.py", line 442, in _project
    return [input.eval_expression_list(self.projection)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/table/micropartition.py", line 169, in eval_expression_list
    return MicroPartition._from_pymicropartition(self._micropartition.eval_expression_list(pyexprs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Parquet file: s3://s3-bucket/data/abc/tbl/zstd/sub_tbl=new/system=ABC/device_type=X/manufacturer=Y/part-00207-78ede249-e608-4e42-a545-72e91bg75166.c000.zstd.parquet metadata listed 1700 rows but only read: 0
```
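For what it's worth, one way to check whether that file is readable outside Daft is to open it directly with PyArrow and compare the footer row count against the rows actually decoded. This is a sketch only; the path is copied from the error above, and S3 credentials are assumed to be available in the environment:

```python
import pyarrow.fs
import pyarrow.parquet as pq

# Path copied verbatim from the DaftCoreException message.
uri = "s3://s3-bucket/data/abc/tbl/zstd/sub_tbl=new/system=ABC/device_type=X/manufacturer=Y/part-00207-78ede249-e608-4e42-a545-72e91bg75166.c000.zstd.parquet"

filesystem, path = pyarrow.fs.FileSystem.from_uri(uri)
f = pq.ParquetFile(filesystem.open_input_file(path))
print(f.metadata.num_rows)  # rows according to the Parquet footer (1700 here)
print(f.read().num_rows)    # rows actually decoded
```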
Hey @Sanjay-M! Just merged a fix for this; it should be ready in the next release.
**Describe the bug**
Error while reading data from a Delta Lake table on S3 with Daft; it is not able to generate a physical plan.
**To Reproduce**
Steps to reproduce the behavior (a hedged minimal reproduction is sketched below):
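```python
import daft

# Assumed table URI, inferred from the error message in this thread;
# daft.read_deltalake is assumed to be the reader in use.
df = daft.read_deltalake("s3://s3-bucket/data/abc/tbl/zstd")

df.limit(1).to_pandas()  # works
df.to_pandas()           # raises DaftCoreException (see traceback above)
```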
**Expected behavior**
Expect it to convert the DataFrame to pandas or materialize it locally.
**Information**
**Additional context**
The Python Delta Lake library can read the data properly. `df.explain(True)`, `df.collect()`, and `df.to_pandas()` give errors, but `df.limit(1).to_pandas()` works. Error log with `RUST_BACKTRACE=full`