Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Error when reading data from Delta Lake table on S3 #2292

Closed Sanjay-M closed 5 months ago

Sanjay-M commented 6 months ago

Describe the bug
Error when reading data from a Delta Lake table on S3 with Daft: it fails to generate a physical plan.

To Reproduce
Steps to reproduce the behavior:

  1. Write data to S3 using the PySpark Delta Lake library with the s3a:// URI scheme
  2. Read more than approximately 5 GB of data from the Delta Lake table using Daft
  3. Use the delta_lake_read API and try collect() or to_pandas() (see the sketch below)
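A rough sketch of the setup (the bucket, paths, source data, and Spark session config below are placeholders, and I'm assuming daft.read_deltalake is the reader entry point):

# Sketch only: bucket, table path, and source data are placeholders.
from pyspark.sql import SparkSession
import daft

# 1. Write a partitioned Delta table to S3 with the s3a:// scheme via PySpark.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
source_df = spark.read.parquet("s3a://s3-bucket/raw/")  # placeholder source data
(
    source_df.write.format("delta")
    .partitionBy("sub_tbl", "system", "device_type", "manufacturer")
    .save("s3a://s3-bucket/data/abc/tbl/zstd/")
)

# 2 & 3. Read the table (>5 GB) back with Daft and try to materialize it.
df = daft.read_deltalake("s3://s3-bucket/data/abc/tbl/zstd/")
df.to_pandas()  # fails while building the physical plan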

Expected behavior
Expect it to convert the DataFrame to pandas or materialize it locally.

Information

Additional context
The Python Delta Lake library can read the data properly. df.explain(True), df.collect(), and df.to_pandas() all give an error, but df.limit(1).to_pandas() works.

Error log with RUST_BACKTRACE=full:

thread 'python' panicked at src/daft-stats/src/partition_spec.rs:45:76:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x7f23f9809238 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h49651f1624eaa20e
   1:     0x7f23f7f42fcb - core::fmt::write::hd8c86404c00a946d
   2:     0x7f23f97d2d8e - std::io::Write::write_fmt::hdc635509a53e4168
   3:     0x7f23f980abb9 - std::sys_common::backtrace::print::he40889ab52a8091b
   4:     0x7f23f980a4d9 - std::panicking::default_hook::{{closure}}::h5de145ca1ef3bb0f
   5:     0x7f23f980b5c6 - std::panicking::rust_panic_with_hook::h66b5e9b0d1f9e688
   6:     0x7f23f980af0c - std::panicking::begin_panic_handler::{{closure}}::h66f180c1c92e0484
   7:     0x7f23f980ae99 - std::sys_common::backtrace::__rust_end_short_backtrace::h07ecf66ca8b75fae
   8:     0x7f23f980ae86 - rust_begin_unwind
   9:     0x7f23f752ded5 - core::panicking::panic_fmt::h2a05afbda6f351f3
  10:     0x7f23f752dfb0 - core::panicking::panic::h57a751f983276759
  11:     0x7f23f752e308 - core::option::unwrap_failed::h96cadc01302443bf
  12:     0x7f23f92a4f15 - <daft_stats::partition_spec::PartitionSpec as core::cmp::PartialEq>::eq::hfdd898d564fa9483
  13:     0x7f23f9249172 - <daft_scan::scan_task_iters::MergeByFileSize as core::iter::traits::iterator::Iterator>::next::heb9f3e3ed674b517
  14:     0x7f23f91179ca - core::iter::adapters::try_process::hbb82326581116e24
  15:     0x7f23f910f8e1 - daft_plan::physical_planner::translate::translate_single_logical_node::h039e18c5dda6be1a
  16:     0x7f23f9176ffe - common_treenode::TreeNode::visit::ha1b8b1861af47a06
  17:     0x7f23f9176c27 - daft_plan::builder::_::<impl daft_plan::builder::PyLogicalPlanBuilder>::__pymethod_to_physical_plan_scheduler__::h059a8d4b6e45d116
  18:     0x7f23f8943634 - pyo3::impl_::trampoline::trampoline::h09583cc9ffdc4c04
  19:     0x7f23f9175831 - daft_plan::builder::_::_::__INVENTORY::trampoline::ha8524b5520a94cce
  20:           0x556bfa - method_vectorcall_VARARGS_KEYWORDS
                               at /usr/local/src/conda/python-3.12.2/Objects/descrobject.c:365:14
  21:           0x546cf1 - _PyObject_VectorcallTstate
                               at /usr/local/src/conda/python-3.12.2/Include/internal/pycore_call.h:92:11
  22:           0x546cf1 - PyObject_Vectorcall
                               at /usr/local/src/conda/python-3.12.2/Objects/call.c:325:12
  23:           0x52d15c - _PyEval_EvalFrameDefault
                               at /croot/python-split_1709054613063/work/build-static/Python/bytecodes.c:2706:19
  24:           0x57fbfc - _PyEval_EvalFrame
                               at /usr/local/src/conda/python-3.12.2/Include/internal/pycore_ceval.h:89:16
  25:           0x57fbfc - gen_send_ex2
                               at /usr/local/src/conda/python-3.12.2/Objects/genobject.c:230:14
  26:           0x57fbfc - gen_iternext
                               at /usr/local/src/conda/python-3.12.2/Objects/genobject.c:603:9
  27:           0x57d274 - list_extend
                               at /usr/local/src/conda/python-3.12.2/Objects/listobject.c:944:26
  28:           0x5aa4e3 - list___init___impl
                               at /usr/local/src/conda/python-3.12.2/Objects/listobject.c:2792
  29:           0x5aa4e3 - list_vectorcall
                               at /usr/local/src/conda/python-3.12.2/Objects/listobject.c:2817
  30:           0x546cf1 - _PyObject_VectorcallTstate
                               at /usr/local/src/conda/python-3.12.2/Include/internal/pycore_call.h:92:11
  31:           0x546cf1 - PyObject_Vectorcall
                               at /usr/local/src/conda/python-3.12.2/Objects/call.c:325:12
  32:           0x52d15c - _PyEval_EvalFrameDefault
                               at /croot/python-split_1709054613063/work/build-static/Python/bytecodes.c:2706:19
  33:           0x5f8a2e - PyEval_EvalCode
                               at /usr/local/src/conda/python-3.12.2/Python/ceval.c:578:21
  34:           0x61d897 - run_eval_code_obj
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:1722
  35:           0x6191a7 - run_mod
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:1743
  36:           0x5037bc - PyRun_InteractiveOneObjectEx
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:260
  37:           0x504151 - _PyRun_InteractiveLoopObject
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:137
  38:           0x46e057 - _PyRun_AnyFileObject
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:72
  39:           0x5042b0 - PyRun_AnyFileExFlags
                               at /usr/local/src/conda/python-3.12.2/Python/pythonrun.c:104
  40:           0x4682ca - pymain_run_stdin
                               at /usr/local/src/conda/python-3.12.2/Modules/main.c:520
  41:           0x4682ca - pymain_run_python
                               at /usr/local/src/conda/python-3.12.2/Modules/main.c:632
  42:           0x4682ca - Py_RunMain
                               at /usr/local/src/conda/python-3.12.2/Modules/main.c:709
  43:           0x5e4099 - Py_BytesMain
                               at /usr/local/src/conda/python-3.12.2/Modules/main.c:763:12
  44:     0x7f240823feb0 - __libc_start_call_main
  45:     0x7f240823ff60 - __libc_start_main_impl
  46:           0x5e3ece - <unknown>
  47:                0x0 - <unknown>
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1590, in to_pandas
    self.collect()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1466, in collect
    self._materialize_results()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1448, in _materialize_results
    self._result_cache = context.runner().run(self._builder)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 135, in run
    results = list(self.run_iter(builder))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 180, in run_iter
    plan_scheduler = builder.to_physical_plan_scheduler(daft_execution_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/logical/builder.py", line 44, in to_physical_plan_scheduler
    return PhysicalPlanScheduler(self._builder.to_physical_plan_scheduler(daft_execution_config))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
jaychia commented 6 months ago

Thanks @Sanjay-M! We're taking a look at this now :)

jaychia commented 6 months ago

Could you also supply the plan that is printed with df.explain(True)?

Sanjay-M commented 6 months ago

@jaychia It can build the logical plan but throws an error while building the physical plan. I need the IT team to bring up the servers, so it will take another 11 hours before I can post the error log.

jaychia commented 6 months ago

Got it, thanks! Other information that would be helpful for debugging: is your Delta table partitioned?

It would also be super helpful if you could share the output of your_delta_table.get_add_actions() using the Python Delta Lake library. I'm particularly interested in the data under the partition_values, min, and max columns!
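For reference, a rough sketch of how to pull that out with the deltalake Python package (the table URI is a placeholder, and S3 credentials are omitted; pass storage_options if needed):

# Sketch: table URI is a placeholder.
from deltalake import DeltaTable

dt = DeltaTable("s3://s3-bucket/data/abc/tbl/zstd/")
actions = dt.get_add_actions(flatten=False)  # one row per Add action in the Delta log

print(actions.schema)  # includes the partition_values, min, and max columns
print(actions.to_pandas()[["partition_values", "min", "max"]])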

Sanjay-M commented 6 months ago

Yes, the delta table is partitioned.

partition_values: struct<sub_tbl: string, system: string, device_type: string, manufacturer: string> not null
  child 0, sub_tbl: string
  child 1, system: string
  child 2, device_type: string
  child 3, manufacturer: string
min: struct<sub_tbl: null, system: null, device_type: null, manufacturer: null, dt: timestamp[us, tz=UTC], ts: timestamp[us, tz=UTC]> not null
max: struct<sub_tbl: null, system: null, device_type: null, manufacturer: null, dt: timestamp[us, tz=UTC], ts: timestamp[us, tz=UTC]> not null

I thought the error could be due to the partition column values being null, so I tried to replace them with NA.
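Roughly what that replacement looked like (a sketch only; I'm assuming the nulls were filled with the string "NA" on the Spark side, continuing from the write sketch above, and the table rewritten):

# Sketch: assumes null partition values were filled with "NA" in Spark before rewriting the table.
partition_cols = ["sub_tbl", "system", "device_type", "manufacturer"]
fixed_df = source_df.fillna("NA", subset=partition_cols)
(
    fixed_df.write.format("delta")
    .mode("overwrite")
    .partitionBy(*partition_cols)
    .save("s3a://s3-bucket/data/abc/tbl/zstd/")
)

The query I then ran: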

dfd = df.where(df["sub_tbl"] == "abc").select("sub_tbl", "system")
dfd.explain(True)

== Physical Plan ==
* Project: col(sub_tbl), col(system)
|   Clustering spec = { Num partitions = 17 }
|
* TabularScan:
|   Num Scan Tasks = 17
|   Estimated Scan Bytes = 27959464
|   Clustering spec = { Num partitions = 17 }

When I tried to_pandas() after replacing the NULL values, I got the error below:

ScanWithTask-Project [Stage:1]:   0%|                                                                                                      | 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1590, in to_pandas
    self.collect()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/api_annotations.py", line 26, in _wrap
    return timed_method(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/analytics.py", line 189, in tracked_method
    result = method(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1466, in collect
    self._materialize_results()
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/dataframe/dataframe.py", line 1448, in _materialize_results
    self._result_cache = context.runner().run(self._builder)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 135, in run
    results = list(self.run_iter(builder))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 187, in run_iter
    yield from results_gen
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 279, in _physical_plan_to_partitions
    materialized_results = done_future.result()
                           ^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/gbp-ml/miniconda3/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/runners/pyrunner.py", line 325, in build_partitions
    partitions = instruction.run(partitions)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/execution/execution_step.py", line 438, in run
    return self._project(inputs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/execution/execution_step.py", line 442, in _project
    return [input.eval_expression_list(self.projection)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gbp-ml/miniconda3/lib/python3.12/site-packages/daft/table/micropartition.py", line 169, in eval_expression_list
    return MicroPartition._from_pymicropartition(self._micropartition.eval_expression_list(pyexprs))
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
daft.exceptions.DaftCoreException: DaftError::External Parquet file: s3://s3-bucket/data/abc/tbl/zstd/sub_tbl=new/system=ABC/device_type=X/manufacturer=Y/part-00207-78ede249-e608-4e42-a545-72e91bg75166.c000.zstd.parquet metadata listed 1700 rows but only read: 0 
colin-ho commented 5 months ago

Hey @Sanjay-M ! Just merged a fix for this, it should be ready in the next release.