dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0
2.38k stars 154 forks

error when loading multiple resources into `delta` table format with multithreading #1808

Closed: jorritsandbrink closed this 1 week ago

jorritsandbrink commented 1 week ago

dlt version

0.9.9a1

Describe the problem

Intermittent error when loading multiple resources into the delta table format. It originates as a Rust panic in delta-rs.

Failed run:

[image: screenshot of a failed run, not preserved]

Another failed run, but now with RUST_BACKTRACE=full:

DeltaLoadFilesystemJob.__run__: r3
DeltaLoadFilesystemJob.__run__: r1
DeltaLoadFilesystemJob.__run__: r2
thread '<unnamed>' panicked at python/src/utils.rs:27:18:
Failed to record PID for tokio runtime.: 33784
stack backtrace:
DeltaLoadFilesystemJob.__run__: r0
DeltaLoadFilesystemJob.__run__: r4
   0:     0x7f33eab19825 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd736fd5964392270
   1:     0x7f33eab48fdb - core::fmt::write::hc6043626647b98ea
   2:     0x7f33eab15aaf - std::io::Write::write_fmt::h0d24b3e0473045db
   3:     0x7f33eab195fe - std::sys_common::backtrace::print::h45eb8174d25a1e76
   4:     0x7f33eab1ab49 - std::panicking::default_hook::{{closure}}::haf3f0170eb4f3b53
   5:     0x7f33eab1a8ea - std::panicking::default_hook::hb5d3b27aa9f6dcda
   6:     0x7f33eab1afe3 - std::panicking::rust_panic_with_hook::h6b49d59f86ee588c
   7:     0x7f33eab1aec4 - std::panicking::begin_panic_handler::{{closure}}::hd4c2f7ed79b82b70
   8:     0x7f33eab19ce9 - std::sys_common::backtrace::__rust_end_short_backtrace::h2946d6d32d7ea1ad
   9:     0x7f33eab1abf7 - rust_begin_unwind
  10:     0x7f33e6ec5e63 - core::panicking::panic_fmt::ha02418e5cd774672
  11:     0x7f33e6ec6356 - core::result::unwrap_failed::h55f86ada3ace5ed2
  12:     0x7f33e71fb71f - deltalake::utils::rt::ha436d8093aaa10df
  13:     0x7f33e71d7485 - pyo3::marker::Python::allow_threads::h6073ca64fd5c7c1e
  14:     0x7f33e7191143 - deltalake::RawDeltaTable::new::he394dbbf065cffc9
  15:     0x7f33e7195da6 - deltalake::RawDeltaTable::__pymethod___new____::h08af7ab82fa93374
  16:     0x7f33e71794bc - pyo3::impl_::trampoline::trampoline::h684b0d91afb56a75
  17:     0x7f33e71958e1 - deltalake::<impl pyo3::impl_::pyclass::PyMethods<deltalake::RawDeltaTable> for pyo3::impl_::pyclass::PyClassImplCollector<deltalake::RawDeltaTable>>::py_methods::ITEMS::trampoline::h457f037821f80b20
  18:     0x55b5a4204477 - _PyObject_MakeTpCall
  19:     0x55b5a41fd871 - _PyEval_EvalFrameDefault
  20:     0x55b5a420e6ac - _PyFunction_Vectorcall
  21:     0x55b5a420376d - _PyObject_FastCallDictTstate
  22:     0x55b5a42187a4 - <unknown>
  23:     0x55b5a42044cc - _PyObject_MakeTpCall
  24:     0x55b5a41fd871 - _PyEval_EvalFrameDefault
  25:     0x55b5a420e6ac - _PyFunction_Vectorcall
  26:     0x55b5a41f7c16 - _PyEval_EvalFrameDefault
  27:     0x55b5a420e6ac - _PyFunction_Vectorcall
  28:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  29:     0x55b5a420e6ac - _PyFunction_Vectorcall
  30:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  31:     0x55b5a420e6ac - _PyFunction_Vectorcall
  32:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  33:     0x55b5a420e6ac - _PyFunction_Vectorcall
  34:     0x55b5a41f8ca9 - _PyEval_EvalFrameDefault
  35:     0x55b5a420e6ac - _PyFunction_Vectorcall
  36:     0x55b5a41f8ca9 - _PyEval_EvalFrameDefault
  37:     0x55b5a420e6ac - _PyFunction_Vectorcall
  38:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  39:     0x55b5a420e6ac - _PyFunction_Vectorcall
  40:     0x55b5a41f8ca9 - _PyEval_EvalFrameDefault
  41:     0x55b5a420e6ac - _PyFunction_Vectorcall
  42:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  43:     0x55b5a420e6ac - _PyFunction_Vectorcall
  44:     0x55b5a41f6b2b - _PyEval_EvalFrameDefault
  45:     0x55b5a421c4b1 - <unknown>
  46:     0x55b5a434581a - <unknown>
  47:     0x55b5a433ac48 - <unknown>
  48:     0x7f342c2a7ac3 - <unknown>
  49:     0x7f342c339850 - <unknown>
  50:                0x0 - <unknown>

On Windows the backtrace is slightly different:

thread '<unnamed>' panicked at python\src\utils.rs:27:18:
Failed to record PID for tokio runtime.: 4144
stack backtrace:
   0:     0x7ff9ee31423d - bz_internal_error
   1:     0x7ff9ee33b079 - bz_internal_error
   2:     0x7ff9ee30f0b1 - bz_internal_error
   3:     0x7ff9ee314016 - bz_internal_error
   4:     0x7ff9ee316188 - bz_internal_error
   5:     0x7ff9ee315e36 - bz_internal_error
   6:     0x7ff9ee3166b8 - bz_internal_error
   7:     0x7ff9ee316577 - bz_internal_error
   8:     0x7ff9ee314baf - bz_internal_error
   9:     0x7ff9ee316228 - bz_internal_error
  10:     0x7ff9ee4b3724 - bz_internal_error
  11:     0x7ff9ee4b3bc0 - bz_internal_error
  12:     0x7ff9ea7b178b - PyInit__internal
  13:     0x7ff9ea793a45 - PyInit__internal
  14:     0x7ff9ea730548 - <unknown>
  15:     0x7ff9ea735f0f - <unknown>
  16:     0x7ff9ea712285 - <unknown>
  17:     0x7ff9ea735931 - <unknown>
  18:     0x7ffa6cca683e - PyTuple_New
  19:     0x7ffa6cca53aa - PyObject_MakeTpCall
  20:     0x7ffa6cc87e59 - PyEval_EvalFrameDefault
  21:     0x7ffa6cc7e618 - PyEval_EvalCodeWithName
  22:     0x7ffa6cc7fd5f - PyFunction_Vectorcall
  23:     0x7ffa6cd4a48d - PyFloat_GetInfo
  24:     0x7ffa6cca68b6 - PyTuple_New
  25:     0x7ffa6cca53aa - PyObject_MakeTpCall
  26:     0x7ffa6cc87e59 - PyEval_EvalFrameDefault
  27:     0x7ffa6cc7e618 - PyEval_EvalCodeWithName
  28:     0x7ffa6cc7fd5f - PyFunction_Vectorcall
  29:     0x7ffa6cc86b9e - PyEval_EvalFrameDefault
  30:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  31:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  32:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  33:     0x7ffa6cc7fc7d - PyFunction_Vectorcall
  34:     0x7ffa6cc9a6a2 - PyVectorcall_Call
  35:     0x7ffa6cc9a533 - PySequence_GetItem
  36:     0x7ffa6cc867a5 - PyEval_EvalFrameDefault
  37:     0x7ffa6cc7e618 - PyEval_EvalCodeWithName
  38:     0x7ffa6cc7fd5f - PyFunction_Vectorcall
  39:     0x7ffa6cc9a6a2 - PyVectorcall_Call
  40:     0x7ffa6cc9a533 - PySequence_GetItem
  41:     0x7ffa6cc867a5 - PyEval_EvalFrameDefault
  42:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  43:     0x7ffa6cc7fc7d - PyFunction_Vectorcall
  44:     0x7ffa6cc9a6a2 - PyVectorcall_Call
  45:     0x7ffa6cc9a533 - PySequence_GetItem
  46:     0x7ffa6cc867a5 - PyEval_EvalFrameDefault
  47:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  48:     0x7ffa6cc82d24 - PyEval_EvalFrameDefault
  49:     0x7ffa6cc7fc7d - PyFunction_Vectorcall
  50:     0x7ffa6cc7e2ec - PyObject_GetBuffer
  51:     0x7ffa6cc9a6a2 - PyVectorcall_Call
  52:     0x7ffa6cc98cf9 - PyObject_Call
  53:     0x7ffa6ccf8a0f - PyImport_FindBuiltin
  54:     0x7ffa6ccf8996 - PyImport_FindBuiltin
  55:     0x7ffab6129333 - recalloc
  56:     0x7ffab879257d - BaseThreadInitThunk
  57:     0x7ffab8b8af28 - RtlUserThreadStart

Expected behavior

No error.

Steps to reproduce

Run the script below a few times. The error does not surface on every run, but it usually does not take many runs to appear (at least in my environment).

import os
import dlt
from dlt.destinations import filesystem

# os.environ["LOAD__WORKERS"] = "1"  # error seems not to appear when disabling multithreading
os.environ["RUST_BACKTRACE"] = "full"

num_resources = 5
resources = [
    dlt.resource([{"foo": "bar"}], name=f"r{n}")
    for n in range(num_resources)
]
pipe = dlt.pipeline(
    pipeline_name="delta_source",
    pipelines_dir="_storage",
    destination=filesystem("_storage"),
)

pipe.run(resources, table_format="delta")

Operating system

Linux, Windows

Runtime environment

Local

Python version

3.10

dlt data source

No response

dlt destination

Filesystem & buckets

Other deployment details

No response

Additional information

Maybe delta-rs's multithreading doesn't play nice with dlt's multithreading.

rudolfix commented 1 week ago

I have been seeing the same recently. It was not there previously, so I expect this is something in the newest delta release.

  1. Should we report this in the delta-rs repo?
  2. We can add a thread lock around the whole operation, or around the creation of any new DeltaTable instance (judging by the stack trace, that is where it fails).
  3. We do not see it on CI at all, so maybe it only happens on WSL (which is what I'm on).
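Option 2 could look roughly like the sketch below. It is not a confirmed fix and not dlt's actual implementation: `create_serialized`, `FakeTable` and `load_all` are hypothetical names, and `FakeTable` only stands in for `deltalake.DeltaTable` so the example runs without delta-rs installed. The idea is that all load threads share one lock, so only one constructor call is in flight at a time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# One process-wide lock shared by all load threads.
_TABLE_INIT_LOCK = threading.Lock()


def create_serialized(factory, *args, **kwargs):
    """Call factory(*args, **kwargs) while holding the shared lock."""
    with _TABLE_INIT_LOCK:
        return factory(*args, **kwargs)


class FakeTable:
    """Stand-in for deltalake.DeltaTable; records overlapping constructor calls."""

    _guard = threading.Lock()
    active = 0
    max_active = 0

    def __init__(self, uri):
        cls = FakeTable
        with cls._guard:
            cls.active += 1
            cls.max_active = max(cls.max_active, cls.active)
        time.sleep(0.01)  # widen the window in which constructors could overlap
        with cls._guard:
            cls.active -= 1
        self.uri = uri


def load_all(uris, workers=5):
    """Create one table object per URI from a thread pool, one constructor at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: create_serialized(FakeTable, u), uris))
```

With the real library the call would be something like `create_serialized(DeltaTable, table_uri)`. The lock only serializes construction; the rest of each load job still runs in parallel. Whether that is enough to avoid the panic would have to be verified against delta-rs.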