nfoerster2 opened 1 month ago
Why is the one on the cluster showing logs with asyncio and threading?
I spent more time investigating; it seems to be a problem on amd64 with multiple partition levels that are also part of the merge predicate. The reproduce.py script attached below runs fine on arm64.
I attached:
Core dump: https://file.io/3K1NXipK9eLy
Reproduce python script: reproduce.py.zip
Working script with just one partition: unittest_single_partition.py.zip
Testdata: test_first_16000.parquet.zip
I think it's crashing in the Rust code; the attached core dump may give more information about that.
Did anyone have a chance to check the core dump?
BR
Environment
Delta-rs version: 0.17.4
Binding: Python 3.12
Environment:
K8s resources: requests: {memory: 25Gi, cpu: "3"}, limits: {memory: 60Gi, cpu: "4"}
Installed packages:
adlfs==2024.2.0 agate==1.7.1 aiohttp==3.9.5 aiosignal==1.3.1 annotated-types==0.7.0 attrs==23.2.0 azure-core==1.30.1 azure-datalake-store==0.0.53 azure-identity==1.15.0 azure-storage-blob==12.20.0 Babel==2.15.0 certifi==2024.2.2 cffi==1.16.0 charset-normalizer==3.3.2 click==8.1.7 colorama==0.4.6 cryptography==42.0.7 dbt-core==1.7.15 dbt-duckdb==1.7.4 dbt-extractor==0.5.1 dbt-semantic-interfaces==0.4.4 deltalake==0.17.4 duckdb==0.10.1 duckdb_deltalake_dbt==0.2.3rc1 frozenlist==1.4.1 fsspec==2024.3.1 idna==3.7 importlib-metadata==6.11.0 isodate==0.6.1 Jinja2==3.1.4 jsonschema==4.22.0 jsonschema-specifications==2023.12.1 leather==0.4.0 Logbook==1.5.3 loguru==0.7.2 MarkupSafe==2.1.5 mashumaro==3.13 minimal-snowplow-tracker==0.0.2 more-itertools==10.2.0 msal==1.28.0 msal-extensions==1.1.0 msgpack==1.0.8 multidict==6.0.5 networkx==3.3 numpy==1.26.4 packaging==24.0 pandas==2.2.2 parsedatetime==2.6 pathspec==0.11.2 polars==0.20.29 portalocker==2.8.2 protobuf==4.25.3 pyarrow==15.0.2 pyarrow-hotfix==0.6 pycparser==2.22 pydantic==2.7.1 pydantic_core==2.18.2 PyJWT==2.8.0 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-slugify==8.0.4 pytimeparse==1.1.8 pytz==2024.1 PyYAML==6.0.1 referencing==0.35.1 requests==2.32.2 rpds-py==0.18.1 setuptools==69.5.1 six==1.16.0 sqlparse==0.5.0 text-unidecode==1.3 typing_extensions==4.12.0 tzdata==2024.1 urllib3==1.26.18 wheel==0.43.0 yarl==1.9.4 zipp==3.19.0
Bug
What happened: I have a script that does data munging with DuckDB and a Delta table. After the DuckDB query finishes, I call arrow() on the result and pass it to a merge operation on a Delta table on Azure Blob Storage. In the cloud I get a segmentation fault during the merge operation, but locally on my MacBook Pro (M2, about 8 GB RAM) it processes all 15M rows in 5 minutes. The local test uses the same storage location on Azure Blob.
While debugging I tried a batch reader, even though the cloud resources are much larger than the MacBook's. A chunk_size of 10k rows succeeds but is incredibly slow, while a chunk_size of 100k and above results in a segmentation fault. The table has 17 columns, so 100k rows should fit in very little memory. I watched memory in top in parallel; it is around 10% when the segmentation fault happens.
What you expected to happen: The same behavior in the cloud as locally; if anything, the local run should be much slower.
How to reproduce it: Here is my code (see the attached reproduce.py.zip above):
More details: Local run:
100k batch on cluster:
10k batch on cluster: