JDASoftwareGroup / kartothek

A consistent table management library in python
https://kartothek.readthedocs.io/en/stable
MIT License

Parquet `AssertionErrors` on long running jobs #407

Open NeroCorleone opened 3 years ago

NeroCorleone commented 3 years ago

Problem description

We are seeing different kinds of errors when creating a ktk dataset, and it is unclear where these errors come from. Initially these were `AssertionError`s from somewhere in the Parquet stack; more recently we have seen `Exception: OSError('IOError: ZSTD decompression failed: Corrupted block detected',)` on a dask worker node.

Example code (ideally copy-pastable)

Unfortunately this is not easy to reproduce: essentially we are triggering a long-running (> 3 h) ktk job with `kartothek.io.dask.dataframe.update_dataset_from_ddf`. During this long-running job we sometimes (?) see the following stacktrace:

```python
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_generic.py", line 120, in restore_dataframe
    date_as_object=date_as_object,
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 128, in restore_dataframe
    parquet_file, columns_to_io, predicates_for_pushdown
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 237, in _read_row_groups_into_tables
    row_group = parquet_file.read_row_group(row, columns=columns)
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 271, in read_row_group
    use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1079, in pyarrow._parquet.ParquetReader.read_row_group
    return self.read_row_groups([i], column_indices, use_threads)
File "pyarrow/_parquet.pyx", line 1098, in pyarrow._parquet.ParquetReader.read_row_groups
    check_status(self.reader.get()
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
    raise IOError(message)
```

Used versions

```
# pip freeze
attrs==20.2.0
azure-common==1.1.25
azure-storage-blob==2.1.0
azure-storage-common==2.1.0
blinker==1.4
bokeh==2.2.1
cffi==1.14.3
chardet==3.0.4
click==7.1.2
cloudpickle==1.5.0
contextvars==2.4
cytoolz==0.10.1
dask==2.30.0
dataclasses==0.7
decorator==4.4.2
distributed==2.30.1+by.1
Flask==1.1.2
fsspec==0.8.4
gunicorn==20.0.4
HeapDict==1.0.1
idna==2.10
immutables==0.14
itsdangerous==1.1.0
Jinja2==2.11.2
kartothek==3.17.0
locket==0.2.0
lz4==3.1.0
MarkupSafe==1.1.1
milksnake==0.1.5
msgpack==1.0.0
numpy==1.19.1
packaging==20.4
pandas==1.1.4
partd==1.1.0
Pillow==7.2.0
pip==19.2.3
prometheus-client==0.8.0
prompt-toolkit==3.0.5
psutil==5.7.3
pyarrow==1.0.1
pycparser==2.20
pydantic==1.7.2
pygelf==0.3.4
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2020.1
PyYAML==5.3.1
requests==2.24.0
retail-interface==0.21.0
sentry-sdk==0.16.2
setuptools==41.2.0
simplejson==3.17.2
simplekv==0.14.1
six==1.15.0
sortedcontainers==2.2.2
storefact==0.10.0
structlog==20.1.0
tblib==1.7.0
terminaltables==3.1.0
toolz==0.10.0
tornado==6.1
typing-extensions==3.7.4.3
uritools==3.0.0
urllib3==1.25.10
urlquote==1.1.4
voluptuous==0.11.7
wcwidth==0.2.5
Werkzeug==1.0.1
wheel==0.33.6
zict==2.0.0
zstandard==0.14.0
```

Debugging the issue hints at an improper fetch in our IO buffer, but the root cause is unknown. The issue might be triggered by a non-thread-safe reader in pyarrow, a bug in our Azure storage backend, or the buffer itself; see also #402.
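Until the root cause is found, one possible stopgap is to retry reads that fail with this kind of `OSError`, since the corruption appears to be transient rather than on-disk. A minimal stdlib sketch (the helper name and the flaky reader are hypothetical, not kartothek API):

```python
import time


def read_with_retry(read_fn, attempts=3, delay=0.0):
    """Call read_fn(), retrying on OSError such as
    'ZSTD decompression failed: Corrupted block detected'."""
    for attempt in range(1, attempts + 1):
        try:
            return read_fn()
        except OSError:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Demonstration with a reader that fails twice, then succeeds.
calls = {"n": 0}

def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError(
            "IOError: ZSTD decompression failed: Corrupted block detected"
        )
    return "table"

result = read_with_retry(flaky_read)
```

Note this only masks the symptom: if the buffer returns the same corrupted bytes on every attempt, retrying at this level will not help.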

fjetter commented 3 years ago

Very likely caused by https://github.com/Azure/azure-sdk-for-python/issues/16723