Problem description
We are seeing different kinds of errors when creating a ktk dataset, and it is unclear where they come from. Initially these were AssertionErrors from somewhere in the Parquet stack. More recently, we have seen
Exception: OSError('IOError: ZSTD decompression failed: Corrupted block detected',)
on a dask worker node.
Example code (ideally copy-pastable)
Unfortunately not so easy: essentially, we are triggering a long-running (> 3h) ktk job with kartothek.io.dask.dataframe.update_dataset_from_ddf (roughly as sketched after the traceback). During this long-running job we sometimes (?) see the following stacktrace:
```python
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_generic.py", line 120, in restore_dataframe
date_as_object=date_as_object,
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 128, in restore_dataframe
parquet_file, columns_to_io, predicates_for_pushdown
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/kartothek/serialization/_parquet.py", line 237, in _read_row_groups_into_tables
row_group = parquet_file.read_row_group(row, columns=columns)
File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/pyarrow/parquet.py", line 271, in read_row_group
use_threads=use_threads)
File "pyarrow/_parquet.pyx", line 1079, in pyarrow._parquet.ParquetReader.read_row_group
return self.read_row_groups([i], column_indices, use_threads)
File "pyarrow/_parquet.pyx", line 1098, in pyarrow._parquet.ParquetReader.read_row_groups
check_status(self.reader.get()
File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
raise IOError(message)
```
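For reference, this is roughly how the job is triggered. It is a minimal sketch, not the production code: the store URL, dataset_uuid, table name, and input data are placeholders, and we assume a storefact-style store factory for the Azure backend.

```python
# Minimal sketch of how the job is triggered; store URL, dataset_uuid and the
# input data are placeholders, not the real production values.
from functools import partial

import dask.dataframe as dd
import pandas as pd
from kartothek.io.dask.dataframe import update_dataset_from_ddf
from storefact import get_store_from_url

# hypothetical store factory for the Azure storage backend
store_factory = partial(get_store_from_url, "hazure://account:key@container")

ddf = dd.from_pandas(
    pd.DataFrame({"partition_col": [1, 2], "value": [0.1, 0.2]}),
    npartitions=2,
)

graph = update_dataset_from_ddf(
    ddf,
    store=store_factory,
    dataset_uuid="example_dataset",  # placeholder
    table="table",
    partition_on=["partition_col"],
)
graph.compute()  # the real job runs > 3h before the errors appear
```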
Debugging the issue hints at an improper fetch in our IO buffer, but the root cause is unknown. The issue might be triggered by a non-thread-safe reader in pyarrow, a bug in our Azure storage backend, or the buffer itself; see also #402.
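One check that may help to narrow this down (a sketch, assuming the hypothetical store factory from above; the blob key is a placeholder): re-read the suspect Parquet blob directly with pyarrow. If the ZSTD error reproduces on every read, the file itself is corrupt; if it only appears sporadically, the read path (buffer/backend) is the more likely culprit.

```python
# Sketch: re-read a suspect Parquet blob directly with pyarrow to check
# whether the ZSTD error is persistent (corrupt file) or transient (read path).
import io

import pyarrow.parquet as pq

store = store_factory()  # hypothetical store factory from the sketch above
# placeholder blob key; use the key of the partition that failed on the worker
buf = store.get("example_dataset/table/<suspect_partition>.parquet")

parquet_file = pq.ParquetFile(io.BytesIO(buf))
for i in range(parquet_file.num_row_groups):
    # read_row_group raises an IOError if the block is really corrupted
    parquet_file.read_row_group(i)
```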