Nekit2217 opened this issue 3 months ago
I fixed the error by changing the type of the stats field, but it feels like a workaround:
// before:
static ref ADD_FIELDS: Vec<ArrowField> = arrow_defs![
    path:Utf8,
    size:Int64,
    modificationTime:Int64,
    dataChange:Boolean,
    stats:Utf8,
    // ...remaining fields unchanged
];
// after:
static ref ADD_FIELDS: Vec<ArrowField> = arrow_defs![
    path:Utf8,
    size:Int64,
    modificationTime:Int64,
    dataChange:Boolean,
    stats:LargeUtf8,
    // ...remaining fields unchanged
];
After the change, Spark stopped reading the data.
Utf8 is limited to about 2 GB of data @_@
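For context: Arrow's Utf8 type stores 32-bit byte offsets, so a single array/column can hold at most ~2 GiB of string bytes, while LargeUtf8 uses 64-bit offsets. A tiny pyarrow illustration of just the two types (nothing delta-rs specific):

import pyarrow as pa

# Utf8 ("string") -> int32 offsets, ~2 GiB of string bytes per array.
# LargeUtf8 ("large_string") -> int64 offsets, no such limit.
small = pa.array(["foo", "bar"], type=pa.string())
large = small.cast(pa.large_string())
print(small.type, large.type)  # string large_string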
Can you share your table schema and the Spark error? Also, please check whether there are any huge JSON files in _delta_log; stats on string or binary columns might be creating this problem.
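If it helps, here is a quick way to list the file sizes in _delta_log (the path is a placeholder):

import os

log_dir = "/path/to/table/_delta_log"  # placeholder path
for entry in sorted(os.scandir(log_dir), key=lambda e: e.name):
    print(f"{entry.stat().st_size / 2**20:8.1f} MiB  {entry.name}")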
I can kind of reproduce it (plz don't run this if you love your laptop):
import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# One row, ~600 columns whose names are huge strings, so the per-file
# stats JSON (which is keyed by column name) becomes enormous.
py_list_ran = {}
for i in range(1, 600):
    py_list_ran[chr(97) * 4000 * i] = 10000 * i

py_table = pa.Table.from_pylist([py_list_ran])
write_deltalake("temp5", py_table, engine="rust", configuration={"delta.dataSkippingNumIndexedCols": "-1"})

table = DeltaTable("temp5")
table.create_checkpoint()
print(table.schema())
# table.optimize.compact()
Exception: Json error: whilst decoding field 'add': whilst decoding field 'stats': offset overflow decoding Utf8
EDIT 1:
You can also recreate it with the script below; it stops once the combined log grows past 2 GB:
from deltalake import DeltaTable, write_deltalake, WriterProperties
from deltalake._internal import TableNotFoundError
import pyarrow as pa

# Keep appending commits whose stats are huge; eventually the checkpoint has to
# decode more than 2 GB of stats into a single Utf8 column and blows up.
for j in range(1, 50):
    py_list_ran = {}
    for i in range(1, 100):
        py_list_ran[chr(97) * 4000 * i] = 10000 * i

    py_table = pa.Table.from_pylist([py_list_ran])
    write_deltalake("temp5", py_table, engine="rust", configuration={"delta.dataSkippingNumIndexedCols": "-1"}, mode="append")

    table = DeltaTable("temp5")
    table.create_checkpoint()
    print(table.schema())
self._table = RawDeltaTable(
^^^^^^^^^^^^^^
Exception: Json error: whilst decoding field 'add': whilst decoding field 'stats': offset overflow decoding Utf8
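To see roughly when it tips over, you can sum the size of the stats strings across all commits. A rough sketch, assuming the temp5 table from the script above and the default _delta_log layout:

import glob
import json
import os

total = 0
for commit in sorted(glob.glob(os.path.join("temp5", "_delta_log", "*.json"))):
    with open(commit) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                total += len(action["add"].get("stats") or "")
# The checkpoint writer has to decode all of these stats into a single Utf8
# column; once the sum crosses ~2 GiB the 32-bit offsets overflow.
print(f"{total / 2**30:.2f} GiB of stats across all add actions")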
Hi, sorry for the delay
Here is the schema:
Spark does not return any errors; it simply ignores checkpoints that were written with stats: LargeUtf8.
There are no huge JSON logs, about 30 KB on average.
We currently work around this with a multi-part checkpoint that is created when running OPTIMIZE on Databricks, but this is very slow.
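For reference, a rough sketch of that workaround, with the table path and the partSize value as placeholders (spark.databricks.delta.checkpoint.partSize is the Delta Lake setting that splits checkpoints into parts; double-check it against your runtime):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.checkpoint.partSize", "1000000")  # actions per checkpoint part
    .getOrCreate()
)

# OPTIMIZE produces new commits; the next checkpoint Spark writes
# (every delta.checkpointInterval commits, 10 by default) is split into
# parts once it exceeds partSize actions.
spark.sql("OPTIMIZE delta.`/path/to/table`")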
Environment
Delta-rs version: 0.18.1
Binding: rust
Bug
What happened: While running kafka-delta-ingest, the service crashes with the following error when trying to create a checkpoint:
DeltaTable { source: Arrow { source: JsonError("whilst decoding field 'add': while decoding field 'stats': offset overflow decoding Utf8") } }
The error pops up here: https://github.com/apache/arrow-rs/blob/49840ec0f110da5e9a21ce97affd32313d0b720f/arrow-json/src/reader/string_array.rs#L81
What you expected to happen: The checkpoint is created successfully.
How to reproduce it: Not sure how to reproduce this; it presumably requires a table with a large delta log and a large data volume.
More details: Not sure what additional information to provide. The table is partitioned by year, month, day, and hour; the current transaction number is 00000000000002020757; approximately 4 TB is written to the table per day; each Parquet file is about 800 MB on average.
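If it helps triage, the same checkpoint code path can probably be hit manually from the Python binding by pointing it at the affected table (the URI below is a placeholder):

from deltalake import DeltaTable

# Placeholder table URI; create_checkpoint() should go through the same delta-rs
# checkpoint writer that kafka-delta-ingest calls.
table = DeltaTable("s3://bucket/path/to/table")
table.create_checkpoint()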