Closed: ryanaston closed this issue 1 month ago
Update:
There are several concerns going on here. First, there are shortcomings in arrow causing issues with arbitrary_precision and scientific notation. I have opened two feature requests in the arrow-rs project to address these:
Second, delta-rs is using `f64` as a stand-in for decimals, causing precision loss. I know Rust does not have a native decimal type, but this seems like a big oversight. For now I've added the `bigdecimal` crate to a fork of this library. If this seems like the right direction for delta-rs broadly, I'm happy to clean it up and submit a PR to this repo.
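The precision loss from using `f64` as a decimal stand-in can be demonstrated with a stdlib-only sketch (illustrative only, not delta-rs code):

```rust
fn main() {
    // f64 is binary floating point, so most decimal fractions have no
    // exact representation; routing decimal values through f64 silently
    // perturbs them.
    let sum = 0.1_f64 + 0.2_f64;

    // Shortest round-trip formatting exposes the drift.
    assert_eq!(format!("{sum:?}"), "0.30000000000000004");
    assert_ne!(sum, 0.3);

    println!("0.1 + 0.2 = {sum:?}");
}
```

A decimal type such as `BigDecimal` stores the digits exactly, which is why it avoids this class of error.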
@ryanaston - of course, we are always happy about PRs.
In this case the challenge may be that we need to stay true to the Delta protocol, which supports precision/scale only up to 38.
However, there may be a bug regarding writing decimal values through the JSON writer anyway. Does this error only apply to high-precision decimals, or to decimals in general?
Environment
Delta-rs version: 0.15.0 (also tried 0.16.1)
Binding: Rust
Environment:
Bug
What happened: Writes began failing when attempting to insert high-precision decimal values into a Delta table using the `JsonWriter` with a `Vec<serde_json::Value>`. Discovered `serde_json` was deserializing these values as strings in scientific notation, which could not be parsed into the Arrow `DecimalType`:

```
Generic DeltaTable error: Failed to convert into Arrow schema: Parser error: can't parse the string value 3.9178294781e-6 to decimal
```
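One conceivable workaround at this layer is to expand scientific notation into plain decimal form before the value reaches Arrow's parser. A stdlib-only sketch; the helper name and behavior are assumptions, not part of delta-rs:

```rust
/// Hypothetical helper: expand a scientific-notation decimal string
/// (e.g. "3.9178294781e-6") into plain notation ("0.0000039178294781").
/// Sketch only: assumes a well-formed input, no exhaustive validation.
fn expand_scientific(s: &str) -> String {
    let lower = s.to_ascii_lowercase();
    let Some((mantissa, exp)) = lower.split_once('e') else {
        return s.to_string(); // no exponent, already plain notation
    };
    let exp: i32 = exp.parse().expect("invalid exponent");
    let neg = mantissa.starts_with('-');
    let mantissa = mantissa.trim_start_matches(|c| c == '-' || c == '+');
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    let digits = format!("{int_part}{frac_part}");
    // The decimal point sits after int_part.len() digits; exp shifts it.
    let point = int_part.len() as i32 + exp;
    let mut out = String::new();
    if neg {
        out.push('-');
    }
    if point <= 0 {
        // Point shifted left past all digits: pad with leading zeros.
        out.push_str("0.");
        out.extend(std::iter::repeat('0').take(-point as usize));
        out.push_str(&digits);
    } else if point as usize >= digits.len() {
        // Point shifted right past all digits: pad with trailing zeros.
        out.push_str(&digits);
        out.extend(std::iter::repeat('0').take(point as usize - digits.len()));
    } else {
        out.push_str(&digits[..point as usize]);
        out.push('.');
        out.push_str(&digits[point as usize..]);
    }
    out
}

fn main() {
    assert_eq!(expand_scientific("3.9178294781e-6"), "0.0000039178294781");
    assert_eq!(expand_scientific("1.5e3"), "1500");
    println!("ok");
}
```

Applied to the string values before schema conversion, this would sidestep the parser error, though a proper decimal type is still the cleaner fix.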
Some digging uncovered the `serde_json` feature flag `arbitrary_precision`, which retains the value in its full form, stored as a string; however, this too cannot be decoded to an Arrow `DecimalType`:

```
Generic DeltaTable error: Failed to convert into Arrow schema: Json error: whilst decoding field 'decimal_col': expected decimal got {"$serde_json::private::Number": "0.0000039178294781"}
```
What you expected to happen: High-precision decimal values should be written accurately and successfully to a Delta table.
How to reproduce it:
1. `Cargo.toml`
2. `src/main.rs`
3. `cargo run`
More details: Lower-precision decimals (5 digits or fewer) do not have this issue.

Using `arrow_json` to parse the values into a `RecordBatch` and then using `RecordBatchWriter` instead of `JsonWriter` with `serde_json` works; however, other Delta log interactions such as `create_checkpoint` use `serde_json` behind the scenes, so when the stats are read from the logs to be written to Parquet checkpoints, the same issue occurs.