delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Unable to write high-precision decimal values to Delta table using serde_json/JsonWriter #1778

Closed ryanaston closed 1 month ago

ryanaston commented 9 months ago

Environment

Delta-rs version: 0.15.0 (also tried 0.16.1)

Binding: Rust

Environment:


Bug

What happened: Writes began failing when attempting to insert high-precision decimal values into a Delta table using the JsonWriter with a Vec<serde_json::Value>. I discovered that serde_json was deserializing these values as strings in scientific notation, which could not be parsed into the Arrow DecimalType:

Generic DeltaTable error: Failed to convert into Arrow schema: Parser error: can't parse the string value 3.9178294781e-6 to decimal

Some digging uncovered the serde_json feature flag "arbitrary_precision", which retains the value in its full form, stored as a string. However, this too cannot be decoded to an Arrow DecimalType:

Generic DeltaTable error: Failed to convert into Arrow schema: Json error: whilst decoding field 'decimal_col': expected decimal got {"$serde_json::private::Number": "0.0000039178294781"}
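As a possible stopgap for the first error, the scientific-notation string could be expanded back into a plain decimal string before it reaches the decimal parser. The sketch below uses only the standard library and works on the digits directly, so nothing passes through f64. The function name expand_scientific is hypothetical and is not part of delta-rs, arrow or serde_json. It does not normalize leading zeros in the mantissa.

```rust
/// Rewrites a number in scientific notation (e.g. "3.9178294781e-6") as a
/// plain decimal string by moving the decimal point in the digit string.
/// Returns None if the input is not a simple decimal or scientific literal.
fn expand_scientific(input: &str) -> Option<String> {
    // Split off an optional sign.
    let (sign, body) = match input.strip_prefix('-') {
        Some(rest) => ("-", rest),
        None => ("", input.strip_prefix('+').unwrap_or(input)),
    };

    // Split mantissa and exponent; a missing exponent is treated as 0.
    let (mantissa, exponent) = match body.split_once(|c: char| c == 'e' || c == 'E') {
        Some((m, e)) => (m, e.parse::<i64>().ok()?),
        None => (body, 0),
    };
    // Arbitrary guard (an assumption of this sketch) against huge allocations.
    if exponent.unsigned_abs() > 1_000 {
        return None;
    }

    // Split the mantissa into integer and fractional digits.
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    let digits = format!("{int_part}{frac_part}");
    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }

    // Position of the decimal point inside `digits` after applying the exponent.
    let point = int_part.len() as i64 + exponent;

    let unsigned = if point <= 0 {
        // Point lies left of all digits: pad with leading zeros.
        format!("0.{}{}", "0".repeat((-point) as usize), digits)
    } else if point as usize >= digits.len() {
        // Point lies right of all digits: pad with trailing zeros.
        format!("{}{}", digits, "0".repeat(point as usize - digits.len()))
    } else {
        let (whole, frac) = digits.split_at(point as usize);
        format!("{whole}.{frac}")
    };
    Some(format!("{sign}{unsigned}"))
}
```

Calling expand_scientific("3.9178294781e-6") yields the string 0.0000039178294781, which is the form the decimal parser accepted for lower-precision values.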

What you expected to happen: High-precision decimal values should be written accurately and successfully to a Delta table.

How to reproduce it:

Cargo.toml

[package]
name = "decimal_issue"
version = "0.1.0"
edition = "2021"

[dependencies]
deltalake = "0.15.0"
serde = "1"
serde_json = { version = "1", features = ["arbitrary_precision"] } # remove feature to see original behavior
tokio = "1.33.0"

src/main.rs

use deltalake::{operations::create::CreateBuilder, writer::{JsonWriter, DeltaWriter}};

#[tokio::main]
async fn main() {
    // Parse the JSON payload; the value has 16 decimal places.
    let data = serde_json::from_str::<Vec<serde_json::Value>>(
        r#"[{"decimal_col": 0.0000039178294781}]"#,
    )
    .unwrap();
    // Create an in-memory table with a single nullable decimal(38,16) column.
    let table = CreateBuilder::new()
        .with_location("memory://")
        .with_column(
            "decimal_col",
            deltalake::SchemaDataType::primitive("decimal(38,16)".to_string()),
            true,
            None,
        )
        .await
        .unwrap();
    let mut writer = JsonWriter::for_table(&table).unwrap();

    match writer.write(data).await {
        Ok(_) => {},
        Err(err) => {
            eprintln!("{}", err);
            std::process::exit(1);
        }
    }
}

cargo run

More details: Lower precision decimals (5 or less) do not have this issue.

Using arrow_json to parse the value into a RecordBatch and then writing it with RecordBatchWriter, instead of using serde_json with JsonWriter, works. The problem is that other Delta log interactions, such as create_checkpoint, use serde_json behind the scenes, so the same issue occurs when the stats are read from the logs to be written to Parquet checkpoints.

ryanaston commented 8 months ago

Update:

There are several concerns going on here. First, there are shortcomings in arrow causing issues with arbitrary_precision and scientific notation. I have opened two feature requests in the arrow-rs project to address these:

  1. https://github.com/apache/arrow-rs/issues/5068
  2. https://github.com/apache/arrow-rs/issues/5069

Second, delta-rs is using f64 as a stand-in for decimals, causing precision loss. I know Rust does not have a native decimal type, but this seems like a big oversight. For now I've added the BigDecimal crate to a fork of this library. If this seems like the right direction for delta-rs broadly I'm happy to clean it up and submit a PR to this repo.
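To illustrate the precision point: an f64 carries roughly 15 to 17 significant decimal digits, while a 128-bit decimal stores an unscaled integer plus a scale. The sketch below parses a plain decimal string straight into an unscaled i128 without touching f64. It is standard-library only and is not the approach used in the fork mentioned above (which uses BigDecimal); the function name parse_unscaled is hypothetical.

```rust
/// Parses a plain decimal string (no exponent) into an unscaled i128 for a
/// decimal type with the given scale, e.g. "1.5" at scale 3 becomes 1500.
/// Returns None on malformed input, on more fractional digits than `scale`,
/// or on i128 overflow.
fn parse_unscaled(input: &str, scale: usize) -> Option<i128> {
    let (negative, body) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    let (int_part, frac_part) = body.split_once('.').unwrap_or((body, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    // Refuse to silently round away digits that do not fit the scale.
    if frac_part.len() > scale {
        return None;
    }

    // Feed integer digits, fractional digits, then zero padding up to `scale`.
    let padding = std::iter::repeat('0').take(scale - frac_part.len());
    let mut value: i128 = 0;
    for ch in int_part.chars().chain(frac_part.chars()).chain(padding) {
        let digit = ch.to_digit(10)? as i128;
        value = value.checked_mul(10)?.checked_add(digit)?;
    }
    Some(if negative { -value } else { value })
}
```

For the value from the reproduction, parse_unscaled("0.0000039178294781", 16) returns Some(39178294781), which is the integer a decimal(38,16) column would hold. A 36-digit input such as 12345678901234567890.1234567890123456 also survives intact, whereas parsing the same string as f64 and printing it back does not reproduce the original digits.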

roeap commented 6 months ago

@ryanaston - of course, we are always happy about PRs.

In this case, the challenge may be that we need to stay true to the Delta protocol, which supports a precision / scale of at most 38.
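A minimal sketch of what enforcing that limit could look like, assuming the decimal is held as an unscaled i128 (the layout a 128-bit decimal uses). The names MAX_PRECISION and fits_precision are hypothetical and do not come from delta-rs.

```rust
/// Maximum decimal precision, per the limit of 38 mentioned above.
const MAX_PRECISION: u32 = 38;

/// Returns true if `unscaled` has at most `precision` decimal digits and
/// `precision` itself is within 1..=MAX_PRECISION.
fn fits_precision(unscaled: i128, precision: u32) -> bool {
    if precision == 0 || precision > MAX_PRECISION {
        return false;
    }
    // 10^38 is below i128::MAX, so this power cannot overflow here.
    let limit = 10i128.pow(precision) as u128;
    unscaled.unsigned_abs() < limit
}
```

With this check, the unscaled value 39178294781 from the reproduction passes for precision 38, while 100000 is rejected for precision 5 because it needs six digits.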

However, there may be a bug in writing decimal values through the JSON writer anyway. Does this error only apply to high-precision decimals, or to decimals in general?