delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
2.35k stars 414 forks source link

JsonWriter does not respect delta.dataSkippingStatsColumns #2982

Closed jusjosj closed 4 days ago

jusjosj commented 2 weeks ago

Environment

Delta-rs version: 0.21.0

Binding: Rust

Environment:

What happened: JsonWriter and RecordBatchWriter do not respect delta.dataSkippingStatsColumns

What you expected to happen: delta.dataSkippingStatsColumns should be able to limit stats collection

How to reproduce it: Run the following test case for json.rs and observe delta.dataSkippingStatsColumns set properly, but the stats record containing values for all columns instead of just id


        async fn test_json_write_dataskipping() {
            use std::fs::File;
            use std::io::Read;
            use crate::operations::create::CreateBuilder;

            let table_dir = tempfile::tempdir().unwrap();
            let schema = get_delta_schema();
            let path = table_dir.path().to_str().unwrap().to_string();
            let config: HashMap<String, Option<String>> =
            vec![("delta.dataSkippingStatsColumns".to_string(), Some("id".to_string()))]
                .into_iter()
                .collect();
            let mut table = CreateBuilder::new()
                .with_location(&path)
                .with_table_name("test-table")
                .with_comment("A table for running tests")
                .with_columns(schema.fields().cloned())
                .with_configuration(config)
                .await
                .unwrap();
            table.load().await.expect("Failed to load table");
            assert_eq!(table.version(), 0);
            let mut writer = JsonWriter::for_table(&table).unwrap();
            let data = serde_json::json!(
                {
                    "id" : "A",
                    "value": 42,
                    "modified": "2021-02-01"
                }
            );

            writer.write(vec![data]).await.unwrap();
            writer.flush_and_commit(&mut table).await.unwrap();
            assert_eq!(table.version(), 1);
            let log_file_path = path.clone() + "/_delta_log/00000000000000000000.json";
            let mut log_file = File::open(log_file_path).unwrap();
            let mut contents = String::new();
            log_file.read_to_string(&mut contents).unwrap();
            println!("{}", contents);
            println!("");
            let log_file_path = path + "/_delta_log/00000000000000000001.json";
            let mut log_file = File::open(log_file_path).unwrap();
            let mut contents = String::new();
            log_file.read_to_string(&mut contents).unwrap();
            println!("{}", contents);
        }

More details:

Issue is caused by this block of code in the flush function where it is hardcoding the stats_collection information when creating the add actions instead of pulling this information from the table.

            actions.push(create_add(
                &writer.partition_values,
                path.to_string(),
                file_size,
                &metadata,
                DEFAULT_NUM_INDEX_COLS,
                &None,
            )?);
        }
jusjosj commented 2 weeks ago

I am willing to submit a fix for this issue and have sometime next week to do so.

The way I plan to tackle this is to add stats fields for JsonWriter struct then hydrate them from the DeltaTable object in try_new or for_table. If the stats settings are not present, I will default the fields to the existing defaults.