apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Reading json `map` with non-nullable value schema doesn't error if values are actually null #6391

Open nicklan opened 1 month ago

nicklan commented 1 month ago

Describe the bug

If you use an `arrow_json::ReaderBuilder` to read a JSON file and specify a schema containing a map whose value field is declared non-nullable, the reader still accepts files that have null values in the actual JSON map instead of returning an error.

To Reproduce

use std::{fs::File, io::BufReader, sync::Arc};

use arrow::datatypes::{DataType, Field, Schema};

fn main() {
    let schema = Arc::new(Schema::new(vec![
        Field::new("str", DataType::Utf8, false),
        Field::new_map(
            "map",
            "entries",
            Field::new("key", DataType::Utf8, false),
            Field::new("value", DataType::Utf8, false), // value is not nullable
            false, // keys are not sorted
            false, // the map itself is not nullable
        )
    ]));

    let file = File::open("test.json").unwrap();

    let mut json = arrow_json::ReaderBuilder::new(schema).build(BufReader::new(file)).unwrap();
    let batch = json.next().unwrap().unwrap();
    println!("Batch: {batch:#?}");
}

Use this as `test.json`:

{
  "str": "s",
  "map":  {
    "key": null
  }
}

Running produces:

Batch: RecordBatch {
    schema: Schema {
        fields: [
            Field {
                name: "str",
                data_type: Utf8,
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
            Field {
                name: "map",
                data_type: Map(
                    Field {
                        name: "entries",
                        data_type: Struct(
                            [
                                Field {
                                    name: "key",
                                    data_type: Utf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                                Field {
                                    name: "value",
                                    data_type: Utf8,
                                    nullable: false,
                                    dict_id: 0,
                                    dict_is_ordered: false,
                                    metadata: {},
                                },
                            ],
                        ),
                        nullable: false,
                        dict_id: 0,
                        dict_is_ordered: false,
                        metadata: {},
                    },
                    false,
                ),
                nullable: false,
                dict_id: 0,
                dict_is_ordered: false,
                metadata: {},
            },
        ],
        metadata: {},
    },
    columns: [
        StringArray
        [
          "s",
        ],
        MapArray
        [
          StructArray
        [
        -- child 0: "key" (Utf8)
        StringArray
        [
          "key",
        ]
        -- child 1: "value" (Utf8)
        StringArray
        [
          null,
        ]
        ],
        ],
    ],
    row_count: 1,
}

Note: I've included the `str` field so you can easily see that the right thing happens for a top-level non-nullable field. If you change your .json file to

{
  "str": null,
  "map":  {
    "key": null
  }
}

You will get:

called `Result::unwrap()` on an `Err` value: JsonError("Encountered unmasked nulls in non-nullable StructArray child: Field { name: \"str\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }")

Expected behavior

Reading a null map value against a non-nullable `value` field should return an error similar to the one raised when the `str` field is set to null.
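
To make the expected behavior concrete, here is a minimal sketch of the check applied recursively so that it also reaches fields nested inside a map. It uses only the standard library: `FieldSpec`, `Column`, `check_nulls`, `leaf`, and `check_reported_map` are hypothetical names for illustration, not arrow-rs types or functions, and this is not how arrow-rs implements validation. It also ignores masking by parent nulls, which the real "unmasked nulls" check accounts for.

```rust
/// Hypothetical stand-in for a schema field (not the arrow-rs `Field` type).
struct FieldSpec {
    name: &'static str,
    nullable: bool,
    children: Vec<FieldSpec>,
}

/// Hypothetical stand-in for a decoded column: one validity flag per slot
/// (`false` means null), plus any nested child columns.
struct Column {
    validity: Vec<bool>,
    children: Vec<Column>,
}

/// Returns an error naming the first non-nullable field, at any depth,
/// whose column contains a null slot.
fn check_nulls(field: &FieldSpec, column: &Column) -> Result<(), String> {
    if !field.nullable && column.validity.contains(&false) {
        return Err(format!("nulls in non-nullable field: {}", field.name));
    }
    for (child_field, child_column) in field.children.iter().zip(&column.children) {
        check_nulls(child_field, child_column)?;
    }
    Ok(())
}

/// Builds a field with no nested children.
fn leaf(name: &'static str, nullable: bool) -> FieldSpec {
    FieldSpec { name, nullable, children: Vec::new() }
}

/// Models the `map` column from the report: {"key": null} read against a
/// schema whose `value` field is non-nullable.
fn check_reported_map() -> Result<(), String> {
    let entries = FieldSpec {
        name: "entries",
        nullable: false,
        children: vec![leaf("key", false), leaf("value", false)],
    };
    let map = FieldSpec { name: "map", nullable: false, children: vec![entries] };

    let keys = Column { validity: vec![true], children: Vec::new() };
    // The single map value is null, as in the JSON file above.
    let values = Column { validity: vec![false], children: Vec::new() };
    let entries_col = Column { validity: vec![true], children: vec![keys, values] };
    let map_col = Column { validity: vec![true], children: vec![entries_col] };

    check_nulls(&map, &map_col)
}
```

In this model, `check_reported_map()` returns an `Err` that names the `value` field, which is the kind of result I'd expect from the reader for the first JSON file.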

Additional context

nicklan commented 5 days ago

Ohh jeez, GitHub automatically closed this because of a PR I made that just mentions it. This is not fixed!