apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Empty strings in CSV files aren't being interpreted as null when using a `Dictionary(_, Utf8)` #12041

Open rumpuslabs opened 2 months ago

rumpuslabs commented 2 months ago

Describe the bug

Related to #7797

Empty strings in CSV files aren't interpreted as null when the column is typed as `Dictionary(_, Utf8)`.

To Reproduce

Create a simple input.csv file like this:

id,name
1,
2,bob

Run the following code:

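use std::sync::Arc;

// `arrow` used further down is the arrow crate, which is also re-exported as `datafusion::arrow`.
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::datasource::file_format::csv::CsvFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::error::DataFusionError;
use datafusion::prelude::*;

#[tokio::main]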
async fn main() -> Result<(), DataFusionError> {
    let ctx = SessionContext::new();

    let format = CsvFormat::default();
    let listing_options = ListingOptions::new(Arc::new(format));
    ctx.register_listing_table(
        "input",
        "input.csv",
        listing_options.clone(),
        Some(Arc::new(Schema::new(vec![
            Field::new("id", DataType::Utf8, false),
            Field::new(
                "name",
                DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8)),
                true,
            ),
        ]))),
        None,
    )
    .await?;

    let results = ctx
        .table("input")
        .await?
        .filter(col("name").is_not_null())?
        .collect()
        .await?;

    let pretty_results = arrow::util::pretty::pretty_format_batches(&results)?.to_string();

    println!("{}", pretty_results);

    Ok(())
}

Expected behavior

I was expecting the output to look like this:

+----+------+
| id | name |
+----+------+
| 2  | bob  |
+----+------+

But the full dataset is returned instead:

+----+------+
| id | name |
+----+------+
| 1  |      |
| 2  | bob  |
+----+------+

Additional context

Tested on v41.0.0

Replacing `DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8))` with `DataType::Utf8` produces the expected output.

edmondop commented 2 months ago

take

edmondop commented 3 weeks ago

@alamb shouldn't the csv reader also throw an error because "bob" is not a valid dictionary?

alamb commented 3 weeks ago

I agree the discrepancy between Utf8 and Dictionary looks like a bug.

> @alamb shouldn't the csv reader also throw an error because "bob" is not a valid dictionary?

I think "bob" is a valid value for a DictionaryArray (whose values are Strings)
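
To picture why "bob" is a valid value and why an empty field can end up non-null, here is a minimal std-only sketch of dictionary encoding. It is a simplified model, not the arrow `DictionaryArray` implementation and not a diagnosis of the actual bug: `DictColumn`, `from_csv_fields`, and the `empty_is_null` flag are hypothetical names made up for illustration.

```rust
/// Simplified model of a dictionary-encoded string column (hypothetical type,
/// not the arrow API). `values` stores each distinct string once; `keys` has
/// one entry per row and a `None` key marks a null row.
#[derive(Debug, Default)]
struct DictColumn {
    values: Vec<String>,
    keys: Vec<Option<u8>>,
}

impl DictColumn {
    /// Append one row, reusing an existing dictionary slot when possible.
    /// Sketch only: no handling for more than 256 distinct values.
    fn push(&mut self, value: Option<&str>) {
        let key = value.map(|v| {
            match self.values.iter().position(|existing| existing == v) {
                Some(idx) => idx as u8,
                None => {
                    self.values.push(v.to_string());
                    (self.values.len() - 1) as u8
                }
            }
        });
        self.keys.push(key);
    }

    /// Decode row `i` back to an optional string.
    fn get(&self, i: usize) -> Option<&str> {
        self.keys[i].map(|k| self.values[k as usize].as_str())
    }

    /// Row indices that would survive an `is_not_null` filter.
    fn non_null_rows(&self) -> Vec<usize> {
        (0..self.keys.len())
            .filter(|&i| self.get(i).is_some())
            .collect()
    }
}

/// Build a column from raw CSV fields. With `empty_is_null` set, an empty
/// field becomes a null row; otherwise "" is stored as an ordinary value.
fn from_csv_fields(fields: &[&str], empty_is_null: bool) -> DictColumn {
    let mut col = DictColumn::default();
    for &field in fields {
        if empty_is_null && field.is_empty() {
            col.push(None);
        } else {
            col.push(Some(field));
        }
    }
    col
}

fn main() {
    // The `name` column from input.csv above: row 0 is empty, row 1 is "bob".
    let fields = ["", "bob"];

    let empty_as_value = from_csv_fields(&fields, false);
    let empty_as_null = from_csv_fields(&fields, true);

    println!("{:?}", empty_as_value.non_null_rows()); // prints [0, 1]
    println!("{:?}", empty_as_null.non_null_rows()); // prints [1]
}
```

In this model any string, including "bob" or "", can be a dictionary value, so the two results differ only in whether the empty field was turned into a null key before encoding. That mirrors the difference between the reported output and the expected one.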