apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Use File Format quote when inferring the schema for CSVFormat #5729

Open joao-p-pereira opened 2 months ago

joao-p-pereira commented 2 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When ListingOptions is used to infer the schema of a ListingTableUrl, the resulting schema does not take into account the quote character defined in the file format. As a result, every column whose values are wrapped in that quote character is inferred as Utf8.

Describe the solution you'd like

infer_schema should take the file format's quote character into account when inferring the schema, so that each inferred type is as specific as possible.
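To illustrate why the quote character matters for inference, here is a minimal standalone sketch. It is not the arrow-csv or DataFusion implementation: `InferredType` and `infer_field` are hypothetical names, and the logic is a deliberately simplified assumption about how a quoted field could be typed once its quotes are recognized.

```rust
/// Hypothetical, simplified stand-in for a CSV type-inference result.
#[derive(Debug, PartialEq)]
enum InferredType {
    Int64,
    Float64,
    Utf8,
}

/// Toy inference for a single raw field (illustrative only).
/// If the field is wrapped in `quote`, the quotes are stripped before the
/// value is inspected; otherwise any quote-like characters stay in the value.
fn infer_field(raw: &str, quote: char) -> InferredType {
    let value = raw
        .strip_prefix(quote)
        .and_then(|rest| rest.strip_suffix(quote))
        .unwrap_or(raw);

    if value.parse::<i64>().is_ok() {
        InferredType::Int64
    } else if value.parse::<f64>().is_ok() {
        InferredType::Float64
    } else {
        InferredType::Utf8
    }
}

fn main() {
    // A field from a CSV file that uses the single quote as its quote character.
    let raw = "'42'";

    // With the wrong quote character, the single quotes remain in the value.
    println!("quote = \" -> {:?}", infer_field(raw, '"'));

    // With the matching quote character, the numeric content is visible.
    println!("quote = ' -> {:?}", infer_field(raw, '\''));
}
```

In this sketch the same field is typed as Utf8 when the configured quote does not match and as Int64 when it does, which is the behavior this request asks schema inference to respect.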

Describe alternatives you've considered

Additional context

Lordworms commented 1 month ago

I can do this one

Lordworms commented 1 month ago

It seems this is not a bug: we can pass the quote directly to the Format struct and get the correct answer. Suppose we have a CSV file like

image

and writing a test like

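use std::fs::File;

// Assumes the test lives outside the arrow-csv crate; inside it, use `crate::reader::Format`.
use arrow_csv::reader::Format;

#[test]
fn test_with_quote() {
    // No quote configured, so the reader's default quote character applies.
    let mut file = File::open("test/data/quote.csv").unwrap();
    let (schema, _) = Format::default().infer_schema(&mut file, None).unwrap();
    println!("did not pass quote schema is {:?}", schema);

    // Same file, with the single quote set as the quote character.
    let mut file = File::open("test/data/quote.csv").unwrap();
    let (schema, _) = Format::default()
        .with_quote(b'\'')
        .infer_schema(&mut file, None)
        .unwrap();
    println!("after pass schema is {:?}", schema);
}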

When we pass the single quote to Format, we get a different result:

image