BurntSushi / rust-csv

A CSV parser for Rust, with Serde support.
The Unlicense
1.72k stars 219 forks source link

Deserializing a String field inside a flattende struct fails if the field contains a valid integer #344

Open tyilo opened 11 months ago

tyilo commented 11 months ago

What version of the csv crate are you using?

1.3.0

Briefly describe the question, bug or feature request.

When deserializing to a String field using serde, csv can handle a field containing 123. However, when doing the same to a flattened field, csv can no longer deserialize 123 to the String field and I get the error:

Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 24,
                    line: 4,
                    record: 3,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid type: integer `123`, expected a string",
                ),
            },
        },
    ),
)

Include a complete program demonstrating a problem.

use serde::Deserialize;

#[allow(dead_code)]
#[derive(Debug, Deserialize)]
struct Inner {
    inner_str: String,
}

#[allow(dead_code)]
#[derive(Debug, Deserialize)]
struct Row {
    str: String,

    #[serde(flatten)]
    inner: Inner,
}

fn main() {
    let source = r#"
str,inner_str
A,A
123,A
A,123
    "#
    .trim();

    let mut reader = csv::Reader::from_reader(source.as_bytes());
    for line in reader.deserialize::<Row>() {
        dbg!(&line);
    }
}

What is the observed behavior of the code above?

Output:

[src/main.rs:29] &line = Ok(
    Row {
        str: "A",
        inner: Inner {
            inner_str: "A",
        },
    },
)
[src/main.rs:29] &line = Ok(
    Row {
        str: "123",
        inner: Inner {
            inner_str: "A",
        },
    },
)
[src/main.rs:29] &line = Err(
    Error(
        Deserialize {
            pos: Some(
                Position {
                    byte: 24,
                    line: 4,
                    record: 3,
                },
            ),
            err: DeserializeError {
                field: None,
                kind: Message(
                    "invalid type: integer `123`, expected a string",
                ),
            },
        },
    ),
)

What is the expected or desired behavior of the code above?

Output:

[src/main.rs:29] &line = Ok(
    Row {
        str: "A",
        inner: Inner {
            inner_str: "A",
        },
    },
)
[src/main.rs:29] &line = Ok(
    Row {
        str: "123",
        inner: Inner {
            inner_str: "A",
        },
    },
)
[src/main.rs:29] &line = Ok(
    Row {
        str: "A",
        inner: Inner {
            inner_str: "123",
        },
    },
)
BurntSushi commented 11 months ago

This does indeed look like a bug. I don't know when I'll have time to look into it, so patches are welcome if you can find a fix yourself.

I'll note that the presence of serde(flatten) concerns me. IIRC, it required some hacky things to make it work and it may not have been done correctly.

tyilo commented 11 months ago

It seems like serde is calling deserialize_string for str, but deserialize_any for inner_str. Is this a bug in serde?

BurntSushi commented 11 months ago

I don't know. I don't know much about serde internals. I do know that the serde(flatten) thing is a bit of a tortured feature though. It's one of those things that folks don't have to actually use, but it affords some very nice conveniences. But it just doesn't work in all cases.

I won't have time to look into this any time soon I'm afraid.

tyilo commented 11 months ago

Yeah, it seems to be unsupported in serde: https://github.com/serde-rs/serde/issues/1881

That's unfortunate :/

Anders429 commented 3 months ago

I believe the issue here is that CSV is not actually a "self-describing" format, and therefore should not support deserialize_any(). A single element in a CSV does not encode it's type; you can't tell if 12345 is an integer or a string. The deserializer needs to rely on the the type to tell it how to interpret the data, it can't determine it itself.

#[serde(flatten)] is only supported for self-describing formats because it needs to parse the tokens ahead of time using deserialize_any(), since it doesn't know what types the nested fields contain yet. Since the deserialization then simply "guesses" the type, it might guess wrong and you get issues like this.

As far as I can tell, there's no way to actually fix this without fundamentally changing the csv format to be actually self-describing. IMO there shouldn't be a deserialize_any() implementation for this format at all, although that's probably hard to walk back at this point.

See also: serde.rs's documentation for using deserialize_any().

BurntSushi commented 3 months ago

@Anders429 That is a very insightful comment, thank you. I don't think I realized myself at all that flatten support depended on deserialize_any.

Anders429 commented 3 months ago

It's definitely surprising, because you could easily implement most #[serde(flatten)] use cases by hand without needing to call deserialize_any(). I found a couple issues on serde's repository that ask questions about this.

Seems like flattening is hard to support for serde's derive macro, because the macro can't know anything about nested types, and the nested types can't be partially deserialized. It's a limitation that makes flattening pretty cursed; it seems that quick-xml has the same type of issue: https://github.com/tafia/quick-xml/issues/286.

hydra commented 3 months ago

I ran into the exact same issue today, here's my code:

#[derive(Debug, serde::Deserialize)]
#[serde(rename_all(deserialize = "PascalCase"))]
pub struct PartMappingRecord {
    eda: CSVEdaToolValue,

    manufacturer: String,
    mpn: String,

    #[serde(flatten)]
    fields: HashMap<String, String>,
}

the csv file:

"Eda","Manufacturer","Mpn","Name","Value"
"DipTrace","Molex","0479480001","0479480001",""

the error:

Error: Deserializing part mapping record

Caused by:
    CSV deserialize error: record 1 (line: 1, byte: 42): invalid type: integer `479480001`, expected a string

In the above example, the "Mpn" column's value is ok, but the "Name" column, which is flattened into fields is produces an error.

the source: https://github.com/MakerPnP/makerpnp/blob/master/src/stores/part_mappings.rs#L20

For my use-case, I need to support any number of arbitrarily named csv columns, since each different EDA tool that generates placement CSV files will always have their own column names and the part mappings file has to use the same column names too.

Additionally, the 'line' counter seems to be 0 based, which is confusing to humans, since most text editors and IDEs display line numbers starting from 1, for example:

image

Suggestions on workarounds or solutions greatly appreciated!

BurntSushi commented 3 months ago

I think something folks sometimes forget is that serde is an optional convenience feature. You don't actually need to use it! You can use csv::StringRecord and it will handle literally everything and give you infinite flexibility.

BurntSushi commented 3 months ago

Additionally, the 'line' counter seems to be 0 based, which is confusing to humans, since most text editors and IDEs display line numbers starting from 1,

It's not. Or it's not intended to be: https://docs.rs/csv/latest/csv/struct.Position.html#method.line

If you are experiencing the opposite, please open a new issue with an MRE.

hydra commented 3 months ago

I think something folks sometimes forget is that serde is an optional convenience feature. You don't actually need to use it! You can use csv::StringRecord and it will handle literally everything and give you infinite flexibility.

My project uses serde with serde-derive for other serialization tasks so it made sense to use it for CSV files too. I'm just about to start investigating a solution/work-around to this issue and will keep that in mind. thanks for the timely reminder!

Additionally, the 'line' counter seems to be 0 based, which is confusing to humans, since most text editors and IDEs display line numbers starting from 1,

It's not. Or it's not intended to be: https://docs.rs/csv/latest/csv/struct.Position.html#method.line

If you are experiencing the opposite, please open a new issue with an MRE.

ok, I will check again and future errors and will open issues as required.