BurntSushi / rust-csv

A CSV parser for Rust, with Serde support.
The Unlicense
1.69k stars 219 forks source link

Feature request: please add `invalid_result` deserializer #345

Open lucatrv opened 9 months ago

lucatrv commented 9 months ago

When dealing with invalid data with serde, sometimes it is useful to get also invalid fields when present. The current invalid_option deserializer discards all invalid fields. It would be convenient to have a similar invalid_result deserializer, which returns Ok(value) for all valid fields, and Err(string_field) for all invalid fields.

BurntSushi commented 9 months ago

If this is possible to do in a small patch, like adding a new invalid_result free function, then I'd be happy to take a PR.

lucatrv commented 9 months ago

I could create a PR when I have some time... do you see any other need apart from adding the invalid_result function and document it?

lucatrv commented 9 months ago

@BurntSushi, for now I implemented the following function:

pub fn invalid_result<'de, D, T>(
    de: D,
) -> result::Result<std::result::Result<T, String>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
{
    Ok(T::deserialize(de).or_else(|err| Err(err.to_string())))
}

This works but reports the error message, not the original invalid string value as it should be. So for instance for good values, empty strings, or invalid strings, I am getting respectively:

Record { latitude: 60.5544444, longitude: -151.2583333, population: Ok(7610), city: "Kenai", state: "AK" }

Record { latitude: 65.2419444, longitude: -165.2716667, population: Err("field 2: cannot parse integer from empty string"), city: "Davidsons Landing", state: "AK" }

Record { latitude: 37.3433333, longitude: -86.7136111, population: Err("field 2: invalid digit found in string"), city: "Flint Springs", state: "KY" }

Do you know how to get the original invalid string value from within the function?

lucatrv commented 9 months ago

@BurntSushi, sorry to bother you but I've been trying to implement this for a while without success, and I can't really figure out how to solve this. I'm now wondering if it is feasible at all... should the Error type be extended to report the original csv string? If you can give me some hints on how to modify the function that I reported above (if feasible) then I will complete my PR and issue it. Thanks

lucatrv commented 8 months ago

@BurntSushi I finally got two working functions. The first one is simpler and relies on Content.as_str() for String conversion, thus it can fail. In the second function instead I manually managed each Content variant, applying lossy string conversion when appropriate, so it never fails. However I'm not sure if I managed correctly the None / Some / Unit / Newtype / Seq / Map variants.

pub fn invalid_result<'de, D, T>(
    de: D,
) -> result::Result<result::Result<T, String>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
{
    // Need to use `Content` and `ContentRefDeserializer`, which currently are in
    // Serde private API but may become public in the future, see:
    // https://github.com/serde-rs/serde/issues/741
    // https://github.com/arcnmx/serde-value/issues/16
    let content = serde::__private::de::Content::deserialize(de)?;
    let de = serde::__private::de::ContentRefDeserializer::<D::Error>::new(
        &content,
    );
    if let Ok(t) = T::deserialize(de) {
        Ok(Ok(t))
    } else if let Some(s) = content.as_str() {
        Ok(Err(s.to_owned()))
    } else {
        Err(serde::de::Error::custom("input data is not a valid UTF-8 string"))
    }
}
pub fn invalid_result<'de, D, T>(
    de: D,
) -> result::Result<result::Result<T, String>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
{
    // Need to use `Content` and `ContentRefDeserializer`, which currently are in
    // Serde private API but may become public in the future, see:
    // https://github.com/serde-rs/serde/issues/741
    // https://github.com/arcnmx/serde-value/issues/16
    let content = serde::__private::de::Content::deserialize(de)?;
    let de = serde::__private::de::ContentRefDeserializer::<D::Error>::new(
        &content,
    );
    let result = if let Ok(t) = T::deserialize(de) {
        Ok(t)
    } else {
        Err(match content {
            serde::__private::de::Content::Bool(b) => b.to_string(),
            serde::__private::de::Content::U8(u) => u.to_string(),
            serde::__private::de::Content::U16(u) => u.to_string(),
            serde::__private::de::Content::U32(u) => u.to_string(),
            serde::__private::de::Content::U64(u) => u.to_string(),
            serde::__private::de::Content::I8(i) => i.to_string(),
            serde::__private::de::Content::I16(i) => i.to_string(),
            serde::__private::de::Content::I32(i) => i.to_string(),
            serde::__private::de::Content::I64(i) => i.to_string(),
            serde::__private::de::Content::F32(f) => f.to_string(),
            serde::__private::de::Content::F64(f) => f.to_string(),
            serde::__private::de::Content::Char(c) => c.to_string(),
            serde::__private::de::Content::String(s) => s,
            serde::__private::de::Content::Str(s) => s.to_owned(),
            serde::__private::de::Content::ByteBuf(buf) => {
                String::from_utf8_lossy(&buf).into_owned()
            }
            serde::__private::de::Content::Bytes(bytes) => {
                String::from_utf8_lossy(bytes).into_owned()
            }
            serde::__private::de::Content::None => String::new(),
            serde::__private::de::Content::Some(some) => {
                format!("{:?}", some)
            }
            serde::__private::de::Content::Unit => String::new(),
            serde::__private::de::Content::Newtype(newtype) => {
                format!("{:?}", newtype)
            }
            serde::__private::de::Content::Seq(seq) => format!("{:?}", seq),
            serde::__private::de::Content::Map(m) => format!("{:?}", m),
        })
    };
    Ok(result)
}

I prefer the second function, but please let me know your preference and advice, or otherwise if I should use serde-value as reported in the following docstring. After I get your answer I can create a PR.

/// Used from generated code to buffer the contents of the Deserializer when
/// deserializing untagged enums and internally tagged enums.
///
/// Not public API. Use serde-value instead.
#[derive(Debug, Clone)]
pub enum Content<'de> {
lucatrv commented 7 months ago

@BurntSushi, I think I found a better implementation. I hope you like it, the only drawback is that it requires Serde's derive feature, but I cannot find other valid solutions (either relying on Serde's private API, or on serde-value, or on Serde's derive feature). I am going to create a PR with this implementation.

pub fn invalid_result<'de, D, T, F>(
    de: D,
) -> result::Result<result::Result<T, F>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
    F: Deserialize<'de>,
{
    #[derive(Deserialize)]
    #[serde(untagged)]
    enum DeFallback<T, F> {
        Expected(T),
        Fallback(F),
    }

    DeFallback::<T, F>::deserialize(de).map(|d| match d {
        DeFallback::Expected(t) => Ok(t),
        DeFallback::Fallback(f) => Err(f),
    })
}
lucatrv commented 7 months ago

Never mind, the above code using Serde's derive feature does not work as expected. For instance if a column contains integers, strings and floats, and one tries to deserialize it to Result<i64, String> to retain all invalid values (floats and strings) as Strings, the deserialization would error out at the first float.

So at the moment the only two options that I could find are either relying on Serde's private API or on serde-value:

pub fn invalid_result_private<'de, D, T>(
    de: D,
) -> result::Result<result::Result<T, String>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
{
    let content = serde::__private::de::Content::deserialize(de)?;
    let de = serde::__private::de::ContentRefDeserializer::<D::Error>::new(
        &content,
    );
    let result = T::deserialize(de).map_err(|_| match content {
        serde::__private::de::Content::Bool(b) => b.to_string(),
        serde::__private::de::Content::U8(u) => u.to_string(),
        serde::__private::de::Content::U16(u) => u.to_string(),
        serde::__private::de::Content::U32(u) => u.to_string(),
        serde::__private::de::Content::U64(u) => u.to_string(),
        serde::__private::de::Content::I8(i) => i.to_string(),
        serde::__private::de::Content::I16(i) => i.to_string(),
        serde::__private::de::Content::I32(i) => i.to_string(),
        serde::__private::de::Content::I64(i) => i.to_string(),
        serde::__private::de::Content::F32(f) => f.to_string(),
        serde::__private::de::Content::F64(f) => f.to_string(),
        serde::__private::de::Content::Char(c) => c.to_string(),
        serde::__private::de::Content::String(s) => s,
        serde::__private::de::Content::Str(s) => s.to_owned(),
        serde::__private::de::Content::ByteBuf(buf) => {
            String::from_utf8_lossy(&buf).into_owned()
        }
        serde::__private::de::Content::Bytes(bytes) => {
            String::from_utf8_lossy(bytes).into_owned()
        }
        serde::__private::de::Content::None => String::new(),
        serde::__private::de::Content::Some(some) => {
            format!("{:?}", some)
        }
        serde::__private::de::Content::Unit => String::new(),
        serde::__private::de::Content::Newtype(newtype) => {
            format!("{:?}", newtype)
        }
        serde::__private::de::Content::Seq(seq) => format!("{:?}", seq),
        serde::__private::de::Content::Map(map) => format!("{:?}", map),
    });
    Ok(result)
}
pub fn invalid_result_serdevalue<'de, D, T>(
    de: D,
) -> result::Result<result::Result<T, String>, D::Error>
where
    D: Deserializer<'de>,
    T: Deserialize<'de>,
{
    let value = serde_value::Value::deserialize(de)?;
    let result = T::deserialize(value.clone()).map_err(|_| match value {
        serde_value::Value::Bool(b) => b.to_string(),
        serde_value::Value::U8(u) => u.to_string(),
        serde_value::Value::U16(u) => u.to_string(),
        serde_value::Value::U32(u) => u.to_string(),
        serde_value::Value::U64(u) => u.to_string(),
        serde_value::Value::I8(i) => i.to_string(),
        serde_value::Value::I16(i) => i.to_string(),
        serde_value::Value::I32(i) => i.to_string(),
        serde_value::Value::I64(i) => i.to_string(),
        serde_value::Value::F32(f) => f.to_string(),
        serde_value::Value::F64(f) => f.to_string(),
        serde_value::Value::Char(c) => c.to_string(),
        serde_value::Value::String(s) => s,
        serde_value::Value::Unit => String::new(),
        serde_value::Value::Option(option) => {
            format!("{:?}", option)
        }
        serde_value::Value::Newtype(newtype) => {
            format!("{:?}", newtype)
        }
        serde_value::Value::Seq(seq) => format!("{:?}", seq),
        serde_value::Value::Map(map) => format!("{:?}", map),
        serde_value::Value::Bytes(bytes) => {
            String::from_utf8_lossy(&bytes).into_owned()
        }
    });
    Ok(result)
}