Are IMAP string types actually valid UTF-8?

duesee / imap-codec

Rock-solid and complete codec for IMAP

Apache License 2.0

35 stars 13 forks source link

Are IMAP string types actually valid UTF-8? #506

Closed HenningHolmDE closed 2 weeks ago

HenningHolmDE commented 2 weeks ago

Currently, most of the string types in imap-types are based on Cow<'a, str>. Being a Rust string type, str is guaranteed to only consist of valid UTF-8. Is this restriction actually correct to assume when looking at strings types in IMAP?

For example, I could not quickly find anything in the RFC that prevents an Atom from being valid according to the IMAP specification but not being valid UTF-8:

atom            = 1*ATOM-CHAR
ATOM-CHAR       = <any CHAR except atom-specials>
atom-specials   = "(" / ")" / "{" / SP / CTL / list-wildcards /
                  quoted-specials / resp-specials
...
CHAR8           = %x01-ff
                    ; any OCTET except NUL, %x00

Maybe I am overlooking something important here?

jakoschiko commented 2 weeks ago

I think the IMAP grammar already implies valid UTF-8, so any additional check is unnecessary. Current implementation:

impl<'a> TryFrom<&'a [u8]> for Atom<'a> {
    type Error = ValidationError;

    fn try_from(value: &'a [u8]) -> Result<Self, Self::Error> {
        Self::validate(value)?;

        // Safety: `unwrap` can't panic due to `validate`.
        Ok(Self(Cow::Borrowed(from_utf8(value).unwrap())))
    }
}

The usage of from_utf8 is unnecessary overhead. We can get rid of it by storing &[u8] instead of &str inside of Atom.

HenningHolmDE commented 2 weeks ago

If the incoming bytes are guaranteed to be valid UTF-8 and we want to skip checking, using unsafe { String::from_utf8_unchecked(value) }; would probably be the correct way to do this. However, I don't think we can really assume only ever getting valid UTF-8 for every case out there in the wild. Or does your experience about existing implementations say otherwise?

In my point of view, storing values as raw &[u8] would basically mean "we have some bytes here but we neither trust them to be valid UTF-8 nor are we willing to check that". This would make sense to me, if UTF-8 is not guaranteed (we should probably introduce/use a different type than "a bunch of bytes") or if converting to String afterwards is something that happens not so often, so it would make more sense to do the UTF-8 checking there. But is "give me a string/str" not the common way of using the types?

jakoschiko commented 2 weeks ago

My understanding:

Self::validate(value)? checks if bytes are valid IMAP
Valid IMAP is a subset of valid UTF-8

But is "give me a string/str" not the common way of using the types?

I think the common way is to use the Encoder to convert the types into human-readable text.

duesee commented 2 weeks ago

To recap what we discussed in person: I'm rather sure that all str types in imap-types are guaranteed to be UTF-8. If not, this would be a bug. In the above definition it must be CHAR instead of CHAR8. Note: Some ABNF rules come from the ABNF definition itself. These are called "ABNF core rules" and assumed to always be available.

Regarding unsafe: There was a time in imap-types and -codec where we used from_utf8_unsafe. But, in the end, it was more important for me to #[deny(unsafe)]. I'm open to reconsider, and a benchmark would be a great motivation :-)

Generally speaking, I think that having strings instead of bytes is more convinient. So, this is why I decided to use str when possible. An alternative would be bstr. I think for Tag, Atom, etc. str is fine. But Literal would be a good candidate for bstr.

HenningHolmDE commented 2 weeks ago

Thanks for pointing out the flaw in my asumption! With

         CHAR           =  %x01-7F
                                ; any 7-bit US-ASCII character,
                                ;  excluding NUL

I'm convinced that IMAP string typed are indeed a subset of UTF-8. 😃

Maybe, if UTF-8 checking turns out to be too expensive at some point in time, it would indeed make sense to switch to something else (perhaps &[u8]) for the internal representation. Its content could than be verified when creating an instance (as it is now) and when handing out String/&str types to the consumer of the API, using String::from_utf8_unchecked would be safe as the guarantees have been already been ensured before (being a subset of valid UTF-8).

At first, I though of using something like ascii:AsciiStr instead of str to reflect having ASCII characters better. However, the ascii crate specifically looks a bit abandoned to me and as the checks for most IMAP types are way more strict than "valid ASCII", a custom implementation seems to make more sense to me.

Closing this issue as my question has been comprehensively answered. Thanks again!