Closed HenningHolmDE closed 2 weeks ago
I think the IMAP grammar already implies valid UTF-8, so any additional check is unnecessary. Current implementation:
impl<'a> TryFrom<&'a [u8]> for Atom<'a> {
type Error = ValidationError;
fn try_from(value: &'a [u8]) -> Result<Self, Self::Error> {
Self::validate(value)?;
// Safety: `unwrap` can't panic due to `validate`.
Ok(Self(Cow::Borrowed(from_utf8(value).unwrap())))
}
}
The usage of from_utf8
is unnecessary overhead. We can get rid of it by storing &[u8]
instead of &str
inside of Atom
.
If the incoming bytes are guaranteed to be valid UTF-8 and we want to skip checking, using unsafe { String::from_utf8_unchecked(value) };
would probably be the correct way to do this.
However, I don't think we can really assume only ever getting valid UTF-8 for every case out there in the wild. Or does your experience about existing implementations say otherwise?
In my point of view, storing values as raw &[u8]
would basically mean "we have some bytes here but we neither trust them to be valid UTF-8 nor are we willing to check that". This would make sense to me, if UTF-8 is not guaranteed (we should probably introduce/use a different type than "a bunch of bytes") or if converting to String
afterwards is something that happens not so often, so it would make more sense to do the UTF-8 checking there.
But is "give me a string/str" not the common way of using the types?
My understanding:
Self::validate(value)?
checks if bytes are valid IMAPBut is "give me a string/str" not the common way of using the types?
I think the common way is to use the Encoder
to convert the types into human-readable text.
To recap what we discussed in person: I'm rather sure that all str
types in imap-types are guaranteed to be UTF-8. If not, this would be a bug. In the above definition it must be CHAR
instead of CHAR8
. Note: Some ABNF rules come from the ABNF definition itself. These are called "ABNF core rules" and assumed to always be available.
Regarding unsafe
: There was a time in imap-types and -codec where we used from_utf8_unsafe
. But, in the end, it was more important for me to #[deny(unsafe)]
. I'm open to reconsider, and a benchmark would be a great motivation :-)
Generally speaking, I think that having strings instead of bytes is more convinient. So, this is why I decided to use str
when possible. An alternative would be bstr
. I think for Tag
, Atom
, etc. str
is fine. But Literal
would be a good candidate for bstr
.
Thanks for pointing out the flaw in my asumption! With
CHAR = %x01-7F
; any 7-bit US-ASCII character,
; excluding NUL
I'm convinced that IMAP string typed are indeed a subset of UTF-8. 😃
Maybe, if UTF-8 checking turns out to be too expensive at some point in time, it would indeed make sense to switch to something else (perhaps &[u8]
) for the internal representation. Its content could than be verified when creating an instance (as it is now) and when handing out String
/&str
types to the consumer of the API, using String::from_utf8_unchecked
would be safe as the guarantees have been already been ensured before (being a subset of valid UTF-8).
At first, I though of using something like ascii:AsciiStr instead of str
to reflect having ASCII characters better. However, the ascii
crate specifically looks a bit abandoned to me and as the checks for most IMAP types are way more strict than "valid ASCII", a custom implementation seems to make more sense to me.
Closing this issue as my question has been comprehensively answered. Thanks again!
Currently, most of the string types in
imap-types
are based onCow<'a, str>
. Being a Rust string type,str
is guaranteed to only consist of valid UTF-8. Is this restriction actually correct to assume when looking at strings types in IMAP?For example, I could not quickly find anything in the RFC that prevents an Atom from being valid according to the IMAP specification but not being valid UTF-8:
Maybe I am overlooking something important here?