1aim / DELETE-THIS-mail-types

Encoding/Decoding (roughly read Generating/Parsing) of mails in rust, including limited unicode support

Implement mime sniffing for Resources #15

Open rustonaut opened 6 years ago

rustonaut commented 6 years ago

The mime_sniffer crate is quite web-specific and not what we need: it does content-type + data = mime, but we need data = mime or failure.

rustonaut commented 6 years ago

tree_magic seems worth a try but:

  1. it doesn't do any charset detection
    • which is fine: for that case we might require a provided mime, as charsets are not necessarily determinable deterministically
  2. the part which loads system mime magic definitions (fdo_magic) is *nix specific
    • fine, as we can just not use it for now; actually, we might consider making it always "opt-in"

rustonaut commented 6 years ago

The problem is that tree_magic::from_u8 never fails; it will always detect some mime, e.g. text/plain or application/octet-stream, but that's quite suboptimal.

First, a UTF-8 file can contain 0x00 bytes but will still be detected as octet-stream. Second, UTF-8 does not allow 0xFF as a byte (and some other encodings do not either, that's why it's used as a magic number), but a file consisting only of 0xFF is still detected as text/plain.
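The two UTF-8 facts above can be checked with the standard library alone. This is only a small sketch; `is_valid_utf8` is a hypothetical helper name, and it says nothing about how tree_magic works internally:

```rust
/// Hypothetical helper: true if `data` is well-formed UTF-8
/// according to the standard library's validator.
fn is_valid_utf8(data: &[u8]) -> bool {
    std::str::from_utf8(data).is_ok()
}
```

With this, `is_valid_utf8(b"a\0b")` holds (a 0x00 byte is a valid UTF-8 code unit), while `is_valid_utf8(&[0xFF, 0xFF])` does not (0xFF never appears in UTF-8).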

Instead of a brave approach, a cautious one is needed: if there are magic numbers, use them, else bail.
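A minimal sketch of what "use magic numbers, else bail" could look like. The function name `sniff_by_magic` and the small signature table are assumptions made for illustration; they are not taken from tree_magic or any other crate mentioned here, and a real table would be much larger:

```rust
/// Hypothetical cautious sniffer: returns a media type only when a
/// well-known magic number matches, and `None` otherwise (no guessing).
fn sniff_by_magic(data: &[u8]) -> Option<&'static str> {
    // (magic prefix, media type); a tiny illustrative subset.
    const MAGIC: &[(&[u8], &str)] = &[
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"\xFF\xD8\xFF", "image/jpeg"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
        (b"%PDF-", "application/pdf"),
        (b"\x1f\x8b", "application/gzip"),
    ];

    for (magic, media_type) in MAGIC {
        if data.starts_with(magic) {
            return Some(*media_type);
        }
    }
    // No magic number matched: bail instead of falling back to
    // text/plain or application/octet-stream.
    None
}
```

The point of returning `Option` is that a buffer of only 0xFF bytes, or arbitrary text, yields `None` rather than a guessed type.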

rustonaut commented 6 years ago

There had been some discussion about how best to handle mime sniffing, mainly:

  1. in template engines, sniffing mimes for attachments/embeddings like images should be automatic
  2. for "special" user-provided content and text it should never be automatic (too unreliable)

The conclusion was that it might be best to make this configurable to some degree.
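One way such configurability might look, purely as a sketch: `SniffPolicy` and `resolve_media_type` are hypothetical names, not existing API of any mail crate, and the already-sniffed value is passed in so the example stays independent of a concrete sniffer:

```rust
/// Hypothetical policy switch for media type sniffing.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SniffPolicy {
    /// Sniff when no media type was provided (e.g. embedded images in templates).
    Automatic,
    /// Never sniff; the media type has to be provided by the user.
    Never,
}

/// Picks the media type to use. A provided type always wins; a sniffed
/// type is only considered under `SniffPolicy::Automatic`.
fn resolve_media_type(
    policy: SniffPolicy,
    provided: Option<&str>,
    sniffed: Option<&str>,
) -> Result<String, &'static str> {
    match (provided, policy) {
        (Some(media_type), _) => Ok(media_type.to_owned()),
        (None, SniffPolicy::Automatic) => sniffed
            .map(str::to_owned)
            .ok_or("could not determine media type"),
        (None, SniffPolicy::Never) => Err("media type must be provided"),
    }
}
```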

rustonaut commented 6 years ago

There is currently a limited implementation in mail-codec-composition which:

  1. uses conduit-media-types to get a media type based on the file ending
  2. uses the file command to get a media type + encoding from the content
  3. checks whether both match

But this is very limited, e.g. .tar.gz/.tgz is not even in conduit-media-types. Also, it is currently only used for sniffing the media types of embeddings in templates, so it is good enough there as it is.
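The three steps above can be sketched roughly as follows. This is not the actual mail-codec-composition code: the extension table is a hypothetical stand-in for conduit-media-types, and the content-based type (which the real implementation gets from the file command) is simply passed in as a string:

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum SniffError {
    /// The file ending is not in the (stand-in) extension table.
    UnknownExtension,
    /// Extension and content disagree about the media type.
    Mismatch { by_extension: String, by_content: String },
}

/// Step 1: media type from the file ending (hypothetical mini table).
fn media_type_from_extension(file_name: &str) -> Option<&'static str> {
    let ext = Path::new(file_name).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "png" => Some("image/png"),
        "jpg" | "jpeg" => Some("image/jpeg"),
        "gif" => Some("image/gif"),
        "pdf" => Some("application/pdf"),
        _ => None,
    }
}

/// Steps 2 and 3: take the content-based media type as input and
/// accept it only if it agrees with the extension-based one.
fn check_media_type(file_name: &str, by_content: &str) -> Result<&'static str, SniffError> {
    let by_extension =
        media_type_from_extension(file_name).ok_or(SniffError::UnknownExtension)?;
    if by_extension.eq_ignore_ascii_case(by_content) {
        Ok(by_extension)
    } else {
        Err(SniffError::Mismatch {
            by_extension: by_extension.to_owned(),
            by_content: by_content.to_owned(),
        })
    }
}
```

As with the limitation described above, a name like `backup.tgz` fails here with `UnknownExtension` because the stand-in table has no entry for it.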

In the future it might make sense to:

  1. have some load-resource post-process hook or similar which can do the sniffing in the mail-codec crate (or is called by it, i.e. not specific to composition templates)
  2. have a more solid media type sniffing crate which prefers failing over detecting the wrong media type (the latter is what most implementations currently do)