bojand / infer

Small crate to infer file and MIME type by checking the magic number signature
MIT License

Add support for text/plain #65

Closed mike-burns closed 2 years ago

mike-burns commented 2 years ago

This adds detection for the text/plain MIME type, with a file extension of txt.

Detecting plain text is tricky, and it isn't very clearly spelled out in any of the relevant RFCs (822, 2045, 2046). The best I can figure is "no control characters" (00-1F), but even that isn't quite right: tab, newline, and carriage return are control characters that often occur in plain text.
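The heuristic described above could be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `looks_like_plain_text` is hypothetical, and treating every byte at or above 0x20 as acceptable (including DEL and all non-ASCII bytes) is an assumption made to match the "00-1F" range mentioned in the comment.

```rust
/// Hypothetical helper, not part of the `infer` crate's API.
/// Returns true when the buffer contains no control bytes in 0x00-0x1F,
/// except tab, newline, and carriage return, which are allowed.
fn looks_like_plain_text(buf: &[u8]) -> bool {
    buf.iter()
        .all(|&b| b >= 0x20 || matches!(b, b'\t' | b'\n' | b'\r'))
}
```

Note that this scans every byte of the buffer and that an empty buffer trivially passes, so a real matcher would probably need additional rules.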


While here, update the README with the list of all the text media types.

mike-burns commented 2 years ago

The performance of this is probably poor, since it has to check every byte in the buffer.

I also considered the slice is_ascii method, but that would fail on various pieces of valid plain text containing, say, Japanese characters or emoji.
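To illustrate the problem with `is_ascii`, here is a small sketch using only the standard library. The helper names `is_ascii_only` and `is_valid_utf8` are hypothetical and exist only for this example; the UTF-8 check is shown for contrast and is not something the PR proposes.

```rust
/// Hypothetical wrapper around the standard slice method `[u8]::is_ascii`.
fn is_ascii_only(buf: &[u8]) -> bool {
    buf.is_ascii()
}

/// Hypothetical helper: true when the buffer is well-formed UTF-8.
fn is_valid_utf8(buf: &[u8]) -> bool {
    std::str::from_utf8(buf).is_ok()
}
```

With these, `is_ascii_only("こんにちは".as_bytes())` is false even though the same bytes are valid UTF-8 text. Conversely, `is_ascii` accepts control bytes such as 0x00, so on its own it is neither necessary nor sufficient as a plain-text test.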

We could also make an outlandish claim, for example that it's plain text if the first 16 bytes are plain text. Maybe that's good enough?
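The "first 16 bytes" idea might look something like the sketch below. The function name and the `PREFIX_LEN` constant are hypothetical, and the per-byte rule is the same assumed "no control characters except tab, newline, and carriage return" test discussed earlier in the thread.

```rust
/// Number of leading bytes to inspect; 16 is the figure floated in the comment.
const PREFIX_LEN: usize = 16;

/// Hypothetical helper, not part of the `infer` crate's API.
/// Checks only the first `PREFIX_LEN` bytes (or fewer, for short buffers).
fn prefix_looks_like_text(buf: &[u8]) -> bool {
    let n = buf.len().min(PREFIX_LEN);
    buf[..n]
        .iter()
        .all(|&b| b >= 0x20 || matches!(b, b'\t' | b'\n' | b'\r'))
}
```

This bounds the cost of the check, but the trade-off is visible: a binary file that happens to start with 16 printable bytes would be reported as text.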

bojand commented 2 years ago

Thanks for the PR. As you allude to, this is problematic and tricky because there's no magic number for text-based files. For example, the same test could technically be used for CSV, so how do we reconcile that? Also, as you state, iterating through the whole buffer is less than ideal. So there's no way to use this kind of detection reliably. To me this falls outside the scope of this library. Originally it was intended to work with files that expose a deterministic magic number or header. Arbitrary text-based files do not really fall into that category, and supporting them opens the library to (perceived) incorrect or unexpected behavior.
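The distinction drawn here can be made concrete with a small sketch. It contrasts a fixed-signature check (using the well-known 8-byte PNG signature as an example) with the byte-range text heuristic from earlier in the thread. Both function names are hypothetical, and neither reflects how the `infer` crate implements its matchers internally.

```rust
/// The standard 8-byte PNG file signature.
const PNG_SIGNATURE: [u8; 8] = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];

/// Deterministic: only a fixed-length prefix is compared.
fn has_png_signature(buf: &[u8]) -> bool {
    buf.starts_with(&PNG_SIGNATURE)
}

/// Heuristic (assumed rule from the thread): no control bytes except
/// tab, newline, and carriage return anywhere in the buffer.
fn passes_text_heuristic(buf: &[u8]) -> bool {
    buf.iter()
        .all(|&b| b >= 0x20 || matches!(b, b'\t' | b'\n' | b'\r'))
}
```

A CSV payload such as `b"name,age\nalice,30\n"` and a prose note both pass `passes_text_heuristic`, so the heuristic alone cannot say which of text/plain or text/csv applies, whereas the signature check gives an unambiguous answer for the format it targets.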