Closed mike-burns closed 2 years ago
The performance on this is probably poor.
I also considered the slice `is_ascii` method, but that would fail on various pieces of valid plain text containing, say, Japanese characters or emoji.
We could also make an outlandish claim, for example that it's plain text if the first 16 bytes are plain text. Maybe that's good enough?
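To make the trade-off concrete, here is a minimal sketch of the prefix idea. The names `is_disallowed_control` and `prefix_looks_like_text` are hypothetical and not part of this PR or the library; the definition of "plain text" used here (no control bytes below 0x20 other than tab, newline, and carriage return) is an assumption.

```rust
/// Hypothetical helper: true if `b` is a control byte in 0x00-0x1F
/// other than tab, line feed, or carriage return.
fn is_disallowed_control(b: u8) -> bool {
    b < 0x20 && !matches!(b, b'\t' | b'\n' | b'\r')
}

/// Hypothetical helper: inspects at most `limit` leading bytes and
/// ignores everything after them. An empty buffer is reported as text.
fn prefix_looks_like_text(buf: &[u8], limit: usize) -> bool {
    let end = buf.len().min(limit);
    !buf[..end].iter().any(|&b| is_disallowed_control(b))
}
```

With this sketch, `"こんにちは".as_bytes().is_ascii()` is false while `prefix_looks_like_text("こんにちは".as_bytes(), 16)` is true, because every byte of a multi-byte UTF-8 sequence is 0x80 or above. The cost is bounded at 16 bytes, but a NUL at offset 16 or later goes unnoticed, so binary data with a printable prefix would be misreported.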
Thanks for the PR. As you allude to, this is somewhat problematic and tricky, as there's no magic number for text-based files. For example, the same test could technically match CSV, so how do we reconcile that? Also, as you state, iterating through the whole buffer is less than ideal. So there's no way to use this kind of detection reliably. To me this falls outside the scope of this library. Originally it was intended to work with files that expose a deterministic magic number or header. Arbitrary text-based files do not really fall into that category, and supporting them opens the library to (perceived) incorrect or unexpected behavior.
This adds detection for the `text/plain` MIME type, with a file extension of `txt`.

Detecting plain text is tricky, and it isn't very clearly spelled out in any of the relevant RFCs (822, 2045, 2046). The best I can figure is "no control characters" (00-1F), but even that isn't quite right: tab, newline, and carriage return are control characters that often occur in plain text.
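The heuristic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the function name `looks_like_plain_text` is hypothetical, and the sketch neither checks DEL (0x7F) nor validates UTF-8.

```rust
/// Hypothetical sketch: true if no byte is a control character in
/// 0x00-0x1F, except tab, line feed, and carriage return.
/// Scans the whole buffer, so the cost grows with its length.
fn looks_like_plain_text(buf: &[u8]) -> bool {
    buf.iter()
        .all(|&b| b >= 0x20 || matches!(b, b'\t' | b'\n' | b'\r'))
}
```

Non-ASCII text such as Japanese or emoji passes, since UTF-8 encodes those characters using only bytes at or above 0x80. Any buffer containing a NUL byte is rejected. Note that the check says nothing about structure: a CSV file passes it as well.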
While here, update the README with the list of all the `text` media types.