Closed by roomthily 9 years ago
See 5e73ca6 for two new methods: one replaces any mimetype in a blob of text, and the other extracts known mimetypes (and replaces them in the blob of text).
Note that the included corpus is not complete. The final corpus will be based on the IANA set plus any geospatial mimetypes (not generally found in the IANA list) and any other related mimetypes.
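For anyone reading along, here is a rough sketch of what those two operations could look like. It is not the code in 5e73ca6: the function names, the regex, and the four-entry `KNOWN_MIMETYPES` set are all assumptions made up for illustration, and the real corpus would be the IANA-plus-geospatial list described above.

```python
import re

# Hypothetical stand-in for the real corpus (IANA set plus geospatial additions).
KNOWN_MIMETYPES = {
    "application/json",
    "application/xml",
    "text/html",
    "application/vnd.google-earth.kml+xml",
}

# Generic shape of a mimetype: a common top-level type, a slash, and a subtype.
# The subtype must end on an alphanumeric so trailing punctuation is not swallowed.
GENERIC_MIMETYPE = re.compile(
    r"\b(?:application|audio|image|message|multipart|text|video)"
    r"/[A-Za-z0-9](?:[A-Za-z0-9.+\-]*[A-Za-z0-9])?",
    re.IGNORECASE,
)


def replace_any_mimetype(text, replacement=" "):
    """Replace anything that looks like a mimetype, known or not."""
    return GENERIC_MIMETYPE.sub(replacement, text)


def extract_known_mimetypes(text, known=KNOWN_MIMETYPES, replacement=" "):
    """Return (found, cleaned): mimetypes present in `known`, and the text
    with only those occurrences replaced."""
    found = []

    def _swap(match):
        candidate = match.group(0).lower()
        if candidate in known:
            found.append(candidate)
            return replacement
        # Unknown mimetype-like strings are left in place.
        return match.group(0)

    cleaned = GENERIC_MIMETYPE.sub(_swap, text)
    return found, cleaned
```

The first function is pattern-only, so it also catches mimetypes missing from the corpus; the second is limited by the corpus, which is why its completeness matters.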
As fantastic as this topic is:
it is not so great for LDA results. I'm shifting the mimetype identification to some other feature classification anyway.
So, exclude mimetypes from the bag of words.
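A minimal sketch of that exclusion step, assuming a simple regex-based tokenizer. The `bag_of_words` helper and both patterns are hypothetical, not taken from the repository, and the actual preprocessing in front of LDA may well differ.

```python
import re
from collections import Counter

# Assumed pattern for mimetype-like strings (common top-level types only).
MIMETYPE = re.compile(
    r"\b(?:application|audio|image|message|multipart|text|video)"
    r"/[A-Za-z0-9](?:[A-Za-z0-9.+\-]*[A-Za-z0-9])?",
    re.IGNORECASE,
)

# Very plain tokenizer: runs of lowercase ASCII letters.
TOKEN = re.compile(r"[a-z]+")


def bag_of_words(text):
    """Count tokens after blanking out mimetypes, so fragments such as
    'application' or 'xml' from 'application/xml' never reach the model."""
    without_mimetypes = MIMETYPE.sub(" ", text)
    return Counter(TOKEN.findall(without_mimetypes.lower()))
```

Stripping before tokenizing matters here: tokenizing first would split each mimetype on the slash and leak its pieces into the vocabulary.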