b-cube / semantics-preprocessing

initial text preprocessors for the triplestore and feature classification

Bag of Words: mimetypes! #55

Closed by roomthily 9 years ago

roomthily commented 9 years ago

As fantastic as this topic is:

u'0.134*image/tiff + 0.134*image/svg+xml + 0.133*application/vnd.google-earth.kmz + 0.133*application/vnd.google-earth.kml+xml + 0.129*application/x-pdf + 0.050*application/vnd.ogc.gml + 0.049*text/plain + 0.048*text/html + 0.009*associate + 0.007*eng'

it is not so great for LDA results: the mimetype strings dominate the topic and crowd out the actual terms. Mimetype identification is moving to some other feature classification anyway.

So exclude mimetypes from the bag of words.
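
A minimal sketch of what excluding mimetypes before tokenizing could look like, assuming a regex-based approach. The pattern, the list of top-level types, and the name `strip_mimetypes` are hypothetical and are not taken from this repository.

```python
import re

# Hypothetical pattern (an assumption, not the repository's): a common
# top-level type, a slash, then a subtype made of letters, digits, ".", "+"
# or "-", ending on a letter or digit so trailing punctuation is left alone.
MIMETYPE_PATTERN = re.compile(
    r"\b(?:application|audio|font|image|message|model|multipart|text|video)"
    r"/[A-Za-z0-9](?:[A-Za-z0-9.+\-]*[A-Za-z0-9])?",
    re.IGNORECASE,
)


def strip_mimetypes(text, replacement=" "):
    """Replace anything that looks like a mimetype, then collapse whitespace."""
    cleaned = MIMETYPE_PATTERN.sub(replacement, text)
    return " ".join(cleaned.split())


blob = "download image/tiff or application/vnd.google-earth.kml+xml here"
print(strip_mimetypes(blob))
```

Run on the blob above, only the non-mimetype words survive, so the bag of words built afterwards no longer contains tokens like `image/tiff`.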

roomthily commented 9 years ago

See 5e73ca6 for two new methods: one replaces any mimetype found in a blob of text, and the other extracts known mimetypes (and also replaces them in the blob of text).

Note the included corpus is not complete. The final corpus will be based on the IANA set plus any geospatial mimetypes (not generally found in the IANA list) and any other related mimetypes.
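
For illustration, extraction against a known corpus might be sketched as follows. This is not the code from the commit: the function name `extract_known_mimetypes` is hypothetical, and `KNOWN_MIMETYPES` is a small stand-in built from the types in the topic above, not the IANA-based corpus.

```python
import re

# Stand-in corpus (an assumption): the real one would be the IANA set plus
# geospatial and other related mimetypes.
KNOWN_MIMETYPES = {
    "image/tiff",
    "image/svg+xml",
    "application/vnd.google-earth.kmz",
    "application/vnd.google-earth.kml+xml",
    "application/x-pdf",
    "application/vnd.ogc.gml",
    "text/plain",
    "text/html",
}


def extract_known_mimetypes(text, known=KNOWN_MIMETYPES):
    """Return (found mimetypes, text with those mimetypes removed)."""
    # Try longer types first so a type is not cut short by a shorter prefix.
    ordered = sorted(known, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w/])(?:" + "|".join(re.escape(m) for m in ordered) + r")(?![\w+/-])",
        re.IGNORECASE,
    )
    found = [m.lower() for m in pattern.findall(text)]
    remaining = " ".join(pattern.sub(" ", text).split())
    return found, remaining


found, remaining = extract_known_mimetypes(
    "Formats: image/tiff, text/html and application/x-shapefile."
)
print(found)
print(remaining)
```

Types missing from the corpus (here `application/x-shapefile`) are left in the text, which is why the completeness of the corpus matters for this method.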