b-cube / semantics-preprocessing

initial text preprocessors for the triplestore and feature classification

Bag of Words + Unicode Decode Unicode cruft returns #87

Open roomthily opened 9 years ago

roomthily commented 9 years ago

Honestly not sure if this ran before or after the implementation, but it's fantastic either way.

[image: unicode_cruft_fail]

Also, I really want to know where this came from, re: devil donuts.

roomthily commented 9 years ago

And whatever we want to call this:

[image: solr_cruft_fail]

Update: this is what happens when someone tries to embed math formulas in an email listserv.

roomthily commented 9 years ago

Fun fact: the cruft in the first image is emoji-related, and we don't handle that (or nutch doesn't).

The source file:

[image: we_cannot_handle_emoji]
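
For reference, a minimal sketch of one way emoji could be dropped before the bag-of-words step. This is an assumption for illustration, not what the preprocessor or nutch actually does: the function name `strip_emoji_cruft` is hypothetical, and it relies on the fact that most emoji sit outside the Basic Multilingual Plane (code points above U+FFFF).

```python
import unicodedata


def strip_emoji_cruft(raw, encoding="utf-8"):
    """Decode input and drop characters that tend to show up as cruft.

    Hypothetical helper: removes code points above U+FFFF (where most
    emoji live) plus surrogate, private-use, and unassigned characters.
    """
    # Accept either bytes or already-decoded text.
    if isinstance(raw, bytes):
        text = raw.decode(encoding, errors="replace")
    else:
        text = raw

    kept = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Outside the Basic Multilingual Plane: emoji and friends.
            continue
        if unicodedata.category(ch) in ("Cs", "Co", "Cn"):
            # Surrogates, private-use, and unassigned code points.
            continue
        kept.append(ch)

    # Collapse the whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


cleaned = strip_emoji_cruft("devil \U0001F608 donuts".encode("utf-8"))
```

This throws the emoji away rather than supporting them, and it misses the few emoji that live inside the Basic Multilingual Plane, so it is a workaround at best.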

roomthily commented 9 years ago

Should be okay with the unicode/decode here: b429761.

Honestly quite disappointed in the lack of emoji support :(.