Sotera / newman

Quickly analyze and explore email with advanced analytics and visualization.
http://sotera.github.io/newman/
Apache License 2.0

Merlin - bring back auto-language detection #121

Open smahoney58 opened 6 years ago

smahoney58 commented 6 years ago

Newman used to be able to auto-detect the language used in each email and index it appropriately. Now the user has to pick the language before ingesting. Problems with this include:

1 - How do you know what language is used in the email before you ingest?
2 - It only works if there are just two languages in the email dataset (i.e., email datasets that have English, Spanish, and Chinese emails can't be processed, since you can only pick one other language).

Currently, the only other language supported is Spanish. Issue #120 is the request to support other languages.

In general, how version 4.x handles multiple languages needs to be re-designed and re-implemented. Almost every dataset we have ingested includes multiple languages.
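
To illustrate the kind of per-email detection being requested, here is a minimal sketch in Python. It is not Newman's earlier implementation, and every name in it (`detect_language`, `group_by_language`, the `id`/`body` fields, the tiny stopword lists) is hypothetical. It assumes a naive stopword-count heuristic plus a CJK character check; a real redesign would more likely use a trained language-identification library. The point is only that the language is decided per email at ingest time rather than once per dataset.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword lists (assumption: not taken from Newman).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "for", "you", "with"},
    "es": {"el", "la", "los", "de", "que", "y", "en", "para", "con", "por"},
}

CJK = re.compile(r"[\u4e00-\u9fff]")   # common CJK unified ideographs
WORD = re.compile(r"[^\W\d_]+")        # runs of Unicode letters


def detect_language(text, default="en"):
    """Guess a language code for one email body (hypothetical helper)."""
    if not text:
        return default
    # Treat text with a large share of CJK ideographs as Chinese.
    cjk_count = len(CJK.findall(text))
    non_space = sum(not c.isspace() for c in text)
    if cjk_count > 0 and cjk_count >= 0.3 * non_space:
        return "zh"
    # Otherwise score each language by how many of its stopwords appear.
    words = [w.lower() for w in WORD.findall(text)]
    scores = {
        lang: sum(w in stops for w in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def group_by_language(emails):
    """Map each detected language code to the ids of the emails using it."""
    groups = defaultdict(list)
    for email in emails:
        groups[detect_language(email["body"])].append(email["id"])
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"id": 1, "body": "The meeting is in the morning and you need the report"},
        {"id": 2, "body": "La reunión es por la mañana y el informe es para los clientes"},
        {"id": 3, "body": "会议在明天上午举行"},
    ]
    print(group_by_language(sample))
```

With grouping like this, each email could be routed to a language-specific analyzer or index, so a dataset mixing English, Spanish, and Chinese would not require the user to choose a single extra language up front.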