Closed SriniNaga closed 6 years ago
@SriniNaga you should at least normalize the text, for example:
```shell
normalize_text() {
  sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
      -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' \
      -e 's/\!/ \! /g' -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' \
      -e 's/\*/ /g' -e 's/|/ /g' -e 's/«/ /g' | tr 0-9 " "
}
```
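To see what this normalization does, here is a quick standalone demonstration. It uses a condensed variant of the function (renamed `normalize_demo`, covering only the rules this example exercises), so the snippet runs on its own; the input sentence is made up:

```shell
#!/bin/sh
# Condensed variant of normalize_text above: space out apostrophes, periods,
# commas and exclamation marks, then blank out digits.
normalize_demo() {
  sed -e "s/'/ ' /g" -e 's/\./ \. /g' -e 's/, / , /g' -e 's/\!/ \! /g' | tr 0-9 " "
}

# tr -s squeezes the repeated spaces left behind by the substitutions.
echo "Hello, world! It's test no. 42" | normalize_demo | tr -s ' '
```

Punctuation becomes separate tokens and digits disappear, so the model sees a token stream closer to what it was trained on.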
Hi @SriniNaga,
Thank you for reporting this issue.
We did not apply any pre-processing on the data before training the language identifier. The motivation is that the pre-processing that should be applied to a piece of text depends on the language (e.g. tokenization is language dependent). For this reason, the language identification is sensitive to punctuation.
One of the issues here is that the text is very short, which makes the predictions of our model less reliable. We might release in the future a new version of the model which works better on very short texts.
Best, Edouard.
@SriniNaga @EdouardGrave I'm just adding that we have successfully replaced both CLD2 and the Google Translate API (for detection) with this model, since the accuracy on each language pair, and even on mixed-language text, was the same or better. For this task, at inference I apply text pre-processing like the function above, plus a few other steps such as diacritics removal and double-byte detection for Chinese/Japanese, etc. Also, in our case the input at inference is a document, not short sentences (like tweets, etc.).
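For reference, here is a minimal sketch of the two extra steps mentioned above; both helpers are my own simplifications, not part of fastText. Diacritics removal relies on GNU iconv transliteration, and the CJK check is a crude byte-range heuristic (CJK Unified Ideographs U+4E00–U+9FFF encode in UTF-8 with lead bytes 0xE4–0xE9, so a byte-level grep catches most Chinese/Japanese ideographs, though not kana):

```shell
#!/bin/sh
# Strip diacritics by transliterating to ASCII (GNU iconv; behavior can
# depend on locale and iconv implementation).
strip_diacritics() {
  iconv -f UTF-8 -t ASCII//TRANSLIT
}

# Rough heuristic: match any UTF-8 lead byte in 0xE4-0xE9 (octal 344-351),
# the range used by CJK Unified Ideographs.
contains_cjk() {
  LC_ALL=C grep -q "$(printf '[\344-\351]')"
}

echo "Où est le café ?" | strip_diacritics
if echo "日本語のテキスト" | contains_cjk; then echo "CJK detected"; fi
```

In practice you would route CJK documents past the diacritics step and feed everything through `normalize_text` before prediction.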
NOTE: To be accurate, I'm aware of a newer, neural-network-based (word-embedding) successor to CLD2, namely CLD3, but I haven't tried it yet. Still, I think this fastText model with language pre-processing is absolutely a good choice!
Hope this helps!
@loretoparisi @EdouardGrave Thank you! Most of our texts are short (ranging from 20 to 2000 characters). For long texts the accuracy is good; the problem is only with short texts.
> One of the issues here is that the text is very short, which makes the predictions of our model less reliable. We might release in the future a new version of the model which works better on very short texts.
Hi @EdouardGrave , any news on this?
Test text: "blast mango blast it" → detected: fr
Test text: "blast, mango blast it" → detected: en
Does it require any special preprocessing before language detection?
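One simple mitigation to try (my own suggestion, not an official recommendation) is to strip punctuation and collapse whitespace before detection, so the two inputs above become identical and can no longer be classified differently. The `fasttext` invocation at the end assumes the pre-trained `lid.176.bin` model and the CLI's ability to read stdin via `-`:

```shell
#!/bin/sh
# Remove punctuation and squeeze spaces so that
# "blast, mango blast it" cleans to "blast mango blast it".
clean() {
  tr -d '[:punct:]' | tr -s ' '
}

a=$(echo "blast mango blast it" | clean)
b=$(echo "blast, mango blast it" | clean)
test "$a" = "$b" && echo "identical after cleaning"

# With the fastText CLI (not run here; model path is an assumption):
# echo "blast, mango blast it" | clean | ./fasttext predict lid.176.bin -
```

This throws away punctuation signal entirely, which is fine for language ID on such short strings but would be too aggressive for other tasks.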