bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Lowercase before langid #42

Open ZJaume opened 11 months ago

ZJaume commented 11 months ago

If we are going to use FastText, we should lowercase the text before language identification. At least with the official lid.176 model, uppercased text completely breaks identification for mid- and low-resource languages: they are always identified as the highest-resource language of the script (Russian for Cyrillic, English/Spanish/French for Latin).
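For illustration, here is a minimal Python sketch of what lowercasing before prediction could look like. It is not how warc2text does it (warc2text is not written in Python). The helper names `normalize_for_langid` and `identify_language` are made up for this example, and `model` is assumed to be any object with a fastText-style `predict(text, k)` method that returns a `(labels, probabilities)` pair.

```python
def normalize_for_langid(text: str) -> str:
    """Collapse all whitespace (including newlines) to single spaces and lowercase.

    fastText's Python predict() rejects input containing newlines, so they are
    removed here as well.
    """
    return " ".join(text.split()).lower()


def identify_language(model, text: str, k: int = 1):
    """Return up to k (language, probability) pairs for the lowercased text.

    `model` is an assumption: anything exposing predict(text, k=...) that
    returns (labels, probabilities) with labels such as "__label__en".
    """
    labels, probs = model.predict(normalize_for_langid(text), k=k)
    return [
        (label.replace("__label__", "", 1), float(prob))
        for label, prob in zip(labels, probs)
    ]


# Possible usage, assuming the fasttext package and a model file are available:
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   identify_language(model, "ALL UPPER CASE TEXT FROM THE WEB")
```

Note that Python's `str.lower()` is locale-independent, so language-specific casing rules (for example the Turkish dotted and dotless i) are not handled by this sketch.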

jelmervdl commented 11 months ago

@laurieburchell do you have an opinion on this?

Ideally if this is the case, this would be a part of the model, and not an option inside warc2text, as it would be really hard to keep track of which model benefits from it, and which doesn't.

On the other hand, I can also understand that the web is kind of garbage and there's a lot of ALL UPPER CASE text out there that's not in the training data and so doesn't match any n-grams in the model.

Maybe we should train a model explicitly on all-lowercase text and see whether that degrades performance a lot. If it doesn't, we could indeed just always classify on lowercased text?
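As a rough sketch of the data preparation such an experiment would need: assuming the training file uses fastText's supervised format (one example per line, labels prefixed with `__label__`), the text could be lowercased while leaving the labels untouched. The function name `lowercase_training_line` and the example label are hypothetical.

```python
def lowercase_training_line(line: str, label_prefix: str = "__label__") -> str:
    """Lowercase every token except label tokens.

    Labels are kept as-is because they may be case-sensitive
    (e.g. a hypothetical "__label__rus_Cyrl").
    """
    tokens = line.split()
    return " ".join(
        tok if tok.startswith(label_prefix) else tok.lower()
        for tok in tokens
    )


def lowercase_training_file(src_path: str, dst_path: str) -> None:
    """Write a lowercased copy of a fastText-format training file."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(lowercase_training_line(line) + "\n")
```

Training one model on the original file and one on the lowercased copy, then evaluating both on the same (suitably preprocessed) test set, would show whether lowercasing costs any accuracy.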

laurieburchell commented 11 months ago

I would suggest building the lowercasing into the LID model if possible - apart from anything else, it helps deal with feature sparsity.
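A toy illustration of the sparsity point, under simplifying assumptions: it counts plain character trigrams, whereas fastText hashes subword n-grams with word-boundary markers, so this only shows the general effect.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams in text (no boundary markers)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# Three casing variants of the same word, as they might appear on the web.
variants = ["Hello", "HELLO", "hello"]

# Distinct trigrams when case is preserved vs. when everything is lowercased.
cased = set().union(*(char_ngrams(v) for v in variants))
lowered = set().union(*(char_ngrams(v.lower()) for v in variants))

print(len(cased), len(lowered))  # prints "7 3"
```

With case preserved, the same word spreads its evidence over more than twice as many features, and the all-uppercase ones are likely to be rare or absent in the training data.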