Closed · schwittlick closed this issue 7 years ago
one approach would be to detect the language of each sentence while parsing, and separate the sentences by language:
language detection should be pretty stable with this module; it supports 55 languages out of the box:
# taken from https://github.com/Mimino666/langdetect
# available for python 2.* and 3.*
>>> from langdetect import detect, detect_langs
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
just started a new parsing batch; for now I'm skipping any sentences that are not in English..
not sure what to do about the rest. training models on several languages at the same time creates a problem for the neural nets.. maybe we should separate the data by language, by hand?