Schwittleymani / ECO

Electronic Chaos Oracle
https://schwittlick.net/eco
Apache License 2.0

PDFs contain lots of German, Spanish and French #173

Closed schwittlick closed 7 years ago

schwittlick commented 7 years ago

Not sure what to do about it. Training models on several languages at the same time creates a bit of a problem for the neural nets. Maybe we should separate the texts by language, by hand?

schwittlick commented 7 years ago

One approach would be to detect the language of each sentence while parsing it, and to separate the sentences by language.

Language detection should be pretty stable with this module; it supports 55 languages by default:

# taken from https://github.com/Mimino666/langdetect
# available for python 2.* and 3.*
>>> from langdetect import detect, detect_langs
>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'
>>> # detect_langs returns candidate languages with probabilities;
>>> # the exact numbers can vary between runs on short text
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]

schwittlick commented 7 years ago

Just started a new parsing batch. Not using any non-English sentences for now.
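
For reference, the per-sentence separation described above could look roughly like the sketch below. This is not the code the ECO parser actually uses. The helper names `detect_with_langdetect` and `split_by_language` and the `"unknown"` bucket are made up for illustration; only `detect`, `DetectorFactory.seed` and `LangDetectException` come from langdetect itself.

```python
from collections import defaultdict


def detect_with_langdetect(text):
    """Return a language code such as 'en' or 'de', or None if detection fails."""
    # Imported lazily so split_by_language can be used with another detector
    # when langdetect is not installed.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes runs repeatable.
    DetectorFactory.seed = 0
    try:
        return detect(text)
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits).
        return None


def split_by_language(sentences, detect_fn=detect_with_langdetect, fallback="unknown"):
    """Group sentences into a dict keyed by detected language code.

    Hypothetical helper: blank sentences are skipped, and sentences for which
    detect_fn returns None end up under the fallback key.
    """
    buckets = defaultdict(list)
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue
        lang = detect_fn(text) or fallback
        buckets[lang].append(text)
    return dict(buckets)
```

Keeping only English sentences, as in the current batch, would then be `split_by_language(sentences).get("en", [])`. Passing the detector in as `detect_fn` is just a design choice that makes it easy to swap in a different detector or a stub.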