Open PJ-Finlay opened 3 years ago
This would be pretty useful for any automated translation mechanism!
Interesting, I think using the same pipeline would be a good long-term solution, but this could be something to do in the meantime. One issue with using the pipeline is that as soon as we add a new language we also have to retrain the detector. This would probably also be lighter weight than a 100MB model file. The main interest for this is currently from LibreTranslate, so if someone wants to extend the Python API to use this, that would be welcome, and the API could then be reimplemented in the future if it makes sense.
Some support was added to LibreTranslate in https://github.com/uav4geo/LibreTranslate/pull/12
Recently I saw an article comparing language detection tools. FastText can be a viable alternative to langdetect because it is a lot faster.
There is another option that can be quite accurate for longer texts: n-grams. There are predetermined n-gram lists for all supported languages, and it is easy to generate new ones. The advantages of this approach are that the models are really small, the implementation is easy, and it does not need any extra library. In any case, if help is needed, I can implement these.
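For anyone curious what the n-gram approach could look like with only the standard library, here is a minimal sketch using character trigram profiles and a rank-based "out-of-place" distance. The function names, the `MISS_PENALTY` constant, and the tiny sample sentences are all made up for illustration; real profiles would be built from much larger corpora, and nothing here is part of Argos Translate.

```python
from collections import Counter

MISS_PENALTY = 300  # assumed cost for an n-gram missing from a language profile


def ngram_profile(text, n=3, top=MISS_PENALTY):
    """Return the most frequent character n-grams of text, most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language cost MISS_PENALTY."""
    ranks = {gram: rank for rank, gram in enumerate(lang_profile)}
    return sum(
        abs(rank - ranks[gram]) if gram in ranks else MISS_PENALTY
        for rank, gram in enumerate(doc_profile)
    )


def detect(text, profiles):
    """Return the language code whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda code: out_of_place(doc, profiles[code]))


# Toy training samples (illustrative only, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog while the children "
          "are reading a book in the garden",
    "es": "el rápido zorro marrón salta sobre el perro perezoso mientras "
          "los niños leen un libro en el jardín",
    "de": "der schnelle braune fuchs springt über den faulen hund während "
          "die kinder im garten ein buch lesen",
}
PROFILES = {code: ngram_profile(sample) for code, sample in SAMPLES.items()}

print(detect("the children are reading a book", PROFILES))
```

The profiles are just ranked lists of short strings, so they stay small and could be stored as plain text files per language.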
@hollorol If you can do this with just the Python standard library, a pull request would be appreciated.
@PJ-Finlay, I'll do it only for the CLI, because I don't use the GUI part of the program, but I guess adapting it to the GUI afterwards will be easy.
That sounds good. It should probably be its own file/module that can be integrated into the CLI.
Lingua might be useful for this. Lingua is written in Python, works with short strings, works offline, and is licensed under Apache-2.0.
LibreTranslate already has a system for language detection so this hasn't been a priority. My plan was to use CTranslate2 models to map input text into a language code, but I'm open to suggestions.
Not everyone uses LibreTranslate.
The way Argos Translate currently works, it would be a breaking change to add this, but I'm planning to add it in the next major version. It would also be possible to add language detection to the GUI (which is in a separate repo) using a third-party library like Lingua.
I could see it being used like a special input that would trigger the language detection. Syntax could be something like this:
echo "Text to translate" | argos-translate --from-lang auto-detect --to-lang en
This is the way to do it for core Argos Translate; the only thing I might change is "detect" instead of "auto-detect".
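To make the "special input" idea concrete, here is a rough sketch of how a CLI could treat `detect` as a sentinel value for `--from-lang`. This is not the actual Argos Translate CLI code: `DETECT`, `build_parser`, `resolve_from_lang`, and the `detector` callback are hypothetical names, and the detector itself is left as a pluggable function.

```python
import argparse

DETECT = "detect"  # hypothetical sentinel, not an existing Argos Translate option


def build_parser():
    """Build a toy parser mirroring the flags used in the example above."""
    parser = argparse.ArgumentParser(prog="argos-translate-sketch")
    parser.add_argument("--from-lang", required=True,
                        help="source language code, or 'detect' to guess it")
    parser.add_argument("--to-lang", required=True,
                        help="target language code")
    return parser


def resolve_from_lang(from_lang, text, detector):
    """Return a concrete language code, calling detector only for the sentinel."""
    if from_lang == DETECT:
        return detector(text)
    return from_lang


if __name__ == "__main__":
    args = build_parser().parse_args(["--from-lang", "detect", "--to-lang", "en"])
    # Stand-in detector that always answers "es"; a real one would inspect the text.
    print(resolve_from_lang(args.from_lang, "Texto para traducir", lambda text: "es"))
```

Keeping the sentinel handling in one small function means existing calls that pass an explicit language code behave exactly as before.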
The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts and what is hopefully a small code change to support it. I'm fairly optimistic that this would work well out of the box, but it may take some tweaking.
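As a rough illustration of the data generation step, the sketch below turns per-language sentence lists into parallel source and target lines, where each target is just the language code. It assumes the training scripts consume aligned source/target lines; `make_detection_pairs` and the input layout are hypothetical and not taken from the existing training scripts.

```python
import random


def make_detection_pairs(corpora, seed=0):
    """Build aligned (sources, targets) lists from {language_code: [sentences]}.

    Each source line is a sentence and the matching target line is its
    language code. Blank sentences are skipped, and the pairs are shuffled
    so that languages are interleaved.
    """
    pairs = [
        (sentence.strip(), code)
        for code, sentences in corpora.items()
        for sentence in sentences
        if sentence.strip()
    ]
    random.Random(seed).shuffle(pairs)  # seeded for reproducible output
    sources = [text for text, _ in pairs]
    targets = [code for _, code in pairs]
    return sources, targets


if __name__ == "__main__":
    sources, targets = make_detection_pairs({
        "en": ["Hello world", "  "],
        "es": ["Hola mundo"],
    })
    for text, code in zip(sources, targets):
        print(f"{text}\t{code}")
```

Adding a new language would then mean regenerating these lines and retraining, which is the retraining cost mentioned earlier in the thread.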