argosopentech / argos-translate

Open-source offline translation library written in Python
https://www.argosopentech.com
MIT License
3.81k stars 278 forks

Support Language Detection #9

Open PJ-Finlay opened 3 years ago

PJ-Finlay commented 3 years ago

The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts and what is hopefully a pretty small code change to support this. I'd be pretty optimistic about this just working pretty well out of the box but it may take some tweaking.
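
A minimal sketch of what generating this data could look like, assuming the detector is trained like a translation model whose "target sentence" is just the language code. The function names, the `corpora` layout, and the two-file output format are illustrative assumptions, not part of the existing training scripts.

```python
import random


def build_detection_pairs(corpora, seed=0):
    """Turn {language_code: [sentences]} into shuffled (text, code) pairs.

    `corpora` is a hypothetical input layout chosen for this sketch.
    """
    pairs = []
    for code, sentences in corpora.items():
        for sentence in sentences:
            # Collapse whitespace so each example stays on a single line.
            text = " ".join(sentence.split())
            if text:
                pairs.append((text, code))
    # Shuffle deterministically so languages are interleaved.
    random.Random(seed).shuffle(pairs)
    return pairs


def write_parallel_files(pairs, source_path, target_path):
    """Write line-aligned files: input text in one, language code in the other."""
    with open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for text, code in pairs:
            src.write(text + "\n")
            tgt.write(code + "\n")
```

Line N of the source file then pairs with line N of the target file, which is the usual shape of parallel training data.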

pierotofy commented 3 years ago

This would be pretty useful for any automated translation mechanism!

PJ-Finlay commented 3 years ago

Interesting, I think using the same pipeline would be a good long-term solution, but this could be something to do in the meantime. One issue with using the pipeline is that as soon as we add a new language we also have to retrain the detector. This would probably also be lighter weight than a 100MB model file. The main interest for this is currently from LibreTranslate, so if someone wants to extend the Python API to use this, that would be welcome, and the API could then be reimplemented in the future if it makes sense.

thomas536 commented 3 years ago

Some support was added to LibreTranslate in https://github.com/uav4geo/LibreTranslate/pull/12

hollorol commented 2 years ago

Recently I saw an article comparing language detection tools. FastText can be a viable option instead of langdetect, because it is a lot faster.

We have another option which can be quite accurate for longer texts: n-grams. There are predetermined n-gram lists for all supported languages, and it is easy to generate new ones. The advantages of this approach are that the models are really small, the implementation is easy, and it does not need any extra library. In any case, if help is needed, I can implement this.
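
For reference, the n-gram idea can be sketched with only the Python standard library. This is a rough illustration using ranked character n-gram profiles and an "out-of-place" distance, not a proposed implementation. The function names, the profile size, and the tiny sample texts are assumptions made for the example. Real profiles would be built from much larger corpora.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the text's most frequent character n-grams (1..n_max), most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip():  # ignore n-grams made only of spaces
                counts[gram] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )


def detect_language(text, profiles):
    """Return the language code whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda code: out_of_place_distance(doc_profile, profiles[code]))


# Toy training samples, far too small for real use (illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language we want to detect",
    "es": "el veloz zorro marrón salta sobre el perro perezoso y este es el idioma que queremos detectar",
}
PROFILES = {code: ngram_profile(sample) for code, sample in SAMPLES.items()}
```

Calling `detect_language(some_text, PROFILES)` returns one of the keys of `PROFILES`. Accuracy depends heavily on the size of the training samples and the length of the input, which matches the point above that n-grams work best on longer texts.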

PJ-Finlay commented 2 years ago

@hollorol If you can do this with just the Python standard library, a pull request would be appreciated.

hollorol commented 2 years ago

@PJ-Finlay, I'll do it only for the CLI, because I don't use the GUI part of the program, but I guess adapting it to the GUI afterwards will be easy.

PJ-Finlay commented 2 years ago

That sounds good. It should probably be its own file/module that can be integrated into the CLI.

TechnologyClassroom commented 2 years ago

Lingua might be useful for this. Lingua is written in Python, works with short strings, works offline, and is licensed under Apache-2.0.

PJ-Finlay commented 2 years ago

LibreTranslate already has a system for language detection, so this hasn't been a priority. My plan was to use CTranslate2 models to map input text to a language code, but I'm open to suggestions.

TechnologyClassroom commented 2 years ago

Not everyone uses LibreTranslate.

PJ-Finlay commented 2 years ago

The way Argos Translate currently works, it would be a breaking change to add this, but I'm planning to add it in the next major version. It would also be possible to add language detection to the GUI (which is in a separate repo) using a third-party library like Lingua.

TechnologyClassroom commented 2 years ago

I could see it being used like a special input that would trigger the language detection. Syntax could be something like this:

echo "Text to translate" | argos-translate --from-lang auto-detect --to-lang en

PJ-Finlay commented 2 years ago

This is the way to do it for core Argos Translate, the only thing I might change is "detect" instead of "auto-detect".
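
A minimal sketch of how a "detect" sentinel could be handled, assuming a standalone argparse wrapper rather than the actual Argos Translate CLI code. `DETECT`, `build_parser`, `resolve_from_lang`, and the `detector` callable are hypothetical names used only for illustration.

```python
import argparse

# Sentinel value proposed in this thread; an assumption, not an existing option value.
DETECT = "detect"


def build_parser():
    """Build a small stand-in parser with the two flags used in the example above."""
    parser = argparse.ArgumentParser(prog="argos-translate")
    parser.add_argument("--from-lang", required=True,
                        help='source language code, or "detect" to guess it from the text')
    parser.add_argument("--to-lang", required=True, help="target language code")
    return parser


def resolve_from_lang(from_lang, text, detector):
    """Return an explicit language code, calling `detector(text)` only for the sentinel."""
    if from_lang == DETECT:
        return detector(text)
    return from_lang
```

Keeping the detector as a plain callable means the same sentinel handling would work whether detection ends up being an n-gram module, Lingua, or a trained model.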