On-the-fly language detection & warning

yk commented 1 year ago

Many users enter text in a language other than the one they have selected in the language selector. Sometimes that's OK, for example when translating, or when I get a prompt in German while I'm set to English and can still answer it. But we should prevent accidental submission of mismatched-language data, especially since we are going to filter out this data in the near future, and it would be a shame if lots of work got thrown out.

The idea: build an on-the-fly language detector, then warn the user that the language of their input does not seem to match the language that they selected, and ask them to either switch their language selector, or to confirm that they indeed want to submit their text. Maybe we only need this in the "initial prompt" task.
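A minimal sketch of the warn-or-confirm check described above, in Python for illustration only. The names `check_language` and `toy_detect` are hypothetical, and the keyword-based detector is a stand-in assumption; a real detector (for example one of the libraries discussed below) would be plugged in through the `detect` argument.

```python
from typing import Callable, Optional


def toy_detect(text: str) -> Optional[str]:
    """Hypothetical stand-in detector: guesses from a few stop words."""
    words = set(text.lower().split())
    if words & {"der", "das", "und", "ist", "nicht"}:
        return "de"
    if words & {"the", "and", "is", "of", "not"}:
        return "en"
    return None  # not enough evidence to guess


def check_language(
    text: str,
    selected_lang: str,
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> Optional[str]:
    """Return a warning if the detected language differs from the selection.

    Returns None when the languages match or the detector cannot decide,
    so the user is only interrupted on a positive mismatch.
    """
    detected = detect(text)
    if detected is None or detected == selected_lang:
        return None
    return (
        f"Your text looks like '{detected}', but '{selected_lang}' is selected. "
        "Switch the language selector or confirm to submit anyway."
    )
```

Returning `None` for undecided cases is a design assumption: it keeps short or ambiguous inputs from triggering a warning.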

andreaskoepf commented 1 year ago

Some research on language detection has already been done by @MattAlexMiracle in language_classification.py.

yk commented 1 year ago

I don't think there's a need to implement that ourselves: https://github.com/pemistahl/lingua-py https://github.com/Mimino666/langdetect

AbdBarho commented 1 year ago

Would it make sense to look for a JS-based solution that runs in the browser? It would save a lot of compute.

MattAlexMiracle commented 1 year ago

I looked into langdetect at the beginning, but it didn't perform that well on small-ish texts (though it's better than e.g. https://github.com/saffsd/langid.py, which I also looked at). Mostly, langdetect is a monster that is half Java (where the original library comes from) and half Python interface, which would make it hard to upgrade should we need to.

lingua-py is interesting since their focus is specifically on short texts, so that might be worth a look. At the moment I'm trying to model everything on a rolling window of ten words, since that would allow us to work with language switches (e.g. for translation), but we could do the same with lingua-py.

However, I really like @AbdBarho's idea of doing it directly in the browser, though not necessarily for compute, but rather for instant feedback: if the user has "language A" selected but starts typing in "language B", we can ask them whether they would like to change the language. This is probably the best way to deal with mixed-language content and would also fix the problem of people just forgetting to set the proper language: if you ask them whether to change the language and they say "yes", then the language is correct for the next one. If they say "no", then it was e.g. a misdetection or a translation prompt.

However, I'm not aware of any library that does this, nor do I have the JavaScript experience to write one myself. Maybe @AbdBarho knows something about this?

kpoeppel commented 1 year ago

I had a quick look into some JavaScript libraries:
https://www.npmjs.com/package/languagedetect
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/i18n/detectLanguage
https://github.com/loretoparisi/fasttext.js#language-identificaton-server
https://github.com/chattylabs/language-detector
https://github.com/loretoparisi/fastLangID

However, most of these do not seem to be actively maintained.

The one I would recommend based on the maintenance status is: https://github.com/wooorm/franc

I can also try to implement it in our frontend.

Update: I opened a PR resolving this issue using the lande library https://github.com/fabiospampinato/lande.

Once 10 words have been entered, it detects the most probable language, then opens a modal asking for a language change. The prompter can then either ignore the message, which doubles the word limit before language detection is triggered again, or set a cookie to disable language detection. So far there is no way to re-enable language detection. Is that in line with the issue's intentions?
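The trigger logic described above can be sketched roughly as follows. This is an illustration in Python, not the code from the PR: the names `DetectionState`, `should_detect`, `on_ignore`, and `on_disable` are hypothetical, the `disabled` flag stands in for the cookie, and treating "after 10 words" as "at least 10 words" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DetectionState:
    word_limit: int = 10    # words required before detection runs
    disabled: bool = False  # stands in for the "disable detection" cookie


def should_detect(state: DetectionState, text: str) -> bool:
    """Run detection only if enabled and enough words have been typed."""
    return not state.disabled and len(text.split()) >= state.word_limit


def on_ignore(state: DetectionState) -> None:
    """User dismissed the modal: double the limit before asking again."""
    state.word_limit *= 2


def on_disable(state: DetectionState) -> None:
    """User opted out: no further detection until the flag is cleared."""
    state.disabled = True
```

A re-enable action would, under these assumptions, only need to reset `disabled` to `False` and `word_limit` to its initial value.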

Jourdelune commented 1 year ago

Firefox Translations (https://github.com/mozilla/firefox-translations) detects the language in the browser using fastText language identification. fastText is also the fastest model available (with good accuracy).

MattAlexMiracle commented 1 year ago

I also looked into lingua-py. The accurate mode is a little slow compared to my simple model (which is a simple trigram -> PCA(component=400) -> SVM(L1 penalty, linear kernel, C=1.0)). The fast mode works decently well, though it's also slower than my method, presumably because I got my model down to 70% sparsity without having to prune anything and because mine can work on batches, while I did not find a way to do this for lingua. The SVM approach takes 5s on my test set, while the "fast" version of lingua takes 221s (while performing a little worse). I don't know how important that is in practice, since I don't know how many judgments we will have to make (if it's not that many, the accuracy-focused lingua-py currently beats my approach). If possible, we should integrate into the frontend as @kpoeppel suggests.
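For readers unfamiliar with the first stage of such a pipeline, here is a small sketch of trigram feature extraction. It assumes character-level trigrams with space padding, which the comment above does not specify; the function name `char_trigrams` is hypothetical, and the PCA and SVM stages (which would normally come from a library such as scikit-learn) are not shown.

```python
from collections import Counter


def char_trigrams(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-padded text.

    Padding with one space on each side lets word-initial and word-final
    character patterns show up as their own trigrams.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))
```

The resulting counts would then be turned into a fixed-length vector over a trigram vocabulary before any dimensionality reduction or classification.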

@kpoeppel What I currently do is use a rolling window of 10-word sequences to get a distribution of languages that might be in the text. This helps with multi-language inputs, but may also be unnecessary if you give immediate language feedback from the frontend, which would "catch" multi-language input just as well. (Getting the distribution is probably more relevant if you want to wait until the full prompt is sent.)
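A rough sketch of the rolling-window idea, assuming a pluggable per-window detector. The function name `language_distribution` and the `detect` callback are hypothetical; this is not the code in language_classification.py, only an illustration of how window-level votes could be turned into a distribution.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def language_distribution(
    text: str,
    detect: Callable[[str], Optional[str]],
    window: int = 10,
) -> Dict[str, float]:
    """Slide a fixed-size word window over the text and tally detections.

    Returns the share of windows assigned to each language. Texts shorter
    than `window` words are treated as a single window.
    """
    words = text.split()
    if not words:
        return {}
    n_windows = max(1, len(words) - window + 1)
    votes: Counter = Counter()
    for start in range(n_windows):
        lang = detect(" ".join(words[start:start + window]))
        if lang is not None:  # skip windows the detector cannot classify
            votes[lang] += 1
    total = sum(votes.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in votes.items()}
```

A text that switches language partway through (for example a translation prompt) would show up as two languages with substantial shares, rather than a single label.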

andreaskoepf commented 1 year ago

@kpoeppel if possible please consider joining the OA discord (ping me).

dufoli commented 1 year ago

Why not check the JavaScript property navigator.language? If it is not the same as the language of the sentence, ask the user to confirm the language.

yk commented 1 year ago

Closing this since #1071 has been merged.

LAION-AI / Open-Assistant

On-the-fly language detection & warning #997