LibreTranslate / LibreTranslate

Free and Open Source Machine Translation API. Self-hosted, offline capable and easy to setup.
https://libretranslate.com
GNU Affero General Public License v3.0
9.7k stars 876 forks

English -> Turkish translation results in inappropriate websites ? #141

Closed (CaptainCaptcha closed 2 years ago)

CaptainCaptcha commented 3 years ago

Hi, I am really confused right now. If you enter meaningless words like random characters, the English to Turkish translation gives some really weird results.

Here they are:

[Five screenshots of the English -> Turkish translation results, taken 2021-09-23]

I did not manipulate the website in any way. You can try it by going to https://libretranslate.com/, selecting English -> Turkish, and typing the things I entered above. I was about to use this in a project of mine, but luckily I saw this first. How is this even possible? When I write something meaningless like "aaaa", the output should be "aaaa", not "xHamster". I'm both angry and amused right now. This is ridiculous. Please fix this and ban the contributors of these translations.

pierotofy commented 3 years ago

Ah, interesting! It might have to do with the quality of the training data used for the Turkish model.

CaptainCaptcha commented 3 years ago

@pierotofy Yeah, looks like some idiots trolled the data for the Turkish model. You might want to use another one. Thanks for the response.

PJ-Finlay commented 3 years ago

Well wasn't expecting this...

argosopentech@dev:~/turkish$ grep -n -i xhampster *.en
CCAligned.en-tr.en:5169773:Hen*** Xhampster 
CCAligned.en-tr.en:5633651:xHampster 
CCAligned.en-tr.en:5953336:Xhampster 
CCAligned.en-tr.en:6462245:www xhampster 
CCAligned.en-tr.en:6890391:86 xHampster 
CCAligned.en-tr.en:7864361:86 xHampster 
CCAligned.en-tr.en:8557318:m**ure a**l xhampster 
CCAligned.en-tr.en:8778707:86 xHampster 
CCAligned.en-tr.en:9356316:64 xHampster 
CCAligned.en-tr.en:9828279:xhampster 
CCAligned.en-tr.en:13434664:sallow skirt - unforeseen synopsis be proper of xhampster 

I don't think it was malicious though. It looks like it came from the Turkish data in CCAligned, which is scraped from the internet automatically. There is a lot of porn on the internet, so it's probably not too surprising that this ended up in the data.

We could add "xhampster" to the list of filtered phrases, but until there's a more robust way of dealing with profanity it's probably not worth it. Training separate models without profanity is possible, but in some situations we may want to translate it correctly too. Generating profanity when it isn't given as an input definitely isn't ideal though.
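
The last case, profanity in the output that was not in the input, could in principle be caught after translation. Below is a minimal sketch of such a check. `BLOCKLIST`, `blocked_terms`, and `introduces_blocked_term` are hypothetical names made up for illustration; none of them are part of LibreTranslate or Argos Translate.

```python
import re

# Hypothetical blocklist for illustration; not the project's actual filter list.
# Entries are assumed to be lowercase single words.
BLOCKLIST = {"xhampster"}


def blocked_terms(text, blocklist=BLOCKLIST):
    """Return the blocklisted words that occur in text, ignoring case."""
    words = set(re.findall(r"\w+", text.lower()))
    return words & blocklist


def introduces_blocked_term(source, translation, blocklist=BLOCKLIST):
    """Return True if the translation contains a blocked word absent from the source."""
    new_terms = blocked_terms(translation, blocklist) - blocked_terms(source, blocklist)
    return bool(new_terms)
```

With this check, an input of "aaaa" translated to "xHampster" would be flagged, while an input that already contains the term would not, so deliberate translation of profanity still goes through. It only works for terms spelled the same in both languages (site names, for example) and would need a per-language list for anything else.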

PJ-Finlay commented 3 years ago

Looking more, there's actually quite a bit of profanity in the data. CCAligned seems to have much more than WikiMatrix and some other datasets.

We already filter some words out before training and could add more if this becomes a problem.
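
The thread does not show how the existing filter works, so the following is only a sketch of one way to filter a line-aligned parallel corpus before training. The function name `filter_parallel` and the blocklist contents are assumptions for illustration.

```python
def filter_parallel(src_lines, tgt_lines, blocklist):
    """Yield (source, target) pairs where neither side contains a blocked phrase.

    Both sides of a pair are dropped together so the corpus stays aligned.
    Matching is a case-insensitive substring test, so multi-word phrases work,
    but innocent words that contain a blocked phrase are dropped as well.
    """
    lowered = [phrase.lower() for phrase in blocklist]
    for src, tgt in zip(src_lines, tgt_lines):
        # Join with a newline so a phrase cannot match across the two sides.
        pair_text = (src + "\n" + tgt).lower()
        if not any(phrase in pair_text for phrase in lowered):
            yield src, tgt


# Example with made-up data (the Turkish lines are illustrative, not from CCAligned).
en = ["Hello world", "86 xHampster", "Good morning"]
tr = ["Merhaba dünya", "xHampster", "Günaydın"]
kept = list(filter_parallel(en, tr, ["xhampster"]))
```

Checking both sides of each pair matters here: the grep above only searched the English files, but a pair is just as harmful if the term appears only on the Turkish side.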

CaptainCaptcha commented 3 years ago

@PJ-Finlay That's interesting, thanks for your research.

Thewisem commented 3 years ago

Didn't know LibreTranslate could become a search engine for which porn website I want to watch lol