CaptainCaptcha closed this 2 years ago
Ah, interesting! It might have to do with the quality of the training data used for the Turkish model.
@pierotofy Yeah, looks like some idiots trolled the data for the Turkish model. You might want to use another one. Thanks for the response.
Well wasn't expecting this...
argosopentech@dev:~/turkish$ grep -n -i xhampster *.en
CCAligned.en-tr.en:5169773:Hen*** Xhampster
CCAligned.en-tr.en:5633651:xHampster
CCAligned.en-tr.en:5953336:Xhampster
CCAligned.en-tr.en:6462245:www xhampster
CCAligned.en-tr.en:6890391:86 xHampster
CCAligned.en-tr.en:7864361:86 xHampster
CCAligned.en-tr.en:8557318:m**ure a**l xhampster
CCAligned.en-tr.en:8778707:86 xHampster
CCAligned.en-tr.en:9356316:64 xHampster
CCAligned.en-tr.en:9828279:xhampster
CCAligned.en-tr.en:13434664:sallow skirt - unforeseen synopsis be proper of xhampster
I don't think it was malicious though. It looks like this comes from the Turkish data in CCAligned, which is scraped from the internet automatically. There is a lot of porn on the internet, so it's probably not too surprising that this ended up in the data.
We could add "xhampster" to the list of filtered phrases, but until there's a more robust way of dealing with profanity it's probably not worth it. Training separate models without profanity is possible, but in some situations we may want to translate it correctly too. Generating profanity when it isn't given as an input definitely isn't ideal though.
Looking further, there's actually quite a bit of profanity in the data. CCAligned seems to have much more than WikiMatrix and some other datasets.
We already filter some words out before training and could add more if this becomes a problem.
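For anyone curious what this kind of pre-training filtering could look like, here is a minimal sketch. It is not the actual Argos Translate filtering code; the `BLOCKLIST` contents and the `contains_blocked` and `filter_pairs` names are made up for illustration. It drops a sentence pair if either side contains a blocked phrase, so the source and target files stay aligned.

```python
from typing import Iterable, Iterator, Sequence, Tuple

# Hypothetical blocklist for illustration; the project's real list of
# filtered phrases is not shown in this thread.
BLOCKLIST: Tuple[str, ...] = ("xhampster",)


def contains_blocked(text: str, blocklist: Sequence[str] = BLOCKLIST) -> bool:
    """Return True if any blocked phrase occurs in text, ignoring case."""
    lowered = text.casefold()
    return any(phrase.casefold() in lowered for phrase in blocklist)


def filter_pairs(
    source_lines: Iterable[str],
    target_lines: Iterable[str],
    blocklist: Sequence[str] = BLOCKLIST,
) -> Iterator[Tuple[str, str]]:
    """Yield aligned (source, target) pairs where neither side is blocked.

    Pairs are dropped as a unit so the two sides never get out of step.
    """
    for src, tgt in zip(source_lines, target_lines):
        if contains_blocked(src, blocklist) or contains_blocked(tgt, blocklist):
            continue
        yield src, tgt


if __name__ == "__main__":
    en = ["Hello world", "86 xHampster", "Good morning"]
    tr = ["Merhaba dünya", "xHamster", "Günaydın"]
    for src, tgt in filter_pairs(en, tr):
        print(src, "|", tgt)
```

A plain substring match like this is blunt: it only catches exact spellings on the list, which is part of why a longer blocklist alone isn't a robust fix.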
@PJ-Finlay That's interesting, thanks for your research.
Didn't know libretranslate could become a search for which porn website I want to watch lol
Hi, I am really confused right now. If you enter meaningless words like random characters, the English to Turkish translation gives some really weird results.
Here they are:
I did not manipulate the website in any way. You can try it by going to https://libretranslate.com/, selecting English -> Turkish, and then typing the things I wrote above. I was about to use this in a project of mine, but luckily I saw this first. How is this even possible? When I write something meaningless like "aaaa", the output should be "aaaa", not "xHamster". I'm both angry and laughing right now. This is ridiculous. Please fix this and ban the contributors of these translations.