Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Bot-generated sentences #1492

Open trang opened 7 years ago

trang commented 7 years ago

Problem

Lately, a large number of sentences have been added, seemingly by a bot.

https://tatoeba.org/eng/sentences/of_user/VITAE
https://tatoeba.org/eng/sentences/of_user/Strategos
https://tatoeba.org/eng/sentences/of_user/Alva

Here's a sample of the kind of sentences added:

Il désire gamberger.
Miou-Miou s'échappe.
Le cuisinier jongle.
Des scouts pleurent.
Vous alliez partout.
L'architecte dérive.
Des fakirs mèneront.
J'aplatis ces mises.
Des bébés embraient.
La voyageuse stagne.
Un groupe se coiffe.
Nous allons à Kyoto.
Des fumeurs aboient.
Le traitre guerroie.

Many of these sentences don't make much sense. While they aren't all incorrect, overall they don't add much value to the corpus.

Possible solution

Just as we've put a limit on the number of private messages that new users can send per day, we could put a limit on how many sentences a new contributor can add per day. This would at least give admins more time to react and keep thousands of nonsensical sentences from being added.
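To make the proposal concrete, here is a minimal sketch of such a per-day cap. It is only an illustration in Python, not Tatoeba's actual code: the names `Contributor` and `can_add_sentence` and both thresholds are hypothetical, since no concrete numbers have been proposed yet.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds; no concrete values have been proposed in this issue.
NEW_ACCOUNT_AGE = timedelta(days=14)
DAILY_LIMIT_FOR_NEW_USERS = 100


@dataclass
class Contributor:
    username: str
    registered_at: datetime


def can_add_sentence(user, added_at, now):
    """Return True if `user` may add one more sentence at time `now`.

    `added_at` is a list of datetimes of the user's earlier contributions.
    """
    if now - user.registered_at >= NEW_ACCOUNT_AGE:
        # Established accounts are not limited in this sketch.
        return True
    window_start = now - timedelta(days=1)
    # Count contributions made during the last 24 hours.
    recent = sum(1 for t in added_at if t > window_start)
    return recent < DAILY_LIMIT_FOR_NEW_USERS
```

A rolling 24-hour window is used here instead of a calendar day so that a bot cannot add a double batch around midnight; either choice would fit the proposal.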

Ppjet6 commented 7 years ago

On 2017/08/13, Trang wrote:

> Problem
>
> Lately there has been a large number of sentences added seemingly by some bot.
>
> https://tatoeba.org/eng/sentences/of_user/VITAE https://tatoeba.org/eng/sentences/of_user/Strategos https://tatoeba.org/eng/sentences/of_user/Alva
>
> Here's a sample of the kind of sentences added:
>
> Il désire gamberger.
> Miou-Miou s'échappe.
> Le cuisinier jongle.
> Des scouts pleurent.
> Vous alliez partout.
> L'architecte dérive.
> Des fakirs mèneront.
> J'aplatis ces mises.
> Des bébés embraient.
> La voyageuse stagne.
> Un groupe se coiffe.
> Nous allons à Kyoto.
> Des fumeurs aboient.
> Le traitre guerroie.
>
> A large part of these sentences don't make much sense. While they aren't all incorrect, they are overall not bringing high value to the corpus.

The timestamps of the sentences do suggest, though, that they used some kind of automation to add them.

Honestly I don't think this particular example is harmful to Tatoeba. The sentences seem relatively correct.

> Possible solution
>
> Just like we've put a limit for the amount of private messages that new users can send per day, we could put a limit on how many sentences a new contributor can add per day. This would at least give more time for admins to react and avoid thousands of nonsensical sentences being added.

I am not against what you suggest, but I think it would be good to spell out in a bit more detail what we want from an implementation. I'm about to board a plane atm; I'll try to have a look at what we currently do for messages.

-- Maxime “pep” Buquet

halfdan commented 7 years ago

It might also be a good idea to think of a level system where you have to contribute a couple sentences and get those reviewed by another member before being allowed to continue.
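As a rough illustration of what such a gate could look like, here is a small Python sketch. Everything in it is an assumption made for the example: the `ContributorState` class, its method names, and the threshold of 5 approved sentences are hypothetical and not part of Tatoeba.

```python
from dataclasses import dataclass

# Hypothetical threshold: approved sentences needed before the gate opens.
REQUIRED_APPROVED = 5


@dataclass
class ContributorState:
    approved: int = 0  # sentences another member has reviewed and accepted
    pending: int = 0   # sentences still waiting for a review

    def can_contribute(self):
        # Past the threshold, the contributor is no longer gated.
        if self.approved >= REQUIRED_APPROVED:
            return True
        # Otherwise, approved plus pending sentences may not exceed the threshold.
        return self.approved + self.pending < REQUIRED_APPROVED

    def submit(self):
        if not self.can_contribute():
            raise PermissionError("waiting for earlier sentences to be reviewed")
        self.pending += 1

    def review(self, accepted):
        if self.pending == 0:
            raise ValueError("nothing to review")
        self.pending -= 1
        if accepted:
            self.approved += 1
```

In this sketch a rejected sentence frees a slot, so a newcomer who makes an honest mistake can try again instead of being locked out.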

Ppjet6 commented 7 years ago

On 2017/10/19, Fabian Becker wrote:

> It might also be a good idea to think of a level system where you have to contribute a couple sentences and get those reviewed by another member before being allowed to continue.

Yep, that would be a good idea. I'd be quite in favor of that. We'd have to find sensible defaults, though, so as not to discourage legitimate users arriving for the first time.

That could also go into the gamification discussion that appeared at some point (at least on the channel). I'm not sure there is any issue about this yet.

-- Maxime “pep” Buquet

ckjpn commented 5 years ago

Related:

https://tatoeba.org/eng/sentences/show/7701672

trang commented 3 years ago

Just for the record, there's been a recent report of bots: https://tatoeba.org/eng/wall/show_message/36111#message_36111

ckjpn commented 3 years ago

Just in case this helps, here is the number of sentences that each username contributed last week, per language, counting only sentences that DO NOT have linked sentences. This may help you identify some of the problem usernames. Note that this is only one week's data, limited to usernames with over 100 unlinked sentences in the given language.

kab : 9522 : https://tatoeba.org/eng/user/profile/Iflis_Illel
kab : 8017 : https://tatoeba.org/eng/user/profile/imalaqvayli
kab : 4095 : https://tatoeba.org/eng/user/profile/Selyan
kab : 1705 : https://tatoeba.org/eng/user/profile/Igider
eng : 1000 : https://tatoeba.org/eng/user/profile/CK
kab : 978 : https://tatoeba.org/eng/user/profile/Ubezwi1
kab : 838 : https://tatoeba.org/eng/user/profile/alemfarid
kab : 802 : https://tatoeba.org/eng/user/profile/yiwenkan
hun : 703 : https://tatoeba.org/eng/user/profile/Tilelli
eng : 663 : https://tatoeba.org/eng/user/profile/Amastan
kab : 470 : https://tatoeba.org/eng/user/profile/BenkerouHani
ber : 423 : https://tatoeba.org/eng/user/profile/Tilelli
hun : 392 : https://tatoeba.org/eng/user/profile/Tamazight
spa : 367 : https://tatoeba.org/eng/user/profile/Javea
ber : 327 : https://tatoeba.org/eng/user/profile/Tamazight
eng : 299 : https://tatoeba.org/eng/user/profile/IE
eng : 299 : https://tatoeba.org/eng/user/profile/DJ_Saidez
ber : 218 : https://tatoeba.org/eng/user/profile/Mouloud
kab : 197 : https://tatoeba.org/eng/user/profile/AmarMecheri
rus : 178 : https://tatoeba.org/eng/user/profile/marafon
spa : 126 : https://tatoeba.org/eng/user/profile/Tagawawt
deu : 122 : https://tatoeba.org/eng/user/profile/Pfirsichbaeumchen
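For anyone who wants to produce a similar list, here is a minimal Python sketch of the counting step. It is not the script used for the numbers above. The function name `unlinked_counts` and the input shapes are assumptions: sentences as `(sentence_id, lang, username)` tuples and links as `(sentence_id, translation_id)` pairs, already filtered to the week of interest by the caller.

```python
from collections import Counter


def unlinked_counts(sentences, links, min_count=100):
    """Count unlinked sentences per (language, username).

    sentences: iterable of (sentence_id, lang, username) tuples.
    links: iterable of (sentence_id, translation_id) pairs.
    Returns (lang, count, username) rows with count > min_count,
    largest count first.
    """
    # A sentence counts as linked if it appears on either side of a link.
    linked = set()
    for sentence_id, translation_id in links:
        linked.add(sentence_id)
        linked.add(translation_id)

    counts = Counter(
        (lang, username)
        for sentence_id, lang, username in sentences
        if sentence_id not in linked
    )
    rows = [(lang, n, user) for (lang, user), n in counts.items() if n > min_count]
    # Sort by descending count, then language and username for stable output.
    return sorted(rows, key=lambda row: (-row[1], row[0], row[2]))
```

A high count is only a hint, not proof of a bot: established contributors also appear in the list above, so the output is meant as a starting point for manual review.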