Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

It's inconvenient to massively delete spam sentences #2357

Open trang opened 4 years ago

trang commented 4 years ago

The sentences in language "unknown" are currently filled with spam sentences.

Ricardo regularly checks the list to see whether new users are adding sentences in new languages, and he has to navigate through all the spam to do so.

AndiPersti commented 4 years ago

Shouldn't we delete the spam sentences?

jiru commented 4 years ago

How can we know a sentence is spam or not since it can be in virtually any language?

AndiPersti commented 4 years ago

I guess @trang is talking about the sentences from the latest Chinese spammer which make up about 50% of the sentences in an unknown language. (I've just searched for the distinct signature in the sentence text.)

trang commented 4 years ago

Yes, I was talking about the Chinese spam that we had recently.

We can indeed simply delete these sentences, but there are two problems. First, since there are many of them, it's very tedious to delete them one by one. Second, it will create temporary spam in the "Latest contributions". In my case, I would not feel very compelled to delete the spam sentences because I don't want to pollute the "Latest contributions". If it's just one or two sentences, that's no problem. If it's hundreds of sentences, it doesn't feel nice to see them on the homepage.

I'll change the title because the root issue is that it's inconvenient to massively delete spam sentences. The fact that it's difficult to check for legit unknown language sentences is a side effect.

jrpear commented 4 years ago

I think I'd like to work on this issue next.

I'm still not quite sure about GitHub etiquette. Should I start discussing how I think this should be implemented on this issue page or on an empty PR like last time?

AndiPersti commented 4 years ago

Instead of opening an empty PR I think it would be better to discuss your idea here on the issue.

jrpear commented 4 years ago

Ok, good to know.

Here's what I'm imagining for a solution:

Optional, not sure how useful this would be or how difficult it is to implement:

Slightly related: if one user is putting out a lot of spam, it may be useful to just look at contributions from that user. I know there's a "Latest contributions from user X" page, but I could only access it from the "Currently contributing" section of the "Latest contributions" page. There's no "contributions" tab on a user's profile page. I feel like it should be accessible from there. If that would be helpful, I can also make an issue and PR for that.

trang commented 4 years ago

Note that there is a "contributions" tab on the user's profile, but it's called "Logs". For instance, here's mine: https://tatoeba.org/eng/contributions/of_user/TRANG

There's one issue with using the "Latest contributions" page as a starting point: we don't display more than 1000 entries so it would be possible to mass delete only recent sentences. The spam sentences mentioned in the original description would not be deletable from the "Latest contributions".

As far as I'm concerned, I would say the best user experience would be to have something similar to what we have nowadays with emails.

Obviously, it would not be easy to implement this all at once so it's up to you to choose which part you prefer to tackle first. That is if you and other people agree it's a good idea :)

jrpear commented 4 years ago

Ahh okay good to know contributions are accessible from a user's profile page.

I'm thinking that these check boxes would be available on any contributions page, whether it be user or latest contributions. So for the spam from user aaaa1111 you'd be able to mass select from his user logs page.

This would be implemented in a way that it could easily be reused in a "spam folder" as well. This would be a step towards the spam folder, and useful on its own, so I think it's a good place to start.

Another small addition I think would be nice is an "invisible delete" feature, which would allow maintainers to delete sentences without those deletions being shown on "Latest contributions" or their own contributions page.

After that I could give users the option to report contributions as spam (or offensive, would help users like this one), and add a "spam/offensive folder" viewable only by maintainers that could be filtered by language and report type.

Then the final (quite possibly never reached) step would be building an automated spam detection tool. I imagine this will be especially difficult considering there are like 200+ languages on Tatoeba.

If this sounds agreeable, I'll open separate issues for:

Also, one thing that I think would be helpful to know:

AndiPersti commented 4 years ago

I don't think restricting the mass selecting/deleting of sentences to the "contribution" pages is very useful:

So how about using tags for marking sentences as spam? If #1923 is implemented it would be possible to mark spam sentences everywhere where a sentence block is displayed. This would be very useful for the search page because usually spam sentences have a distinct pattern and searching for that pattern will list all sentences on one page independent of who added them. (For mass tagging we need to implement #785.) The "spam folder" would be the spam tag's page where moderators could review/delete them.
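To make the tag-based idea concrete, here is a minimal Python sketch of "search by pattern, mass tag, then review on the tag's page". It is only an illustration under assumptions: the `Sentence` class, the `tag_matching` and `tag_page` helpers, and the sample data are all hypothetical, and none of this reflects how Tatoeba's search or tagging is actually implemented.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Sentence:
    # Hypothetical in-memory stand-in for a sentence record.
    id: int
    text: str
    lang: Optional[str] = None  # None models the "unknown" language
    tags: Set[str] = field(default_factory=set)


def tag_matching(sentences: List[Sentence], pattern: str, tag: str = "spam") -> List[int]:
    """Add `tag` to every sentence whose text matches `pattern`; return their ids."""
    regex = re.compile(pattern)
    tagged = []
    for sentence in sentences:
        if regex.search(sentence.text):
            sentence.tags.add(tag)
            tagged.append(sentence.id)
    return tagged


def tag_page(sentences: List[Sentence], tag: str = "spam") -> List[Sentence]:
    """Sentences carrying `tag`: the list a moderator would review or delete."""
    return [s for s in sentences if tag in s.tags]


# Made-up data: two sentences share a distinct spam signature.
corpus = [
    Sentence(1, "Buy cheap watches at example.invalid"),
    Sentence(2, "Mi estas feliĉa.", lang="epo"),
    Sentence(3, "Best deals: example.invalid today"),
]
tagged_ids = tag_matching(corpus, r"example\.invalid")
review_list = tag_page(corpus)
```

The point of the sketch is that the selection is driven by the sentence text, not by who added the sentence or when, so it is not limited to what a contributions page can show.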

> Another small addition I think would be nice is an "invisible delete" feature, which would allow maintainers to delete sentences without those deletions being shown on "Latest contributions" or their own contributions page.

I think the "invisible delete" should only prevent the insertion into last_contributions but otherwise there should still be a normal log entry in contributions.
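A rough Python sketch of that behaviour, assuming a simplified model in which `contributions` is the permanent log and `last_contributions` is the feed behind the "Latest contributions" page. The `Corpus` class, its fields, and the `invisible` flag are hypothetical names for illustration, not the actual schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Corpus:
    # Hypothetical model, not the real database schema.
    sentences: Dict[int, str] = field(default_factory=dict)
    contributions: List[dict] = field(default_factory=list)       # permanent log
    last_contributions: List[dict] = field(default_factory=list)  # homepage feed

    def delete(self, sentence_id: int, user: str, invisible: bool = False) -> None:
        """Delete a sentence; an invisible delete skips only the homepage feed."""
        text = self.sentences.pop(sentence_id)  # raises KeyError for unknown ids
        entry = {
            "action": "delete",
            "sentence_id": sentence_id,
            "text": text,
            "user": user,
        }
        self.contributions.append(entry)  # always logged
        if not invisible:
            self.last_contributions.append(entry)


corpus = Corpus(sentences={1: "spam spam spam", 2: "Hello."})
corpus.delete(1, user="maintainer", invisible=True)  # logged, hidden from feed
corpus.delete(2, user="maintainer")                  # logged and shown in feed
```

Both deletions end up in `contributions`, so the audit trail stays complete; only the regular deletion reaches `last_contributions`.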

jrpear commented 4 years ago

Ah yes, being able to search with a pattern that is in a lot of spam strings seems very important, so however spam deletion is done, it definitely should be compatible with searching.

I read through the discussion on #785 and I got the impression that the tag feature is already pretty overloaded. I think that using it for spam, while quick and easy, would exacerbate the problem. Seems like there's still a lot that needs to be done with tags.

So, I think it should be kept as a separate feature.

@trang's solution seems like the way to go.

Add a report button, accessible by all users. From here, users can report a sentence as spam or offensive.

Add a report management page that only corpus maintainers can view with mass deletion tools. This could be filtered by language.

Automatic spam detection comes last.
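As a sketch of what the report management step could look like, here is a small Python example that filters a report queue by language and report type and then collects the distinct sentences to mass delete. Everything in it is an assumption made for illustration: the `Report` fields, the `filter_reports` and `sentences_to_delete` helpers, and the sample queue are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Report:
    # Hypothetical report record created when a user reports a sentence.
    sentence_id: int
    lang: str      # language code of the reported sentence
    kind: str      # "spam" or "offensive"
    reporter: str


def filter_reports(
    reports: List[Report],
    lang: Optional[str] = None,
    kind: Optional[str] = None,
) -> List[Report]:
    """Keep reports matching the given language and/or type; None means no filter."""
    return [
        r for r in reports
        if (lang is None or r.lang == lang) and (kind is None or r.kind == kind)
    ]


def sentences_to_delete(reports: List[Report]) -> List[int]:
    """Distinct sentence ids in first-seen order, so each is deleted only once."""
    return list(dict.fromkeys(r.sentence_id for r in reports))


# Made-up queue: sentence 10 was reported twice.
queue = [
    Report(10, "cmn", "spam", "alice"),
    Report(10, "cmn", "spam", "bob"),
    Report(11, "eng", "offensive", "alice"),
    Report(12, "cmn", "offensive", "carol"),
]
cmn_spam = filter_reports(queue, lang="cmn", kind="spam")
ids = sentences_to_delete(cmn_spam)
```

Deduplicating by sentence id matters here because several users may report the same sentence, and a maintainer would want to review and delete it once.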

> I think the "invisible delete" should only prevent the insertion into last_contributions but otherwise there should still be a normal log entry in contributions.

Makes sense.