Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Unapproved sentences (marked red) should not be included in the bilingual pair exports. #3054

Closed KK-kaku closed 1 year ago

KK-kaku commented 1 year ago

Sentence #5254855 is unapproved. [Screenshot 2023-04-23 183036]

However, it's included in the jp-en pairs data. [Screenshot 2023-04-23 183937]

The Sentences data file excludes it. [Screenshot 2023-04-23 183433]

ckjpn commented 1 year ago

I think this was a purposely implemented feature rather than a bug.

I'm fairly certain that such sentences are excluded because they are very likely to be inappropriate for people studying a language.

KK-kaku commented 1 year ago

Sorry, my comment wasn't clear. My suggestion is to exclude unapproved sentences from the pairs data.

ckjpn commented 1 year ago

This should be considered a bug then.

I suggest changing the title to: Unapproved sentences (marked red) should not be included in the bilingual pair exports.

I think the following concisely sums up your request.

Unapproved sentences (marked red) should not be included in the bilingual pair exports. They should be excluded for the same reasons they are excluded from the weekly exports (copyright infringements, outright errors in sentences that have not yet been deleted for some reason, etc.).

Note that this, of course, doesn't guarantee that both items in each pair will be correct, but it is a step in the right direction.
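
For anyone working with files that were already downloaded, the filtering requested here can be approximated downstream. The following is a minimal sketch, not Tatoeba's export code. It assumes the sentences file is tab-separated with the sentence ID in the first column, and that the pairs file is tab-separated with the columns source ID, source text, translation ID, translation text. The function names `load_approved_ids` and `filter_pairs` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def load_approved_ids(sentences_file):
    """Collect sentence IDs from a tab-separated sentences export.

    Assumption: the ID is in column 0, and unapproved sentences are
    absent from this file (as reported in this issue).
    """
    reader = csv.reader(sentences_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0] for row in reader if row}


def filter_pairs(pairs_file, approved_ids):
    """Yield only pairs whose two sentence IDs both appear in approved_ids.

    Assumption: columns are source ID, source text, translation ID,
    translation text.
    """
    reader = csv.reader(pairs_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip malformed or empty lines
        if row[0] in approved_ids and row[2] in approved_ids:
            yield row


# Made-up sample data: sentence 4 is missing from the sentences file,
# so the pair that references it is dropped.
sentences = io.StringIO("1\tjpn\tA\n2\teng\tB\n3\tjpn\tC\n")
pairs = io.StringIO("1\tA\t2\tB\n3\tC\t4\tD\n")

for pair in filter_pairs(pairs, load_approved_ids(sentences)):
    print(pair)
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `encoding="utf-8"`. As noted above, this only removes pairs that reference a sentence missing from the sentences file; it does not check that the remaining translations are correct.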