Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

have a way to detect duplicates which differ only in terms of script #76

Closed loolmeh closed 4 years ago

loolmeh commented 10 years ago

such as traditional/simplified Chinese


Original ticket: https://www.assembla.com/spaces/tatoeba2/tickets/296

jiru commented 7 years ago

So this could be implemented rather easily now that #77 is implemented. However, when merging two or more sentences, which one should we keep? There are pairs of Chinese duplicates written in traditional and simplified by different users. I don’t think either of them would be happy with having their sentence deleted in favour of the other.
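
A minimal sketch of the detection half only, not Tatoeba code: `TRAD_TO_SIMP` is a hypothetical, deliberately tiny character table and `find_script_duplicates` is a made-up name, and whatever #77 provides may work quite differently. The idea is to map every sentence to a script-neutral key and group the sentences that share one. It only reports candidates; which sentence to keep remains a human decision.

```python
from collections import defaultdict

# Hypothetical traditional-to-simplified table with only four entries.
# A real table would be far larger and would have to deal with
# characters that do not map one-to-one.
TRAD_TO_SIMP = str.maketrans({"學": "学", "愛": "爱", "語": "语", "們": "们"})


def script_key(text: str) -> str:
    """Return a script-neutral key by folding traditional characters to simplified."""
    return text.translate(TRAD_TO_SIMP)


def find_script_duplicates(sentences):
    """Group sentence ids whose texts differ only in script.

    `sentences` is an iterable of (id, text) pairs. Groups whose members
    all have identical text are skipped, since those are plain duplicates
    rather than script variants.
    """
    groups = defaultdict(list)
    for sentence_id, text in sentences:
        groups[script_key(text)].append((sentence_id, text))

    candidates = []
    for members in groups.values():
        distinct_texts = {text for _, text in members}
        if len(distinct_texts) > 1:
            candidates.append([sentence_id for sentence_id, _ in members])
    return candidates


if __name__ == "__main__":
    sample = [(1, "我們愛語言學"), (2, "我们爱语言学"), (3, "你好")]
    print(find_script_duplicates(sample))
```

Returning groups of ids rather than merging anything keeps the sketch in line with the concern above: the output is a review list for a person to work through.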

trang commented 4 years ago

Closing because this is not something we can automate.

If we have two sentences that differ only in terms of script, we will have to handle the case manually, much as we would with French sentences that differ only by a space (#770).