Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Having the number of translations in the advanced search #3090

Open Guybrush88 opened 11 months ago

Guybrush88 commented 11 months ago

As reported by cojiluc on the wall:

Please consider adding "Number of Translations" (at least, at most) (link: direct, indirect) to the search criteria in Advanced Search.

Some advantages:

(1) This could be useful for people who want to translate or find the sentences with the most translations; these are sometimes among the most popular/universal or the easiest sentences.

(2) It could be useful for people who want to translate or find sentences with few translations; these are sometimes among the most "virgin" sentences, or the least noisy sentences, etc.

(3) Combining this criterion with some of the existing criteria could help users locate good sentences.

For sentence length, the advanced search already has this useful feature: Length (At least, At most).

"Number of translations" is no less important than some other criteria. Let's compare it with two existing criteria, "orphan" and "unapproved" sentences. The statistics below (for the top 20 languages on Tatoeba) show that for most languages, orphan and unapproved sentences are not a big deal. I am not saying the orphan/unapproved criteria are not useful; my only point is that if we have these criteria for filtering just a handful of sentences among tens of thousands, let's have "Number of Translations" as well.

Language; number of all sentences; number of orphan sentences; number of unapproved sentences

English; 1.8M; 47,173; 5,221
Russian; 1M; 243; 78
Italian; 868K; 0; 18
Esperanto; 736K; 23; 61
Turkish; 732K; 281; 237
Kabyle; 696K; 16; 42
Berber; 651K; 29; 546
German; 634K; 7; 50
French; 587K; 295; 6,383
Portuguese; 424K; 1,156; 86
Spanish; 407K; 11; 2,773
Hungarian; 401K; 2,048; 25
Japanese; 241K; 100,369; 176
Hebrew; 201K; 307; 19
Ukrainian; 184K; 0; 13
Dutch; 179K; 0; 36
Finnish; 147K; 25; 7
Polish; 124K; 0; 38
Lithuanian; 99K; 325; 2
Macedonian; 78K; 6; 2

https://tatoeba.org/it/wall/show_message/40365#!#message_40365

ckjpn commented 11 months ago

I suspect that if this were to be done, it might be best to not do this in real time, but to generate the number of direct links a sentence has only from time to time -- perhaps once a week, before the weekly downloadable files are created, and then also create a downloadable file with these numbers.
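A periodic batch job along these lines could be sketched roughly as follows. This is only a minimal sketch, not how Tatoeba actually builds its exports: it assumes the links data is available as tab-separated rows of `sentence_id<TAB>translation_id` with one row per direction, and the function names `count_direct_links` and `write_counts` are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable, TextIO


def count_direct_links(rows: Iterable[str]) -> Counter:
    """Count direct links per sentence id.

    Assumption: each row is "sentence_id<TAB>translation_id".
    """
    counts: Counter = Counter()
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip blank lines
        sentence_id, _translation_id = row.split("\t")
        counts[int(sentence_id)] += 1
    return counts


def write_counts(counts: Counter, out: TextIO) -> None:
    """Write "sentence_id<TAB>number_of_direct_links", sorted by id."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for sentence_id in sorted(counts):
        writer.writerow([sentence_id, counts[sentence_id]])
```

Sentences without any link never appear in the input, so they would be missing from the output and have to be treated as having zero links. With such a file, an "at least / at most" filter reduces to a simple comparison on the stored count.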

LBeaudoux commented 11 months ago

(1) This could be useful for people who want to translate or find the sentences with the most translations; these are sometimes among the most popular/universal or the easiest sentences.

From my experience, I've learned that the most linked sentences of a language are primarily those that are:

  1. older
  2. shorter
  3. translated/post-linked several times by a single Tatoeban

Surfacing the sentences with a high number of translations would reinforce these biases.

(2) It could be useful for people who want to translate or find sentences with few translations; these are sometimes among the most "virgin" sentences, or the least noisy sentences, etc.

I doubt that there are many translators out there looking for these "virgin" sentences.

(3) Combining this criterion with some of the existing criteria could help users locate good sentences.

I don't think an extra filter is the proper way to help translators find better sentences to translate. Rather, we should measure the relative number of translators for a sentence compared to its closest peers of the same language, age and length. And then we could use this popularity score as a sorting option for the advanced search.
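One possible reading of such a popularity score, as a rough sketch only: group sentences into peer buckets by language, age and length, then divide each sentence's translator count by the average of its bucket. The `Sentence` fields, the bucket sizes and the choice of a simple ratio are all assumptions made for illustration; nothing here reflects an existing Tatoeba implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sentence:
    id: int
    lang: str
    age_days: int      # days since the sentence was added (assumed field)
    length: int        # length in characters (assumed field)
    translators: int   # number of distinct translators (assumed field)


def peer_key(s: Sentence, age_bucket_days: int = 365, length_bucket: int = 10):
    """Bucket a sentence with peers of the same language, similar age and length."""
    return (s.lang, s.age_days // age_bucket_days, s.length // length_bucket)


def popularity_scores(sentences):
    """Return {sentence id: translators / average translators among its peers}."""
    groups = defaultdict(list)
    for s in sentences:
        groups[peer_key(s)].append(s)

    scores = {}
    for peers in groups.values():
        avg = mean(p.translators for p in peers)
        for p in peers:
            # A score above 1.0 means "more translators than comparable sentences".
            scores[p.id] = p.translators / avg if avg > 0 else 0.0
    return scores
```

Sorting search results by this score in descending order would favor sentences that attract more translators than their peers, rather than sentences that are merely old or short.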

ckjpn commented 11 months ago

I, too, sort of doubt this would be all that useful, for the reasons mentioned above.

As for "virgin" sentences, those with no translations, these can already be found using the "exclude", "any language", and "direct link" or "any link" options.

Template (pre-filled form): https://tatoeba.org/en/sentences/advanced_search?&trans_filter=exclude&trans_link=&sort=random

Currently 1,827,697 occurrences, i.e. 15.5% of our sentences (1,827,697/11,779,865).
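For anyone scripting this, the pre-filled search link above can be rebuilt from its query parameters. This is a minimal sketch that uses only the three parameters visible in that link (`trans_filter`, `trans_link`, `sort`); the helper name `untranslated_search_url` is hypothetical, and any other parameter the search form may accept is not assumed here.

```python
from urllib.parse import urlencode

BASE_URL = "https://tatoeba.org/en/sentences/advanced_search"


def untranslated_search_url(sort: str = "random") -> str:
    """Build the advanced-search link for sentences without translations."""
    params = {
        "trans_filter": "exclude",  # exclude sentences that have translations
        "trans_link": "",           # left empty, as in the template link
        "sort": sort,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(untranslated_search_url())
```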