Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Mark sentences as "not directly linked" #1980

Open ckjpn opened 4 years ago

ckjpn commented 4 years ago

Mark sentences as "not directly linked".

Reason / Why does it help?

In short: it lets contributors work together on finding directly linkable sentences.

If someone wants to find indirectly linked sentences that could be directly linked, they can use this search: https://tatoeba.org/eng/sentences/search?query=&from=jpn&to=deu&orphans=no&unapproved=no&user=&tags=&list=&has_audio=yes&trans_filter=limit&trans_to=deu&trans_link=indirect&trans_user=&trans_orphan=&trans_unapproved=&trans_has_audio=&sort=words

They read through all 65 pages of sentences and conclude that none of the indirectly linked sentences are directly linkable.

Now a second user wants to do the same thing: directly link indirectly linked sentences. This second user also has to read through all 65 pages of sentences, only to find out that there is nothing to do.

Poulpisator commented 4 years ago

I would also be happy to have this function! I wasn't aware of that Wall message, so I only realized just now that I asked the same thing on the Wall a few weeks later. For the same reasons presented above, I think it could give a very good productivity boost to users doing this kind of linking work.

Inego commented 4 years ago

After the issue was renamed, it became much harder to understand what is being requested. To me it seems that the requested feature is in fact "the ability to explicitly mark two sentences as directly unlinkable".

alanfgh commented 4 years ago

To me it seems that the requested feature is in fact "the ability to explicitly mark two sentences as directly unlinkable".

No, that is not correct. The requested feature is the ability to search for sentences that are already indirectly linked, but not directly linked. The idea is that indirectly linked sentences are often good candidates for direct linking, so a search that finds indirectly linked sentences but excludes directly linked sentences would be an efficient way of bringing up good sentence pairs to link directly.
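A minimal sketch of the search described here, assuming the direct links are available as plain pairs of sentence IDs. The function name `indirect_only_pairs` and the sample IDs are hypothetical and do not reflect Tatoeba's actual schema or search backend; "indirectly linked" is taken to mean "sharing at least one direct translation".

```python
from itertools import combinations


def indirect_only_pairs(direct_links):
    """Return pairs that share a direct translation but are not directly linked."""
    # Build an undirected adjacency map from the direct links.
    neighbours = {}
    for a, b in direct_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    candidates = set()
    for linked in neighbours.values():
        # Any two sentences linked to the same pivot are indirectly linked.
        for a, b in combinations(sorted(linked), 2):
            # Keep the pair only if no direct link exists between them.
            if b not in neighbours[a]:
                candidates.add((a, b))
    return candidates


# Hypothetical sample data: jpn1 <-> eng1 <-> deu1, and jpn1 <-> deu2.
links = [("jpn1", "eng1"), ("eng1", "deu1"), ("jpn1", "deu2")]
pairs = indirect_only_pairs(links)
```

Here `pairs` contains `("deu1", "jpn1")` (linked through `eng1`) and `("deu2", "eng1")` (linked through `jpn1`), which are exactly the pairs a contributor would review for direct linking.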

AndiPersti commented 4 years ago

To me it seems that the requested feature is in fact "the ability to explicitly mark two sentences as directly unlinkable".

No, that is not correct. The requested feature is the ability to search for sentences that are already indirectly linked, but not directly linked.

But I agree with Inego and don't think that's what MisterTrouser suggested in the wall message. Furthermore (as mentioned in that wall message) isn't this search already possible?

Inego commented 4 years ago

What @ckjpn said (as far as I understand) is that he was missing the ability to somehow make certain sentences disappear from the "have indirect translation" search results, so that (assuming they cannot in fact be directly linked) they would no longer appear in this search and litter the other, useful (i.e. possibly linkable) results. The point is that establishing that two sentences are not directly linkable is work, and its result ought to be persisted somewhere and be useful in future searches.

EDIT: I'm sorry, that was not what @ckjpn said, but a quote from the Wall. That doesn't change the original idea as I perceive it.

alanfgh commented 4 years ago

Sorry, I didn't read the thread closely enough. @Inego and @AndiPersti, you're right. Part of my confusion was from the fact that MisterTrouser used the phrase "Mark sentences as 'not directly linked'", whereas "Mark sentences as 'not directly linkable'" would be clearer.

ckjpn commented 4 years ago

I've renamed this issue and opened a new one. https://github.com/Tatoeba/tatoeba2/issues/2031 Feature Suggestion: Be able to search for "indirectly linked, but not directly linked"

Poulpisator commented 4 years ago

I made the same mistake as Alan, my bad.

But then, I think that this feature request would be very harmful in the requested form. From my point of view, marking sentences as not directly linkable can only be done at the user level, not the corpus level. The reason is simple: Michel may think that sentence A and sentence B cannot be linked because he lacks some knowledge, while Bernard, who knows one or both of the languages better, would link these two sentences.

This happens quite often with sentences that are already linked, so if we allowed people to mark pairs as "unlinkable" without discussion, that could easily become a disaster.

Also, proofreading can only get better if several people "lose" their time on the same sentences... I have proofread dozens of users' full lists of sentences, and I wouldn't claim that the ones I proofread are good to go and that nobody else needs to check them. So one person alone deciding for the whole corpus seems very silly (and arrogant).

Inego commented 4 years ago

@Poulpisator, the same reasoning can be applied to the traditional action of direct linking of two sentences. It is also performed at the corpus and not the user level, and so a mistake of one user can affect search results for other users. For example, Bernard, looking for "untriaged indirect sentences", would miss the pair of sentences directly linked by Michel, because Michel erroneously thought they should be linked. So, in my opinion, the requested action of explicit "de-linking" is no more harmful than direct linking. Also, the search function could be enhanced by allowing independent filters, which could be set according to your heart's content:

Poulpisator commented 4 years ago

Personal opinion aside, I think the only good way to implement this correctly, if you really want to, is to merely indicate on the sentence page that an indirect translation was marked as "unlinkable". I prefer "unlinkable" to "de-linked" because anything not linked is de-linked ^^

Let me explain myself. We're not talking about some machine process here, but about human proofreading, so here are some differences and problems:

  1. Linking is about showing something to the rest of the world (the contributors). Marking as unlinkable is about hiding.
  2. Of course, when you link something you may have made a mistake, but when people proofread sentences they can proofread the links at the same time, because the links are displayed no matter what (unless you explicitly hid them, but then you're not using the full potential of proofreading; see the next point).
  3. Let's say you mark a sentence as unlinkable. What do you do with that? Do you hide the indirect translation from the sentence? No, because indirect translations are a very helpful proofreading aid. If you have French A <=> Spanish A <=> French B, and you know that French A and French B do not say the same thing, you can be suspicious about Spanish A. If you do not hide unlinkables, you need to mark them somehow, because otherwise they would only clutter the sentence page. So maybe a broken chain link instead of the current icon, or something else. Also, do we restrict marking as unlinkable to indirect translations only? (I personally think we should.)
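The French A <=> Spanish A <=> French B check in point 3 could be sketched roughly as follows. This is only an illustration under assumptions: `suspicious_pivots`, the pair-based data layout, and the sentence IDs are all hypothetical and are not part of Tatoeba's code.

```python
def suspicious_pivots(direct_links, unlinkable_pairs):
    """Return sentences directly linked to both ends of an 'unlinkable' pair."""
    # Build an undirected adjacency map from the direct links.
    neighbours = {}
    for a, b in direct_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    suspects = set()
    for a, b in unlinkable_pairs:
        # A pivot translating two sentences that do not mean the same thing
        # deserves a second look from a proofreader.
        suspects |= neighbours.get(a, set()) & neighbours.get(b, set())
    return suspects


# Hypothetical data mirroring the example in point 3.
direct = [("fra_A", "spa_A"), ("spa_A", "fra_B"), ("fra_A", "eng_A")]
unlinkable = [("fra_A", "fra_B")]
suspects = suspicious_pivots(direct, unlinkable)
```

With this data, `suspects` contains only `"spa_A"`, the sentence through which the two non-equivalent French sentences are connected.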

Oh, and about searching for sentences marked as unlinkable: let's be realistic here, nobody will ever run that search except some depressed, drunk corpus maintainer once in a blue moon. Of course you can say that this is only my personal opinion (and it would not be true if we had independent search filters, as Inego mentioned), but we do not have enough people to proofread the core of the corpus, so the way to proofread links is simply to proofread sentences. Hence, if unlinkables are displayed on the sentence page, we can proofread them; if not, they will never be proofread.

Inego commented 4 years ago

Hmm, this issue was not an idle flight of fancy, but came from the struggles of real people doing real work. There is a queue of unsorted/untriaged pairs of indirect translations. Deciding whether any of them is, or is not, directly linkable is work, and its result deserves to be stored whether it is positive (action: "link") or negative (action: "mark directly unlinkable"). The benefit of the positive result, linking, is obvious. The benefit of the negative result is subtle: first, it will help advanced contributors not to make the same decision over and over again ("can I link the sentence? no. next!"). Second, when displaying all translations of a sentence, you could show directly unlinkable translations in a different way, as a hint to the user that these specific sentences cannot be used as direct translations of the original sentence. As it is now, both linkable and unlinkable indirect translations are displayed in the same way.
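One way to picture storing both outcomes of the triage is sketched below. Everything in it is an assumption made for illustration: the `TriageStore` class, its method names, and the per-user bookkeeping are hypothetical and not an existing Tatoeba feature. Recording who made each decision lets the same data serve either a corpus-level filter (hide a pair once anyone has decided) or a user-level filter (hide only pairs the current user has decided), which is the distinction debated above.

```python
DECISIONS = ("link", "unlinkable")


class TriageStore:
    """Remembers which indirect pairs have already been reviewed, and by whom."""

    def __init__(self):
        # Maps an unordered pair of sentence IDs to {user: decision}.
        self._decisions = {}

    def record(self, user, a, b, decision):
        if decision not in DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        self._decisions.setdefault(frozenset((a, b)), {})[user] = decision

    def untriaged(self, candidates, user=None):
        """Return candidate pairs still awaiting a decision.

        With user=None a pair is hidden once anyone has decided on it
        (corpus level); otherwise only that user's decisions hide pairs.
        """
        remaining = []
        for a, b in candidates:
            by_user = self._decisions.get(frozenset((a, b)), {})
            decided = bool(by_user) if user is None else user in by_user
            if not decided:
                remaining.append((a, b))
        return remaining


# Hypothetical usage: Michel marks one pair as unlinkable.
store = TriageStore()
store.record("michel", "jpn1", "deu1", "unlinkable")
queue = [("deu1", "jpn1"), ("jpn2", "deu2")]
corpus_view = store.untriaged(queue)
bernard_view = store.untriaged(queue, user="bernard")
```

In this sketch `corpus_view` no longer contains the pair Michel reviewed, while `bernard_view` still shows both pairs, so Bernard can form his own opinion.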

ckjpn commented 4 years ago

...this feature request would be very harmful in the requested form

I think I agree with this. However, perhaps if we all brainstorm ideas, we could come up with something that would be useful.

I think the only good way to correctly implement this if you really want to is to somehow only indicate, on the page sentence, that an indirect translation was marked as "unlinkable".

Possibly these could be shown by color-coding such sentences, similar to how some sentences are marked in red now. This would at least mean people wouldn't have to re-read these sentences when looking for indirect links that could be directly linked. Color-coding would also mean that this information would be shown on search results, rather than hidden.