Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0
697 stars 132 forks

Divide what are now called comments into "annotations" and "suggested corrections" and maybe more. #1830

Open ckjpn opened 5 years ago

ckjpn commented 5 years ago

JeanM's suggestion here https://tatoeba.org/eng/wall/show_message/31516#message_31516 would likely improve the website.

Perhaps it would be useful to divide comments into:

  1. Needed or suggested corrections.
  2. Annotations
  3. Other: social stuff like "thank you" or "welcome to the project" and other things.

This could be done by adding more fields, or perhaps just by having a way to designate what kind of comment was being made. (drop-down menu or radio buttons)
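As a rough illustration of the "designate what kind of comment" option, here is a minimal Python sketch. It is not how Tatoeba stores comments; the `CommentKind` and `Comment` names, the field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class CommentKind(Enum):
    """The three kinds proposed above (hypothetical names)."""
    CORRECTION = "suggested correction"
    ANNOTATION = "annotation"
    OTHER = "other"


@dataclass
class Comment:
    sentence_id: int
    author: str
    text: str
    # Defaulting to OTHER keeps existing, uncategorized comments valid.
    kind: CommentKind = CommentKind.OTHER


def comments_of_kind(comments, kind):
    """Return only the comments of the requested kind."""
    return [c for c in comments if c.kind is kind]


# Made-up sample data, for illustration only.
comments = [
    Comment(1, "alice", "Missing period at the end.", CommentKind.CORRECTION),
    Comment(1, "bob", "Thank you!"),
    Comment(1, "carol", "Formal register.", CommentKind.ANNOTATION),
]
corrections = comments_of_kind(comments, CommentKind.CORRECTION)
```

A single `kind` value per comment maps directly onto a drop-down menu or radio buttons, and it avoids adding a separate field for each type.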

On top of this page, I mention several types of comments we have.

http://tatoeba.byethost3.com/stats-190302-comments.html Stats - 2019-03-02 - Number of Comments by Members

  1. If suggested corrections were a separate field, then it might even be possible to have check boxes that could be clicked when the suggested correction was dealt with.
     a. corrected
     b. correction not needed
     Both the owner of the sentence and corpus maintainers would need to be able to check these boxes.

  2. If after 2 weeks there were no response, perhaps it could automatically be tagged "@change or delete" and put on the list for corpus maintainers. Perhaps the tagging wouldn't even be necessary.
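The two points above could be sketched roughly as follows. This is only an illustration under assumptions: the `SuggestedCorrection` record, the `Resolution` values, and the `needs_maintainer` helper are hypothetical names, and "no response" is taken to mean that neither check box has been ticked.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional

# "After 2 weeks", as proposed above.
RESPONSE_WINDOW = timedelta(weeks=2)


class Resolution(Enum):
    """The two check boxes from point 1."""
    CORRECTED = "corrected"
    NOT_NEEDED = "correction not needed"


@dataclass
class SuggestedCorrection:
    sentence_id: int
    posted_on: date
    resolution: Optional[Resolution] = None  # None means nobody has responded


def needs_maintainer(correction, today):
    """True if the suggestion is unanswered and at least two weeks old."""
    unanswered = correction.resolution is None
    return unanswered and today - correction.posted_on >= RESPONSE_WINDOW


# Made-up dates and IDs, for illustration only.
today = date(2019, 3, 20)
old_unanswered = SuggestedCorrection(1, date(2019, 3, 1))
old_handled = SuggestedCorrection(2, date(2019, 3, 1), Resolution.CORRECTED)
recent = SuggestedCorrection(3, date(2019, 3, 15))

maintainer_queue = [
    c.sentence_id
    for c in (old_unanswered, old_handled, recent)
    if needs_maintainer(c, today)
]
```

With a check like this, the list for corpus maintainers can be computed directly from the unresolved suggestions, which is why the "@change or delete" tag might not be needed at all.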

Perhaps an annotation icon could be placed near the sentence so it's easily visible when one or more annotations exist. It could be clickable to jump to the annotation on the page.

  1. Maybe some of the other comments, not really related to the sentences, could even be automatically deleted after a few weeks.

jiru commented 5 years ago

I think your suggestion is a good idea, but I do not interpret JeanM’s suggestion this way. It seems to me that he is talking about part-of-speech tagging, with the goal of easing automatic replacement of proper nouns. In his idea, annotations would be linked to one or several words of the sentence, as opposed to staying in a separate comment.

jeanm commented 5 years ago

I was essentially suggesting named entity recognition (NER), which can already be done automatically with fairly high accuracy for resource-rich languages.

See here an example of automatic recognition by spaCy (which is open source) for two random sentences I just got off Tatoeba: English sentence 36212 and its German translation 2193164. You'll notice the two entities are recognised in both cases, although the German model mistakes Nancy for the French city in Lorraine :)

Sentences could be manually verified by e.g. native advanced contributors before being considered valid – or you might want a whole new class of "members who have been trained in NER manual annotation". Annotations could be fixed using an interface such as Prodigy, which is super easy (click on the label you want to assign, then select by dragging with your mouse) although sadly not open source. Languages that don't currently have NER support could be annotated manually at first, or by exploiting parallel sentences that have already been annotated in another language. For example: if Laërte has been tagged as a person in a French sentence, we can be almost certain that the token Лаэрт in its Russian translation is also a person. Given enough manual annotations (I'm guessing somewhere between 1k and 10k), an NER model could then be trained for those languages too.
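The projection idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions: the token alignment between the two sentences is taken as given (in practice it would have to come from a word aligner or manual work), the example sentences are invented around the Laërte / Лаэрт pair, and `project_entities` is a hypothetical helper, not part of spaCy or Prodigy.

```python
def project_entities(source_labels, alignment, target_len):
    """Copy entity labels from source tokens to their aligned target tokens.

    source_labels: one label (or None) per source token
    alignment:     (source_index, target_index) pairs, assumed to be given
    target_len:    number of tokens in the target sentence
    """
    projected = [None] * target_len
    for src_idx, tgt_idx in alignment:
        label = source_labels[src_idx]
        if label is not None:
            projected[tgt_idx] = label
    return projected


# Invented example sentences; only the name pair comes from the comment above.
fr_tokens = ["Laërte", "est", "ici"]
fr_labels = ["PER", None, None]   # annotation already available for French
ru_tokens = ["Лаэрт", "здесь"]
alignment = [(0, 0), (2, 1)]      # assumed alignment: Laërte-Лаэрт, ici-здесь

ru_labels = project_entities(fr_labels, alignment, len(ru_tokens))
```

Labels projected this way would still need manual verification, but they could seed the initial annotations from which a model for the new language is trained.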

jiru commented 5 years ago

@jeanm Thanks for your detailed explanation. This is really interesting.

To be honest, we’d rather put efforts into easing the collaboration with other projects who’d like to perform NER or whatever processing on the sentences, rather than implementing it within Tatoeba. We barely have enough resources to maintain the core features of Tatoeba, so this kind of enhancement is currently way out of scope. We already implemented "extra" features like audio recordings, transcriptions, ratings etc. and the interface became cluttered with too many extra buttons. We can’t and don’t want to do things this way any more.

If you were to start an NER project based on Tatoeba and maintain it, we’d be more than happy to collaborate. If not, we’re not going to be the ones to do it.

ckjpn commented 5 years ago

Here are a couple of demos that hint at what might be possible if this were done.

http://tatoeba.ueuo.com/annotations.html Annotation Comments on English (Updated: 2021-01-16)

http://tatoeba.ueuo.com/related.html "Related" Comments on All Languages (2019-04-27)

EDITED 2020-11-07. I uploaded the files to another server, since the previous server was offline.

jiru commented 5 years ago

I think comments would definitely benefit from some kind of categorization or filtering, but I’d like us to focus on the problem we’re trying to solve rather than on your proposed solution. Other than "annotations" and "suggested corrections", there could be many ways to categorize the comments. We could even have a separate form to suggest corrections, or an "I think that sentence needs to be changed" kind of checkbox. So my first question is: in what ways are the current comments bothering you, so that you’d rather have them categorized? I suggest that you (or anybody else) start a new thread on the Wall so that other members are more likely to participate, and so that we can ultimately come up with a solution that benefits the majority of the members.

ckjpn commented 4 years ago

TRANG, on the wall talking about the ratings. https://tatoeba.org/eng/wall/show_message/34038#message_34038

For instance, when you click "not OK", you could have a form to add your suggested improvement and wouldn't have to go to the sentence's page for that.

Maybe one type of comment could be "correction suggested" or "correction needed", created when someone selects the "not OK" rating, as TRANG suggests.

Once a correction has been made, and the rating changed to "outdated," make it easy for the person who made the suggestion to remove the no longer needed comment.

ckjpn commented 4 years ago

One problem with the way things are now is that, when duplicates are merged, the @change tag and the suggested correction comment(s) get moved to a lower-numbered sentence that was never incorrect.


Dividing what are now called comments into "annotations" and "suggested corrections" could help solve this problem, since you could have Horus not move the "suggested corrections" comments.

Additionally, I think that perhaps Horus shouldn't move tags like @change and @delete.

https://tatoeba.org/eng/sentences/show/316622 She sang with a beautiful voice.
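Here is a rough Python sketch of the merge behaviour proposed above. It does not reflect how Horus actually merges duplicates; the `Sentence` record, the comment kinds, and `merge_into_survivor` are hypothetical, and the sample data is made up.

```python
from dataclasses import dataclass, field

# Tags that describe a problem with one specific sentence.
FLAG_TAGS = {"@change", "@delete"}


@dataclass
class Sentence:
    sentence_id: int
    comments: list = field(default_factory=list)  # (kind, text) tuples
    tags: set = field(default_factory=set)


def merge_into_survivor(survivor, duplicate):
    """Move comments and tags to the surviving sentence, except
    suggested corrections and flag tags, which stay behind."""
    for kind, text in duplicate.comments:
        if kind != "suggested correction":
            survivor.comments.append((kind, text))
    survivor.tags |= duplicate.tags - FLAG_TAGS
    return survivor


# Made-up sample data, for illustration only.
survivor = Sentence(100, tags={"proverb"})
duplicate = Sentence(
    200,
    comments=[
        ("suggested correction", "Typo in the last word."),
        ("annotation", "Formal register."),
    ],
    tags={"@change", "formal"},
)
merge_into_survivor(survivor, duplicate)
```

In this sketch the annotation and the "formal" tag carry over, while the suggested correction and @change are not attached to the lower-numbered sentence.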