Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Segregate sentences created as translations from originals #1589

Closed: jiru closed this 6 years ago

jiru commented 6 years ago

We want a way to tell whether a sentence was created as a translation of another sentence or as an original sentence (from the /sentences/add page, for example).

The license of original sentences is simpler to handle because they are presumably not derived from any other work.

Related issue: #73.

trang commented 6 years ago

We had a meeting where we discussed various cases of when a sentence is original and when it is a derivative work. I'm documenting the cases here.

We would need confirmation from a legal expert whether or not the assumptions we are making are correct. There may also be other cases we need to consider.

Case 1

User A creates Sentence A. User B creates Sentence B as a translation of Sentence A. → Sentence B is a derivative work of Sentence A.

Case 2

User A creates Sentence A. User B creates Sentence B. User C links Sentence A to Sentence B. → Neither is a derivative work. They are both original sentences.

Case 3

User A creates Sentence A. User B creates Sentence B. User A links Sentence A to Sentence B. → Neither is a derivative work. They are both original sentences.

Case 4

User A creates Sentence A. User B creates Sentence B. User B links Sentence A to Sentence B. → Neither is a derivative work. They are both original sentences. Technically, User B could be hiding the fact that Sentence B is not an original sentence, but we cannot take responsibility for that.

Case 5

User A creates Sentence A. User B creates Sentence B as a translation of Sentence A. User B unlinks Sentence B from Sentence A because they realized they made a mistake. → Sentence B is still a derivative work of Sentence A.

Case 6

User A creates Sentence A. User B creates Sentence B as a translation of Sentence A. User C unlinks Sentence A from Sentence B because they are not correct translations of each other, but each sentence on its own is a correct sentence. → Sentence B is still a derivative work of Sentence A.
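The six cases above boil down to one rule: what matters is how a sentence was created, not which links exist afterwards. Here is a minimal Python sketch of that rule, assuming each sentence remembers the sentence it was created from. The `Sentence` class, the `based_on_id` field and the `links` set are hypothetical names used only for illustration, not the actual Tatoeba schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    id: int
    author: str
    # ID of the sentence this one was created as a translation of,
    # or None if it was added on its own (hypothetical field).
    based_on_id: Optional[int] = None


def is_derivative(sentence: Sentence) -> bool:
    """Only the way the sentence was created matters; links are ignored."""
    return sentence.based_on_id is not None


# Cases 1, 5 and 6: B was created as a translation of A.
a = Sentence(id=1, author="A")
b = Sentence(id=2, author="B", based_on_id=a.id)

# Cases 2 to 4: C was added on its own and linked to A afterwards.
c = Sentence(id=3, author="B")
links = {frozenset({a.id, b.id}), frozenset({a.id, c.id})}

# Cases 5 and 6: removing the link between A and B changes nothing,
# because is_derivative() never looks at the links.
links.discard(frozenset({a.id, b.id}))

print(is_derivative(a), is_derivative(b), is_derivative(c))
```

With this rule, B stays a derivative work even after being unlinked, and C stays original no matter who created the link.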

jiru commented 6 years ago

I am not sure how to deal with sentences that were added at the very beginning of the project. The log table indicates that sentences 1 to 330929 were first inserted without linking information, and then all their links were added at once, in the order of sentence IDs. I believe that’s because Tatoeba didn’t have the concept of links at the beginning. It looks like the sentences in a given group were all linked to each other.

In the log table, among sentences with id ≤ 330929 (I’ll call them the pre-link-era sentences), the vast majority have a creation datetime of zero (and no author), while a few of them have a valid creation time, and these creation times are not ordered. So I have two questions:

trang commented 6 years ago

Indeed at the beginning, the sentences were grouped via their id. If they all had the same id, they had the same meaning.

Actually it's possible to find out about these sentences' history. I have a backup of the old data, and it has logs. I've uploaded the file to our server, you'll find it where our daily dump is, @jiru.

jiru commented 6 years ago

Thank you, I feel like touching some kind of relic!

This dump (which, contrary to its name, is from around the 17th of January, 2009) contains exactly all the pre-link-era sentences, and is based on a previous version of Tatoeba that is not included in the current repository history. I’ll answer myself:

Trivia: sorry to ruin the myth, but according to this dump, the current sentence #1 is not the very first one. Sentence #1 was added on the 30th of September, 2007, and the original sentence of that group is actually sentence #5972 (originally added by MARIE on the 14th of July, 2007). The first sentence having a creation log in that dump is #5942 (originally added by you, Trang, on the 12th of July, 2007). It doesn’t necessarily mean that was the very first sentence, but at least that’s the oldest sentence whose creation we recorded.

jiru commented 6 years ago

I think we can ignore pre-link-era sentences for our first iteration, and I’ll open another issue about segregating pre-link-era sentences.

trang commented 6 years ago

Just a note regarding these pre-link-era sentences. The majority of them (80%, 90% maybe?) are from the Tanaka Corpus.

Assuming we are able to identify which sentences are from the Tanaka Corpus (I'll check that), there would be 2 possibilities: 1) We mark all the Tanaka Corpus sentences as "original" (i.e. both English and Japanese) because we just have no way to know which sentence really came first. 2) We decide arbitrarily which language is the original language between English and Japanese, and we consider that every sentence of that language is the original one.

Professor Tanaka published a paper about how he collected the sentences. I actually never read it before today. If I understand properly, he asked his student to gather sentences from bilingual newspaper articles on the Web. He mentions CNET as one of the resources, which leads me to think that most of the sentences are originally English ones.

The paper also mentions copyright issues, which makes me wonder if we can/should make these sentences CC0.

jiru commented 6 years ago

Assuming we are able to identify which sentences are from the Tanaka Corpus

I took note that IDs of sentences from the Tanaka corpus seem to be 15795 to 237063.

he asked his student to gather sentences from bilingual newspaper articles on the Web

I thought these sentences were genuinely written by students. If they were copied from newspapers, I don’t think it is compatible with CC-BY in the first place. It’s like copying sentences from newspapers directly into Tatoeba. How is that possible that the Tanaka corpus was originally licensed under the public domain? I’m quite confused.

trang commented 6 years ago

How is that possible that the Tanaka corpus was originally licensed under the public domain?

I wouldn't be able to answer this, but I will ask Jim Breen and Francis Bond. Perhaps one of them has more insight about the license (or absence of license) on the Tanaka Corpus.

There is perhaps also a different interpretation of "public domain" in Japan. Who knows.

trang commented 6 years ago

Okay, so I misunderstood the paper. Jim Breen clarified to me that the sentences of the Tanaka Corpus (212,000 sentences) were not copied and pasted from online newspapers. These sentences were crafted by students. Some of them may have been copied from somewhere, but we wouldn't be able to trace that.

The copyright issues mentioned in the paper only refer to the data collected from online newspapers (~16,000 sentences). But this set of data was not, as far as we know, included in the corpus that was released under the public domain.

Both Jim and Francis agree that we can quite safely go ahead with CC0.

ckjpn commented 6 years ago

We mark all the Tanaka Corpus sentences as "original" ....

Would they be considered "original" if corrected by members who assume their work is CC-BY? What I mean is can these sentences ever then be reverted back to public domain? Is that why you want to mark them as "original?"

trang commented 6 years ago

Would they be considered "original" if corrected by members who assume their work is CC-BY?

Note that the license and the originality of a sentence are two different things.

For the originality, I think we can assume that when a user corrects a sentence, they are okay with leaving the "full rights" to whoever is the owner of the sentence. In other words, if you correct one of my sentences, you won't be claiming that it's now your sentence. You'll be fine if it still remains my property.

So yes, by default, a sentence will still be considered "original" if it is corrected by someone else.

What I mean is can these sentences ever then be reverted back to public domain?

Yes, they can.

There might be exceptions, but we're not handling those cases for now.

Is that why you want to mark them as "original?"

The main reason for marking sentences as "original" is that it's easier to change the license of an original sentence than the license of a translation, since translations are derivative works.

We're not marking sentences as original specifically for the Tanaka Corpus. This is for everyone. I'm thinking for instance about the person who contacted us for their sentences in Tatar. Some of their sentences may be original, some of their sentences may be translations. It's a bit difficult for them to manually extract the original ones, to donate them to Common Voice.

jiru commented 6 years ago

I found a new problem: many sentences have more than one creation date. There are currently 65765 such sentences on tatoeba.org. Most of them have two creation records, but some have more. An extreme example is sentence 16492, with 14 creation records.

Here is a simple query to list all of them:

select sentence_id as sentence, count(sentence_id) as insertions
from contributions
where action = 'insert' and type = 'sentence'
group by sentence_id
having insertions > 1;

It’s a problem for the current ticket because depending on the creation record you’re looking at, a sentence may be seen as an original or as a translation. The highest ID of such sentences is 2480810, added in 2013, so I guess it was a temporary bug and we’re not producing them any more. Maybe it’s a result of this old deduplication script?

trang commented 6 years ago

It looks very much like it was the result of the old deduplication script.

In the case of "What are you doing?", it's a rather common sentence, so no surprise that it's been added many times.

I think you can simply take into account the date of the first creation record. The other records only mean that the sentence was added again later, as a duplicate.
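To make "take the first creation record" concrete, here is a small Python sketch. It assumes the creation records have already been fetched as (sentence id, datetime, translated-from id) tuples. The `CreationRecord` type, its field names and the sample data are made up for illustration and are not the real `contributions` columns.

```python
from datetime import datetime
from typing import Dict, Iterable, NamedTuple, Optional


class CreationRecord(NamedTuple):
    sentence_id: int
    created: datetime
    # ID of the sentence it was translated from, None if added on its own
    # (hypothetical field).
    translation_of: Optional[int]


def first_creation_records(
    records: Iterable[CreationRecord],
) -> Dict[int, CreationRecord]:
    """Keep only the earliest creation record of each sentence."""
    first: Dict[int, CreationRecord] = {}
    for record in records:
        current = first.get(record.sentence_id)
        if current is None or record.created < current.created:
            first[record.sentence_id] = record
    return first


# Made-up data: sentence 10 was first added on its own, then added
# again later as a translation of sentence 11 (a duplicate insertion).
records = [
    CreationRecord(10, datetime(2010, 5, 1), translation_of=11),
    CreationRecord(10, datetime(2008, 3, 1), translation_of=None),
    CreationRecord(11, datetime(2009, 1, 1), translation_of=None),
]

first = first_creation_records(records)
for sentence_id, record in sorted(first.items()):
    kind = "original" if record.translation_of is None else "translation"
    print(sentence_id, kind)
```

Here sentence 10 ends up as an original, because its later "added as a translation" record is ignored as a duplicate.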

Guybrush88 commented 6 years ago

Concerning case 2: this Italian sentence, https://dev.tatoeba.org/eng/sentences/show/3190735 (which is much older than the English one), has the text saying that it's not possible to determine whether it's original or not, and the same text appears for the English one, https://dev.tatoeba.org/eng/sentences/show/5757823. Both sentences should be considered original, since they were added as original sentences at different times. The Italian sentence was linked only much later, when I noticed that there was a later original sentence that matched my earlier original sentence.

jiru commented 6 years ago

Guybrush88, thanks for reporting this, but I think you happened to look at these sentences while I was running yet another pass of the original/translation detection script, so most of the sentences were displaying "unknown" until it completed. Now these two sentences are reported as original.

Guybrush88 commented 6 years ago

I understand, thanks for your reply @jiru

trang commented 6 years ago

Script is running. Waiting to see on Monday if everything went okay before closing this.

ckjpn commented 5 years ago

Here is another case that may need to be considered.

Here is a German sentence that was corrected, so it took away the ownership from another member. In this case, they were both translations and not originals, but this same type of thing might happen in a case in which the "original" sentence's ownership is taken away by a non-original.

https://tatoeba.org/eng/sentences/show/643837

jiru commented 5 years ago

@ckjpn I’m not sure I understand your comment.

Are you talking about the case where:

  1. User A contributes an original sentence.
  2. Corpus maintainer B changes the whole content of the sentence (not just a minor edit).
  3. Now the sentence is displayed as original and belonging to A, whereas in practice it’s more like an original sentence belonging to B.

We do not handle this case, as it’s possible for A to change the license of such a sentence. But I think B should either not do such an edit, or "steal" the ownership (if the user is marked as inactive, for example).

Or maybe you’re talking about this case:

  1. User A contributes an original sentence.
  2. User A unadopts the sentence.
  3. User B adopts the sentence and edits it.

We already handle this case by simply not allowing B to change the license of such a sentence.
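As a rough Python sketch of the two cases above: it assumes a license change is allowed only for an original sentence whose current owner is also the user who created it. The `Sentence` fields and the `can_change_license` function are hypothetical names for illustration, not the actual implementation, and the exact rule is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentence:
    creator: str          # user who first added the sentence
    owner: Optional[str]  # current owner, None if the sentence is orphaned
    is_original: bool     # True if not created as a translation


def can_change_license(sentence: Sentence, user: str) -> bool:
    """Assumed rule: only the creator, while still the owner, of an
    original sentence may change its license."""
    return (
        sentence.is_original
        and sentence.owner == user
        and sentence.creator == user
    )


# First case: A still owns the sentence, even though B rewrote its content.
edited_by_maintainer = Sentence(creator="A", owner="A", is_original=True)

# Second case: A unadopted the sentence, then B adopted and edited it.
adopted_by_b = Sentence(creator="A", owner="B", is_original=True)

print(can_change_license(edited_by_maintainer, "A"))
print(can_change_license(adopted_by_b, "B"))
```

In the first case A can still change the license, which is the situation we do not handle. In the second case B is refused because B did not create the sentence.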

ckjpn commented 5 years ago

User A contributes a sentence with an error. User B contributes the same sentence, at a later date, without the error. User A either corrects the sentence or unadopts it and another member adopts it and corrects it.

It now matches the correct sentence that User B contributed and Horus deletes it.

If User B's sentence was originally submitted as CC0, and User A's sentence is CC-BY, then a CC0 sentence becomes a CC-BY. The reverse could happen, too.

I don't know if this is a big problem, but it's something that likely needs to be considered.