Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Add a special tag to the sentences from the Tanaka Corpus #2098

Closed: alanfgh closed this issue 3 years ago

alanfgh commented 4 years ago

There's a discussion going on about the handling of sentences from the Tanaka Corpus within Tatoeba (see sentence #329287). I think there are at least two reasons why these sentences should be specially tagged:

(1) The way they were produced is distinct from the way that sentences are currently added to the corpus, so it would be helpful to be able to see, when looking at a particular sentence, whether it was produced in this way.

(2) There exists a special index for these sentences, and they are meant to satisfy a special goal (providing examples of usage of particular terms). This suggests that they should be handled differently.

As can be seen from the exported index file, there are 150718 sentences at Tatoeba that belong to the Tanaka Corpus. The file contains their IDs and the text of the Japanese member of each pair.

Could we perform a one-time operation to tag these sentences?

trang commented 4 years ago

Done.

https://tatoeba.org/eng/tags/show_sentences_with_tag/4970

alanfgh commented 4 years ago

Excellent! That was quick!

trang commented 4 years ago

That's something I wanted to do myself a while ago. I just took this occasion to finally do it :)

Guybrush88 commented 4 years ago

Something probably went wrong: the number of pages looks off compared to the number of sentences when browsing all the sentences with that tag (https://tatoeba.org/ita/tags/show_sentences_with_tag/4970). There are 30,144 pages for 301,436 sentences, and the last pages show cryptic sentences with an unknown language and "Click to edit" as their text (for example here: https://tatoeba.org/ita/tags/show_sentences_with_tag/4970?page=30144). Opening any of these empty sentences leads to a random sentence in an arbitrary language.

AndiPersti commented 4 years ago

There's something weird going on with that tag search: if you look at page 20 of the tag search sorted by sentence id (not logged in, or with 10 sentences/page), you'll notice that sentence 2307 appears four times, but only the first box shows all translations. There are four different Japanese translations in the Tanaka Corpus for this sentence (I downloaded examples.utf from http://www.edrdg.org/wiki/index.php/Tanaka_Corpus#Downloads):

$ grep -F 'He laughs best who laughs last.' examples.utf
A: 早まって喜ぶな、最後に笑える者が勝ち。       He laughs best who laughs last.#ID=2307_140534
A: 最後に笑う者の笑いが最上。   He laughs best who laughs last.#ID=2307_170570
A: 最後に笑う者が最も良く笑う。 He laughs best who laughs last.#ID=2307_170571
A: 最後に笑う者が一番よく笑う。 He laughs best who laughs last.#ID=2307_170572

And just below, 2308 appears twice. (There are two Japanese translations for that sentence in the Tanaka Corpus).

So there is either a bug in the code or some problem in the database.
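To see how many sentences are affected by this kind of repetition, the IDs at the end of each line can be counted. The following is only a minimal sketch: it assumes, as the grep output above suggests, that each "A:" line ends in `#ID=<first>_<second>` and that the number before the underscore identifies the shared English sentence. The function name `count_pairs_per_first_id` and the embedded sample are illustrative, not part of Tatoeba's code.

```python
import re
from collections import Counter

# Matches the trailing "#ID=<first>_<second>" marker of an "A:" line.
ID_PATTERN = re.compile(r"#ID=(\d+)_(\d+)\s*$")


def count_pairs_per_first_id(lines):
    """Count how many "A:" lines share the same first ID."""
    counts = Counter()
    for line in lines:
        if not line.startswith("A: "):
            continue  # skip any line that is not an "A:" example line
        match = ID_PATTERN.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# The four lines quoted above, used as sample input.
SAMPLE = [
    "A: 早まって喜ぶな、最後に笑える者が勝ち。\tHe laughs best who laughs last.#ID=2307_140534",
    "A: 最後に笑う者の笑いが最上。\tHe laughs best who laughs last.#ID=2307_170570",
    "A: 最後に笑う者が最も良く笑う。\tHe laughs best who laughs last.#ID=2307_170571",
    "A: 最後に笑う者が一番よく笑う。\tHe laughs best who laughs last.#ID=2307_170572",
]

counts = count_pairs_per_first_id(SAMPLE)
repeated = {sentence_id: n for sentence_id, n in counts.items() if n > 1}
print(repeated)  # {2307: 4}
```

On the full file, the same function could be fed `open("examples.utf", encoding="utf-8")`; any ID with a count above 1 would be a candidate for showing up more than once in the tag listing.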

trang commented 4 years ago

Well I rushed this one a bit too much.

In my mind it was just a matter of tagging all the sentences whose id is equal to the sentence_id or to the meaning_id in sentence_annotations. But the reality is that in some cases, the meaning_id is not set to an actual sentence id but to 0 or -1. That's why there are all these "empty" sentences. The other reality is that not all sentences in sentence_annotations are from the Tanaka Corpus. The maintainers of sentence_annotations added indices for sentences that were contributed by Tatoeba members. So it would be a bit incorrect to tag those as "Tanaka Corpus".
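The first of these two problems can be reproduced on a toy database. The sketch below uses Python's sqlite3 module rather than MariaDB, and the two-column table layouts and sample rows are simplified assumptions, not the real Tatoeba schema. It contrasts a naive union of sentence_id and meaning_id with a variant that keeps only ids that actually exist in sentences.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with made-up rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT);
        CREATE TABLE sentence_annotations (sentence_id INTEGER, meaning_id INTEGER);
        INSERT INTO sentences VALUES (10, 'jpn'), (11, 'jpn'), (12, 'jpn'), (20, 'eng');
        -- meaning_id 0 and -1 stand for "no linked sentence"
        INSERT INTO sentence_annotations VALUES (10, 20), (11, 0), (12, -1);
    """)
    return conn


def naive_ids(conn):
    """Every sentence_id or meaning_id, including the placeholder values."""
    rows = conn.execute(
        "SELECT sentence_id FROM sentence_annotations "
        "UNION SELECT meaning_id FROM sentence_annotations"
    )
    return sorted(row[0] for row in rows)


def existing_ids(conn):
    """Only ids that refer to a row in the sentences table."""
    rows = conn.execute("""
        SELECT s.id FROM sentences s
        WHERE s.id IN (SELECT sentence_id FROM sentence_annotations)
           OR s.id IN (SELECT meaning_id FROM sentence_annotations)
        ORDER BY s.id
    """)
    return [row[0] for row in rows]


conn = build_demo_db()
print(naive_ids(conn))     # [-1, 0, 10, 11, 12, 20]
print(existing_ids(conn))  # [10, 11, 12, 20]
```

Filtering this way only removes the "empty" entries. It does nothing about annotations that were added for sentences contributed by Tatoeba members, which is why a different source for the list of IDs was needed.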

So I dug into the depths of my old hard drives and found a backup of the Tatoeba database from around the time when it had just been migrated to the structure that it has today. Thanks to some old notes and some old emails, I was able to remember how the Tanaka Corpus was handled, and I was able to extract the sentences that were imported into Tatoeba when it was decided that Tatoeba would be the new home for the Tanaka Corpus.

The list of IDs is here: https://gist.github.com/trang/1be65f150c5c11f2ab87b2ccf6d6ab1a

For the record, there was an old Tatoeba version where sentences were stored in a table called dico. In this table, sentences were linked with each other via their ID: for instance, all sentences with id = 1 had the same meaning. The IDs from 1 to 166500 were reserved for the Tanaka Corpus.

When Tatoeba was rewritten more or less from scratch using CakePHP, the dico table was transformed into sentences and sentences_translations. In the sentences table, there was a field dico_id to keep track of the id that the sentence had back then. Funnily enough, a query on this table for dico_id = 166500 returns the sentence "Sentences past this ID are sentences added by contributors of Tatoeba Project."

MariaDB [tatoeba_test]> select * from sentences where dico_id = 166500;
+--------+------+--------------------------------------------------------------------------------+-------------+---------+---------------------+----------+---------+
| id     | lang | text                                                                           | correctness | user_id | created             | modified | dico_id |
+--------+------+--------------------------------------------------------------------------------+-------------+---------+---------------------+----------+---------+
| 119751 | en   | Sentences past this ID are sentences added by contributors of Tatoeba Project. |        NULL |       5 | 2007-09-30 16:59:47 | NULL     |  166500 |
+--------+------+--------------------------------------------------------------------------------+-------------+---------+---------------------+----------+---------+
1 row in set (0.11 sec)

So from there I extracted the IDs of all English and Japanese sentences where dico_id < 166500.

MariaDB [tatoeba_test]> select id from sentences where dico_id < 166500 and lang in ('en', 'jp') into outfile '/tmp/tanaka_sentence_ids.csv';
Query OK, 300814 rows affected (0.20 sec)

And that's the Tanaka Corpus. I will be updating the tags accordingly tomorrow or the day after.
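For anyone who wants to try the same selection without the old backup, here is a small self-contained sketch using Python's sqlite3 module. The table is reduced to four columns and every row except the marker sentence is an invented placeholder, so it only illustrates the filter, not the real data. The helper name `tanaka_sentence_ids` is hypothetical.

```python
import sqlite3

# Threshold taken from the thread: the marker sentence has dico_id = 166500.
TANAKA_DICO_LIMIT = 166500


def tanaka_sentence_ids(conn):
    """Return ids of English/Japanese sentences below the dico_id threshold."""
    rows = conn.execute(
        "SELECT id FROM sentences "
        "WHERE dico_id < ? AND lang IN ('en', 'jp') ORDER BY id",
        (TANAKA_DICO_LIMIT,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT, dico_id INTEGER)"
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [
        (1, "en", "placeholder English sentence", 1),
        (2, "jp", "placeholder Japanese sentence", 1),
        (3, "fr", "placeholder French sentence", 1),
        (119751, "en",
         "Sentences past this ID are sentences added by contributors of Tatoeba Project.",
         166500),
        (119752, "en", "placeholder contributed sentence", 166501),
    ],
)

print(tanaka_sentence_ids(conn))  # [1, 2]
```

Because the comparison is strict, the marker sentence itself is left out, and the language filter drops any row that is neither 'en' nor 'jp' even if its dico_id falls in the reserved range.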

fjay69 commented 4 years ago

It's gone. https://tatoeba.org/rus/tags/show_sentences_with_tag/4970

trang commented 4 years ago

That's because I haven't re-tagged the sentences yet. My attempt at it yesterday crashed the website, for reasons I'm not entirely sure of. I don't know yet when I will try again.

If someone needs to browse the Tanaka Corpus sentences in the meantime, you can always check them on the dev website, where I managed to tag them without issues: https://dev.tatoeba.org/eng/tags/show_sentences_with_tag/4970 (there will of course be some differences compared to the production website).

For the record, the queries I've used: https://gist.github.com/trang/f4b90f9649db6186bd708e353bf318ee

jiru commented 4 years ago

@trang What if I want to set the based_on_id attribute for these sentences? Is there a way to tell in which way the pairs were originally translated (from English to Japanese or the opposite)? What about sentences that do not have a link to the other language like #330622?

trang commented 4 years ago

I don't think it would be easy to find out which language came first. There is no information about that in Professor Tanaka's paper, nor in the EDRDG wiki page about the Tanaka Corpus.

For sentences that don't have a link to the other language, I don't have information either. Before Tatoeba became the new home for the Tanaka Corpus, the sentences were periodically re-imported. In the dico table, there was a range of IDs dedicated to this corpus, and all sentences in this range were deleted, then re-inserted from the export of the database where these sentences were maintained. Tatoeba has no log of the modifications or deletions made to these sentences at that time.

AndiPersti commented 3 years ago

I've looked into this issue and noticed two things:

  1. In csv_tanaka there are 13547 sentence ids that don't exist anymore in sentences:

    MariaDB [tatoeba]> select count(*) from csv_tanaka ct left join sentences s on ct.id = s.id where s.id is null;
    +----------+
    | count(*) |
    +----------+
    |    13547 |
    +----------+
    1 row in set (1.422 sec)

  2. There are 14 sentences in csv_tanaka whose language is neither English nor Japanese. I guess these shouldn't be tagged, should they?

    MariaDB [tatoeba]> select s.id, s.lang, s.text from csv_tanaka join sentences s on csv_tanaka.id = s.id where s.lang not in ('jpn', 'eng');
    +--------+------+-----------------------------------------------------------------------------------+
    | id     | lang | text                                                                              |
    +--------+------+-----------------------------------------------------------------------------------+
    |  18744 | spa  | Por favor mantente cercano.                                                       |
    |  18838 | epo  | Laboremo igis Jack tio, kio li estas.                                             |
    |  18841 | epo  | Laborema homo havos sukceson en la vivo.                                          |
    |  18847 | epo  | Oni ne povas esti sukcesa, se oni ne estas laborema.                              |
    |  18848 | epo  | Li malsukcesos en sia nova projekto, krom se li estos laborema.                   |
    |  18849 | epo  | Oni ne povas esti sukcesa se oni ne multe laboras.                                |
    |  18850 | epo  | Laboremo estis la ĉefa faktoro de lia nekredebla sukceso.                         |
    |  23029 | ita  | Impariamo molto dall'esperienza.                                                  |
    |  23904 | epo  | La fajro ekbrulis.                                                                |
    | 146680 | por  | Uma menina chorando abriu a porta.                                                |
    | 164914 | nld  | Op onze website, http://www.example.com, staat alle informatie die je nodig hebt. |
    | 170564 | tur  | Devenin belini kıran son saman çöpüdür.                                           |
    | 240196 | fra  | Transmets mes amitiés à ta famille.                                               |
    | 267622 | spa  | Las novelas no se están leyendo tanto como solía ser.                             |
    +--------+------+-----------------------------------------------------------------------------------+
    14 rows in set (1.710 sec)

trang commented 3 years ago

Thanks for looking into this, @AndiPersti!

  1. I'm not too surprised that 13k sentences no longer exist in sentences, considering that sentences have been deleted over the past 10+ years. The Tanaka Corpus had many weird sentences that were hard to understand and/or hard to fix, so many ended up just being deleted. Also, many of the Tanaka sentences probably still exist, but not under the same ID, because someone created a duplicate. Our deduplication algorithm favors sentences that have an owner over orphan sentences, and most of the Tanaka sentences were orphans. We could try to track down the duplicates that were kept and mark them as "Tanaka Corpus", but I'm not sure it's worth the effort. In any case, that can be done another time.

  2. The 14 sentences that are neither in English nor Japanese should indeed not be tagged.

AndiPersti commented 3 years ago

I've used the following command without problems:

MariaDB [tatoeba]> insert into tags_sentences (tag_id, user_id, sentence_id, added_time)
    -> select 4970, 5, s.id, now()
    -> from csv_tanaka ct left join sentences s on ct.id = s.id
    -> where s.id is not null and s.lang in ('eng', 'jpn');
Query OK, 287253 rows affected (1 min 14.455 sec)
Records: 287253  Duplicates: 0  Warnings: 0
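The checks and the final insert can be replayed on a toy database to see how the two filters interact. This is a sketch under stated assumptions: it uses Python's sqlite3 module instead of MariaDB (so `datetime('now')` replaces `now()`), the table layouts are reduced to the columns used in this thread, and the four sample rows are invented. Only the tag id 4970 and user id 5 come from the command above. An inner join is used in place of the left join plus `s.id is not null`, which selects the same rows.

```python
import sqlite3

TAG_ID = 4970   # "Tanaka Corpus" tag id, as used in the thread
USER_ID = 5     # user id used in the thread's insert command


def build_demo_db():
    """Create an in-memory database with made-up rows (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE csv_tanaka (id INTEGER PRIMARY KEY);
        CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT);
        CREATE TABLE tags_sentences (
            tag_id INTEGER, user_id INTEGER, sentence_id INTEGER, added_time TEXT
        );
        INSERT INTO csv_tanaka VALUES (1), (2), (3), (4);
        -- id 4 is deliberately missing; id 3 has an unexpected language
        INSERT INTO sentences VALUES (1, 'eng'), (2, 'jpn'), (3, 'epo');
    """)
    return conn


def count_missing(conn):
    """Anti-join: ids listed in csv_tanaka that no longer exist in sentences."""
    return conn.execute(
        "SELECT COUNT(*) FROM csv_tanaka ct "
        "LEFT JOIN sentences s ON ct.id = s.id WHERE s.id IS NULL"
    ).fetchone()[0]


def tag_sentences(conn):
    """Tag surviving English/Japanese sentences and return the tagged ids."""
    conn.execute(
        "INSERT INTO tags_sentences (tag_id, user_id, sentence_id, added_time) "
        "SELECT ?, ?, s.id, datetime('now') "
        "FROM csv_tanaka ct JOIN sentences s ON ct.id = s.id "
        "WHERE s.lang IN ('eng', 'jpn')",
        (TAG_ID, USER_ID),
    )
    rows = conn.execute(
        "SELECT sentence_id FROM tags_sentences ORDER BY sentence_id"
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(count_missing(conn))   # 1
print(tag_sentences(conn))   # [1, 2]
```

In the real run the same arithmetic holds: 300814 ids in the list, minus 13547 that no longer exist, minus 14 in other languages, gives the 287253 rows reported above.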

I've also updated the indexes so that the sentences are found using the advanced search: Advanced search for English sentences with tag "Tanaka Corpus"