Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Extensions to the Japanese sentence index tool #3093

Open JMdictProject opened 6 months ago

JMdictProject commented 6 months ago

Context

Tatoeba has a tool for editing the term indices associated with about 140,000 Japanese sentences (see https://tatoeba.org/en/sentence_annotations/show/210739). These indices are used by many apps and sites (such as Jisho.org and WWWJDIC) for linking sentences to dictionary entries, usually in the JMdict dictionary. The editing tool is great, but there are a couple of areas where additions would be a significant help. As people know, Trang is not in a position to make these changes, and she has suggested raising the matter here.

Extension 1: extending the search-and-replace function

The tool has a global search-and-replace function which is very useful. Unfortunately, it is not possible to specify that a search term is at the start of the set of indices. For example, about 130 sentences start with the single-kanji noun "車" (kuruma/sha, meaning car, vehicle, etc.). See https://tatoeba.org/en/sentence_annotations/show/148936 and https://tatoeba.org/en/sentence_annotations/show/148941. It is not currently possible to search for sentences starting with 車 alone, as the search will also find all the sentences with multi-kanji terms finishing with 車. What would be great is the ability to state that the search term has no preceding characters, e.g. something like ^車 in a regular expression.
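To illustrate the difference being asked for, here is a minimal Python sketch. It is not Tatoeba's code, which this thread does not show. It assumes an index line is a space-separated list of terms, where a term may carry `(reading)`, `[sense]`, `{surface form}` or `~` suffixes, as I understand the Tanaka Corpus index format. The sample lines and the helper names `first_headword` and `starts_with_term` are made up for the example.

```python
import re

# Hypothetical index lines, invented for illustration only.
samples = [
    "車(くるま) は 速い",
    "自動車 を 買う{買った}",
    "電車 に 乗る{乗った}",
    "車庫 に 入れる",
]

# A plain substring search finds every line containing 車,
# including multi-kanji terms such as 自動車 and 電車.
substring_hits = [s for s in samples if "車" in s]

# Anchoring with ^ keeps only lines that begin with 車,
# but still lets through longer terms such as 車庫.
anchored_hits = [s for s in samples if re.search(r"^車", s)]


def first_headword(index_line: str) -> str:
    """Return the first term of the line without its assumed suffixes."""
    tokens = index_line.split()
    if not tokens:
        return ""
    # Cut the term at the first "(", "[", "{" or "~" (assumed suffix markers).
    return re.split(r"[(\[{~]", tokens[0], maxsplit=1)[0]


def starts_with_term(index_line: str, term: str) -> bool:
    """True if the first term of the line is exactly `term`."""
    return first_headword(index_line) == term


exact_hits = [s for s in samples if starts_with_term(s, "車")]
```

With these samples, the substring search returns all four lines, the anchored search returns the 車 and 車庫 lines, and the exact first-term check returns only the line whose first term is 車 on its own. Which of the last two behaviours is wanted is a question for the requester.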

Extension 2: having a log of changes

More people are now using the tool, and we are coming across examples of mistakes being made in edits to the indices. There is currently no way of knowing who made the changes, or when they occurred. Some sort of changelog that could be downloaded weekly, and that contains the from/to details and the person who made each change, would be a great help.
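As a rough idea of what such a changelog could contain, here is a small Python sketch of one log record and a CSV export. Everything in it is an assumption made for illustration: the `IndexChange` record, its field names, the `export_changes` helper and the sample values are hypothetical and do not describe anything that exists in Tatoeba.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import datetime, timezone


@dataclass
class IndexChange:
    """One hypothetical changelog entry for an edit to a sentence's indices."""
    sentence_id: int
    username: str
    changed_at: str  # ISO 8601 timestamp, UTC
    old_index: str   # the "from" value
    new_index: str   # the "to" value


def export_changes(changes: list[IndexChange]) -> str:
    """Serialize changelog entries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(IndexChange)])
    writer.writeheader()
    for change in changes:
        writer.writerow(asdict(change))
    return buffer.getvalue()


# Made-up example entry; the ID, user and index text are not real data.
change = IndexChange(
    sentence_id=1,
    username="example_user",
    changed_at=datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
    old_index="自動車 は 速い",
    new_index="車(くるま) は 速い",
)
report = export_changes([change])
```

A weekly download would then just be this export restricted to entries whose `changed_at` falls within the past week.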

jiru commented 5 months ago

Hello, welcome, you came to the right place.

Although I have contributed to the Tatoeba source code for years, I never had a chance to get to know this sentence indices functionality. I think I understand it a bit better now, but I am curious about the process: how do you keep the indices synchronized with the sentence text, who maintains them, how do you synchronize the indices with jisho.org and other projects, and are new sentences being added to the indices?

I am also curious about the history behind it. I assume these indices came around the time the Tanaka corpus was integrated in Tatoeba, is that correct?

About extending the search and replace, I think that's doable, but I am not sure about the approach. It is rather easy to implement "starts with" and "ends with" filters, but any other kind of filtering would require a complete rewrite of the functionality. Do you think you will need other kinds of filters in the future?

Also, the overlap in functionality between Tatoeba and the Japanese indices is bugging me. You are searching for 車 in order to match it as a single character. I can see some furigana in those indices as well. If I understand correctly, you are doing tokenization and POS recognition using these indices? Are you doing all this by hand? For your information, Tatoeba already maintains furigana for every Japanese sentence. It has no tokenization yet, but this is something Tatoeba users could really benefit from. In particular, it would allow searching Tatoeba for Japanese verbs/adjectives in dictionary form, single-character words, etc.

Note that I am very happy that Tatoeba is used by other projects; I am just trying to figure out how this reuse of content is actually done and how things could be improved.

As for maintaining a log of changes, I think it would be quite difficult to implement.

Note: you asked for two different things; next time, please open a separate issue for each one. :+1:

JMdictProject commented 5 months ago

The background can be found at: https://www.edrdg.org/wiki/index.php/Tanaka_Corpus

As it explains, the indexing and initial use in dictionary projects occurred before it was incorporated into the Tatoeba project. The segmentation of the Japanese sentences is quite independent of Tatoeba's furigana markup.