JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Example sentence keywords and furigana #99

Closed: stephenmk closed this issue 10 months ago

stephenmk commented 10 months ago

I think the ex_text information provided in the JMdict_e_examp file is intended to allow developers to highlight entry keywords within example sentences. This works well most of the time, but I noticed there are some situations in which the information provided by ex_text alone isn't enough to identify the correct portions of the sentences.

Would there be an easy way to include extra information in the file to resolve these ambiguities? Furigana information would be nice too if that is available.

san

stephenmk commented 10 months ago

Seems there are 222 unique example sentences in which the key appears more than once.
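A minimal sketch of how such ambiguous cases could be detected, assuming each example is available as a plain sentence string plus its ex_text key. The function names and the sample sentence are invented for illustration and are not taken from the JMdict_e_examp file.

```python
def find_occurrences(sentence: str, key: str) -> list[int]:
    """Return the start offset of every match of key in sentence."""
    offsets = []
    start = sentence.find(key)
    while start != -1:
        offsets.append(start)
        # Step forward one character so overlapping matches are also counted.
        start = sentence.find(key, start + 1)
    return offsets


def is_ambiguous(sentence: str, key: str) -> bool:
    """True when the key alone cannot pinpoint which span to highlight."""
    return len(find_occurrences(sentence, key)) > 1


# Invented sentence: the key 人 appears inside 三人 and again on its own.
print(find_occurrences("三人の人が来た", "人"))  # [1, 3]
print(is_ambiguous("三人の人が来た", "人"))      # True
```

Running a check like this over every (sentence, key) pair is one way to arrive at a count of sentences where the key appears more than once.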

A few more examples

にん【人】 ![nin](https://github.com/JMdictProject/JMdictIssues/assets/8003332/f5537ab7-c2e0-4e6c-840d-f493d0601124)
ぎょう【業】 ![gyou](https://github.com/JMdictProject/JMdictIssues/assets/8003332/c4c65c72-999f-4509-abe3-44274c7d10c3)
ほ【歩】 ![ho](https://github.com/JMdictProject/JMdictIssues/assets/8003332/a3837923-4676-4f9b-8fb1-9e2860c3cb10)

JMdictProject commented 10 months ago

> I think the ex_text information provided in the JMdict_e_examp file is intended to allow developers to highlight entry keywords within example sentences.

It was more to provide a version of JMdict with an example sentence from the Tanaka/Tatoeba collection using the entry term. What developers do with it is up to them.

> This works well most of the time, but I noticed there are some situations in which the information provided by ex_text alone isn't enough to identify the correct portions of the sentences.

It was never intended to do so. To achieve that the Japanese sentence needs to be segmented and the term identified. This information is available in the source file from Tatoeba.

> Would there be an easy way to include extra information in the file to resolve these ambiguities?

It can be found in the Tatoeba system using the sentence number provided. I guess it would be possible to include that information in the JMdict_e_examp file if there was enough interest.

> Furigana information would be nice too if that is available.

It's not available in the source files, and is way beyond the intention of providing example sentences. Developers would have to do their own thing there.

Is there interest in adding the indices to the JMdict_e_examp file? They are documented at: https://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking
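As a rough sketch of how the indices could resolve the ambiguity: the code below assumes each index token has the shape `headword(reading)[sense]{surface form}~`, where every part after the headword is optional. That shape is my reading of the linked wiki page, so check it against the real file. The regular expression, the names `IndexToken`, `parse_indices` and `locate`, and the sample line are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed token shape: headword(reading)[sense]{surface}~ with optional parts.
TOKEN_RE = re.compile(
    r"(?P<headword>[^(\[{~]+)"
    r"(?:\((?P<reading>[^)]*)\))?"
    r"(?:\[(?P<sense>\d+)\])?"
    r"(?:\{(?P<surface>[^}]*)\})?"
    r"(?P<checked>~)?"
)


class IndexToken(NamedTuple):
    headword: str
    reading: Optional[str]
    sense: Optional[int]
    surface: str      # form as written in the sentence (defaults to headword)
    checked: bool     # True when the token carries a trailing "~"


def parse_indices(index_line: str) -> list[IndexToken]:
    """Parse a space-separated index line into tokens."""
    tokens = []
    for raw in index_line.split():
        m = TOKEN_RE.fullmatch(raw)
        if m is None:
            continue  # skip anything this simplified pattern does not cover
        sense = int(m["sense"]) if m["sense"] else None
        surface = m["surface"] or m["headword"]
        tokens.append(IndexToken(m["headword"], m["reading"], sense,
                                 surface, m["checked"] is not None))
    return tokens


def locate(sentence: str, tokens: list[IndexToken]) -> list[tuple[IndexToken, int, int]]:
    """Find each token's surface form in order, scanning left to right."""
    spans = []
    cursor = 0
    for tok in tokens:
        start = sentence.find(tok.surface, cursor)
        if start == -1:
            continue  # surface form not found after the cursor; leave it out
        end = start + len(tok.surface)
        spans.append((tok, start, end))
        cursor = end
    return spans


# Invented sentence and index line, for illustration only.
tokens = parse_indices("三人(さんにん) の 人(ひと)[01] が 来る{来た}")
for tok, start, end in locate("三人の人が来た", tokens):
    print(tok.headword, start, end)
```

Because the tokens are matched in sentence order, the standalone 人 resolves to the span starting at offset 3 rather than to the 人 inside 三人, which a plain substring search for the key cannot distinguish.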

stephenmk commented 10 months ago

I didn't think to use the indices in the separate examples file. Thanks, that should be enough to get what I'm interested in.

It might be slightly more convenient to have the indices included in the JMdict_e_examp file, but it's not a big hassle to fetch the separate file either.

> They are documented at: https://www.edrdg.org/wiki/index.php/Sentence-Dictionary_Linking

Looks like the links to the files need to be updated to point to the new EDRDG FTP server.

JMdictProject commented 10 months ago

Thanks for pointing out that those links were to the former Monash FTP site. I've amended them now. I think this issue can be closed.

stephenmk commented 8 months ago

> > Furigana information would be nice too if that is available.
>
> It's not available in the source files, and is way beyond the intention of providing example sentences. Developers would have to do their own thing there.

Just want to note that furigana information is in fact available on Tatoeba for nearly all priority-tagged Tanaka corpus sentences (just a couple dozen are missing; I might go fill them in myself). Not sure how many were done by a machine, but it looks like ~the majority~ a fifth of the records have a user name attributed.

The file can be fetched from their downloads page in the "Transcriptions" section at the bottom.

JMdictProject commented 8 months ago

Yes, there is furigana information within Tatoeba but it's not included in the sentences+indices file which is the basis of the ex_text elements. I guess if someone were keen enough they could merge the two.

AFAIK the furigana information within Tatoeba has been automatically generated using a morphological analyzer. I think there is a facility for later human edits. I don't think the person whose name is associated with any particular sentence has been responsible for the furigana.
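For anyone keen enough to merge the two, a minimal sketch might look like the following. It assumes the Tatoeba transcriptions export is tab-separated with five columns (sentence id, language, script, username, transcription) and that examples are already keyed by Tatoeba sentence number. Both are assumptions to verify against the actual downloads, and the function names and sample rows are invented.

```python
import csv
import io
from typing import Optional


def load_transcriptions(tsv_text: str, lang: str = "jpn") -> dict[int, str]:
    """Map sentence id -> transcription for one language (assumed 5-column TSV)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    transcriptions = {}
    for row in reader:
        if len(row) != 5 or not row[0].isdigit():
            continue  # skip rows that do not fit the assumed layout
        sentence_id, row_lang, _script, _username, text = row
        if row_lang == lang:
            transcriptions[int(sentence_id)] = text
    return transcriptions


def merge_furigana(
    examples: dict[int, str], transcriptions: dict[int, str]
) -> dict[int, tuple[str, Optional[str]]]:
    """Attach a transcription (or None) to each example, joined on sentence id."""
    return {sid: (sentence, transcriptions.get(sid))
            for sid, sentence in examples.items()}


# Invented rows; the transcription text is a placeholder passed through as-is.
sample = "1\tjpn\tHrkt\t\tさんにんのひとがきた\n2\tcmn\tLatn\t\tni hao\n"
merged = merge_furigana({1: "三人の人が来た", 9: "テスト"},
                        load_transcriptions(sample))
print(merged[1])
print(merged[9])
```

Sentences with no transcription keep `None`, so a consumer can fall back to its own furigana generation for those.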

stephenmk commented 8 months ago

> AFAIK the furigana information within Tatoeba has been automatically generated using a morphological analyzer.

I believe that is correct. When you browse to this information on specific Tatoeba entries, the website will warn you that unreviewed furigana information has been generated by a machine.

> I think there is a facility for later human edits.

This information can indeed be added, edited, and reviewed on Tatoeba.

Regarding the names associated with sentences, Tatoeba says this:

> A username associated with a transcription indicates the user who last reviewed and possibly modified it. A transcription without a username has not been marked as reviewed.

I was mistaken when I said the majority of the priority Tanaka sentences have usernames. Of the roughly 25,800 unique sentences included in the JMdict_e_examp file, only about 5,500 have furigana that has been reviewed by users.
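A count like this could be reproduced roughly as sketched below, using the rule quoted above that a transcription without a username has not been reviewed. The function name and sample data are hypothetical. Treating `\N` as an empty username is a guess about how the export might encode missing values, not something confirmed here.

```python
def reviewed_share(usernames_by_id: dict[int, str],
                   example_ids: set[int]) -> tuple[int, int]:
    """Return (reviewed, total) for the given example sentence ids.

    A sentence counts as reviewed when its transcription has a non-empty
    username. Ids with no transcription at all count as unreviewed.
    """
    unreviewed_markers = ("", r"\N")  # "\N" as a null marker is an assumption
    reviewed = sum(
        1 for sid in example_ids
        if usernames_by_id.get(sid, "") not in unreviewed_markers
    )
    return reviewed, len(example_ids)


# Invented data: two of four example sentences have a reviewer attached.
reviewed, total = reviewed_share({1: "alice", 2: "", 3: "bob"}, {1, 2, 3, 4})
print(f"{reviewed}/{total} reviewed ({reviewed / total:.0%})")  # 2/4 reviewed (50%)
```

With the figures given above, 5,500 out of 25,800 works out to about 21%, in line with the "a fifth" estimate.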

stephenmk commented 7 months ago

After many trials and tribulations, I have furigana and keyword highlighting functioning to my satisfaction.

More details here: https://github.com/stephenmk/Jitendex/discussions/21
