Doublevil / JmdictFurigana

A Japanese dictionary resource that attaches furigana to individual words

27 "valid" entries present in the v1 not in the v3 anymore! #5

Closed yayoo1971 closed 7 years ago

yayoo1971 commented 7 years ago

Hello,

I've done some processing on my consolidated DB (which holds entries from sources other than JMdict, including some reliable professional French data), and the following 27 valid entries, present in v1, have simply disappeared from v3. Here they are:

"御兄さん","おにいさん","0:お;1:にい","jmdict"
"御姉さん","おねえさん","0:お;1:ねえ","jmdict"
"御母さん","おかあさん","0:お;1:かあ","jmdict"
"抑抑","そもそも","0:そも;1:そも","jmdict"
"犇犇","ひしひし","0:ひし;1:ひし","jmdict"
"険しい路","けわしいみち","0:けわ;3:みち","jmdict"
"芝生","しばふ","0-1:しばふ","jmdict"
"純日本風","じゅんにほんふう","0:じゅん;1-2:にほん;3:ふう","jmdict"
"真珠湾","しんじゅわん","0:しん;1:じゅ;2:わん","jmdict"
"草履","ぞうり","0-1:ぞうり","jmdict"
"大和魂","やまとだましい","0-1:やまと;2:だましい","jmdict"
"竹刀","しない","0-1:しない","jmdict"
"東京湾","とうきょうわん","0:とう;1:きょう;2:わん","jmdict"
"日本学者","にほんがくしゃ","0-1:にほん;2:がく;3:しゃ","jmdict"
"日本製","にほんせい","0-1:にほん;2:せい","jmdict"
"日本側","にほんがわ","0-1:にほん;2:がわ","jmdict"
"日本刀","にほんとう","0-1:にほん;2:とう","jmdict"
"日本風","にほんふう","0-1:にほん;2:ふう","jmdict"
"木ノ葉","このは","0:こ;2:は","jmdict"
"木ノ葉","きのは","0:き;2:は","jmdict"
"余所見","よそみ","0:よ;1:そ;2:み","jmdict"
"嗹","れん","0:れん","jmdict"
"愈愈","いよいよ","0:いよ;1:いよ","jmdict"
"偶偶","たまたま","0:たま;1:たま","jmdict"
"益益","ますます","0:ます;1:ます","jmdict"
"風邪薬","かぜぐすり","0-1:かぜ;2:ぐすり","jmdict"
"日独協会","にちどくきょうかい","0:にち;1:どく;2:きょう;3:かい","jmdict"

I therefore recommend that you consolidate your DB and run some SQL checks each time you download a new JMdict file or regenerate the furigana, as I don't know whether the mismatch comes from the newer JMdict file itself or from the JmdictFurigana processing.
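
A minimal sketch of such a check, assuming the (word, reading) pairs of both versions have been loaded into SQLite tables. The table names, column names, and the function name are made up for illustration and are not part of JmdictFurigana:

```python
import sqlite3

def missing_in_new(old_rows, new_rows):
    """Return the (word, reading) pairs found in old_rows but not in new_rows."""
    con = sqlite3.connect(":memory:")
    # Hypothetical schema: one table per version of the data.
    con.execute("CREATE TABLE old_entries (word TEXT, reading TEXT)")
    con.execute("CREATE TABLE new_entries (word TEXT, reading TEXT)")
    con.executemany("INSERT INTO old_entries VALUES (?, ?)", old_rows)
    con.executemany("INSERT INTO new_entries VALUES (?, ?)", new_rows)
    # EXCEPT keeps the rows of the first SELECT that the second one lacks.
    rows = con.execute(
        "SELECT word, reading FROM old_entries "
        "EXCEPT SELECT word, reading FROM new_entries"
    ).fetchall()
    con.close()
    return set(rows)

old = [("芝生", "しばふ"), ("竹刀", "しない"), ("日独協会", "にちどくきょうかい")]
new = [("竹刀", "しない")]
print(missing_in_new(old, new))
```

Run after each regeneration, a query like this would list every entry that vanished, whether the cause is the JMdict file or the processing.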

Anyhow, let me take the opportunity of this issue report to thank you VERY MUCH for your GREAT contribution!!! Cheers.

Doublevil commented 7 years ago

Hello,

Thank you for your detailed report. I'll look into these entries when I have some time, and I'll try to investigate next weekend. There's a small number of tests I run whenever I upload a file, but I'm not sure how I could prevent stuff like that from happening in the future.

What do you mean by "consolidate your db"? I'm only using the JMDict XML file as a data source (plus some text files for special readings and the like).

At a glance, the issues you report seem to be related to my replacing the very small special readings database file with a more comprehensive one. It's just strange that the new one doesn't contain some of the most obvious readings, like 姉 -> ねえ.

Now for the 日本 case: I previously classified it as a special reading, 0-1:にほん, but when I noticed it wasn't in the comprehensive special readings list, I purposefully removed it from my list. I was pretty sure it would then be treated as 0:に|1:ほん, though. My bad! I'll look into it as well to determine which one is correct.
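
To make the two candidate segmentations for 日本 concrete, here is a rough sketch that parses the notation used in the rows of the report (written with the `;` separator those rows use) and rebuilds the reading from it. The function names are hypothetical, and this is not the project's actual C# code:

```python
def parse_furigana(spec):
    """Parse a spec such as '0-1:にほん;2:とう' into (start, end, reading) tuples."""
    segments = []
    for part in spec.split(";"):
        span, reading = part.split(":", 1)
        start, _, end = span.partition("-")
        # A single index like '2' covers exactly one character.
        segments.append((int(start), int(end or start), reading))
    return segments

def rebuild_reading(text, segments):
    """Rebuild the full reading; characters outside any segment are kept as-is."""
    out, pos = [], 0
    for start, end, reading in segments:
        out.append(text[pos:start])  # kana not covered by a segment
        out.append(reading)
        pos = end + 1
    out.append(text[pos:])
    return "".join(out)

special = parse_furigana("0-1:にほん;2:とう")   # 日本 read as one unit
regular = parse_furigana("0:に;1:ほん;2:とう")  # one reading per kanji
print(rebuild_reading("日本刀", special) == rebuild_reading("日本刀", regular))
```

Both specs rebuild to にほんとう, so the reading alone cannot decide between them; that has to come from the special readings list. Note that this naive rebuild would not match for 木ノ葉, where the katakana ノ in the word corresponds to a hiragana の in the reading.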

BlueRaja commented 7 years ago

> I'm not sure how I could prevent stuff like that from happening in the future.

Unit tests are the usual way

Doublevil commented 7 years ago

> Unit tests are the usual way

Sure, and I wrote a couple. But I did not happen to have a test for any item on the list.
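
One way to turn the reported list into a safety net would be a regression test over known-good entries. The sketch below is only an illustration in Python: `EXPECTED`, `find_regressions`, and the stand-in `generated` data are assumptions, and a real test would load the freshly generated output file instead.

```python
import unittest

# A few known-good entries from the report: (word, reading) -> furigana spec.
EXPECTED = {
    ("御兄さん", "おにいさん"): "0:お;1:にい",
    ("芝生", "しばふ"): "0-1:しばふ",
    ("竹刀", "しない"): "0-1:しない",
}

def find_regressions(generated, expected=EXPECTED):
    """Return the expected entries that are missing or changed in `generated`."""
    return {
        key: generated.get(key)
        for key, spec in expected.items()
        if generated.get(key) != spec
    }

class KnownEntriesTest(unittest.TestCase):
    # Stand-in for the generated output; replace with the real data when testing.
    generated = dict(EXPECTED)

    def test_known_entries_survive(self):
        self.assertEqual(find_regressions(self.generated), {})
```

Running it with `python -m unittest` after each build would flag any listed entry that disappears or changes its segmentation, and each new bug report can add a line to `EXPECTED`.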

yayoo1971 commented 7 years ago

I have just downloaded the latest "JMdict_e" XML file (174,502 entries) to check whether the 27 entries are present.

御兄さん (first entry in the list above): present ---> entry 1001830, rowid 131 (SQLite).
日独協会 (last entry in the list above): not present anymore.

Additional random check:

木ノ葉 (an alternative reading of entry 1534560, rowid 46746 --> 木の葉) is not present, although it is present in Tagaini Jisho 1.0.3, one example of a program that uses JMdict data.

This confirms that even if your new readings DB had some small influence on the processing, the JMdict file was also modified. There seems to be no change history for entries.

On my side, I simply consolidate the DB I have. In other words, as time goes by, the DB only grows.

Doublevil commented 7 years ago

I'm not sure I get what you mean, but if it helps, I do update the JMDict XML file every time I publish an update, to get the most recent data (because JMDict is constantly evolving). Of course, everyone is free to compile the C# project, replace the JMDict XML file with a newer one, and run the batch to get a newer furigana output file.

yayoo1971 commented 7 years ago

Let me clarify. You're quite right to download and use the latest JMdict DB. I've chosen to consolidate MY DB not only because I keep translating some entries into French, but also because I want to keep valid entries that have been removed from JMdict for whatever reason. I also use data from other sources, so I should compile your project to get what I need, yep.

fasiha commented 7 years ago

I’d hoped that JMDICT’s database tool would let you search for items that were deleted from JMDICT (maybe they left some rationale for why an entry was deleted), but alas, I can’t find 日独協会 anywhere in it. It might be worth searching the mailing list, or writing to it, to ask where entry removals are discussed. Thanks for bringing this to wider attention @yayoo1971.

Doublevil commented 7 years ago

This is fixed in release 1.4. It was related to multiple issues with the new "special readings" source introduced in 1.3. Other related errors may have slipped through, though, so don't hesitate to report any you catch.