JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Repetitions in Russian definitions #38

Closed ghost closed 2 years ago

ghost commented 2 years ago

The Russian definitions contain a lot of duplicated text. For example, for the word 私用 we have:

Screen Shot 2021-09-06 at 10 10 26

I highlighted each piece of text that appears twice in the same color. It looks as if the definitions were taken from the same source but added twice.

Here's another example, the word 遠慮:

Screen Shot 2021-09-06 at 10 23 08

An example of a word with multiple 'senses' but without repetitions, the word 詰まり:

Screen Shot 2021-09-06 at 10 36 17

An example of a word with a single 'sense' (that is, without 1) ... 2) ...), the word 言い換える (PS: with a single 'sense' we never have any repetitions):

Screen Shot 2021-09-06 at 10 37 45

What we can do:

  1. If a word has more than one 'sense' (i.e. it has 1) ... 2) ...), we can simply erase everything that comes before 1).
  2. If a word has only one 'sense', it won't have any repeated text, so we can leave it as is.
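
As a rough illustration of these two rules, here is a minimal Python sketch. It assumes sense markers always look like 1), 2), ... and that a multi-sense definition contains both 1) and 2). The function name drop_preamble is made up for the example, and so is any sample input with a duplicated prefix; this is not an actual patch.

```python
import re

# A marker "1)" or "2)" that is not glued to a preceding word character
# or "(", so "11)" and the "(1)" style are not mistaken for it.
FIRST_MARKER = re.compile(r"(?<![\w(])1\)")
SECOND_MARKER = re.compile(r"(?<![\w(])2\)")


def drop_preamble(definition: str) -> str:
    """Remove everything before "1)" when the definition has several senses.

    Definitions without both "1)" and a later "2)" are treated as
    single-sense and returned unchanged.
    """
    first = FIRST_MARKER.search(definition)
    if first is None:
        return definition
    if SECOND_MARKER.search(definition, first.end()) is None:
        return definition
    return definition[first.start():]
```

Single-sense definitions pass through untouched, which matches the observation that they never contain repetitions.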

It would also be nice to split 'senses' the same way as for English. That can be done easily. Splitting 'glosses' probably wouldn't be as easy, so we can leave that as is for the moment.
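
Splitting on the numbered markers could look roughly like the sketch below. It assumes every sense starts with a marker such as 1) or 10) and that any text before the first marker can be discarded; split_senses is a hypothetical name, not part of any existing script.

```python
import re

# A numbered sense marker such as "1)" or "10)", plus trailing spaces.
# The lookbehind skips markers glued to a word character or "(".
SENSE_MARKER = re.compile(r"(?<![\w(])\d+\)\s*")


def split_senses(definition: str) -> list[str]:
    """Split a definition into one string per numbered sense.

    A definition with no markers is returned as a single-element list.
    """
    parts = SENSE_MARKER.split(definition)
    # parts[0] is whatever came before the first marker, so skip it.
    senses = [p.strip(" ;/") for p in parts[1:]]
    senses = [s for s in senses if s]
    return senses or [definition.strip()]
```

Each returned string still holds several glosses separated by commas and semicolons, which is the part that is harder to split reliably.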

Please share your thoughts. If all is good, I can write a script and submit a patch for these changes myself.

JMdictProject commented 2 years ago

Some background on this. The Japanese-Russian data comes from the Warodai project. There's a short overview at https://www.edrdg.org/wwwjdic/wwwjdicinf.html#dicfil_tag (scroll down to the Japanese-Russian section.) The material in JMdict comes straight from an "EDICT" format conversion done by Vitaly Zagrebelny. The page linked above has links both to Vitaly's page and the Warodai project page.

I suspect what you have identified is a result of Vitaly merging two or more of the original entries. The original dictionary format is like this:

わかれ【別れ・分かれ】(вакарэ)〔1;81;33〕 1) отделение; ответвление, разветвление; рукав (реки); 2) расставание, разлука; 別れの盃 прощальный бокал; 別れを告げる прощаться.
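
To make the layout of that sample concrete, here is a small Python sketch that pulls it apart. It assumes every entry follows exactly the shape shown above: kana, optional written forms in 【】 separated by ・, a transliteration in parentheses, a reference in 〔〕, then the body. The real Warodai source may well differ, and the field names are invented for the example.

```python
import re

# Pattern for the sample layout only; not a full Warodai parser.
WARODAI_ENTRY = re.compile(
    r"^(?P<kana>[^【(〔\s]+)"       # reading in kana
    r"(?:【(?P<kanji>[^】]*)】)?"    # optional written forms
    r"\((?P<translit>[^)]*)\)"     # Cyrillic transliteration
    r"〔(?P<ref>[^〕]*)〕\s*"        # reference numbers
    r"(?P<body>.*)$"               # senses and examples
)


def parse_warodai_entry(line: str) -> dict:
    """Split one entry line into its header fields and body."""
    m = WARODAI_ENTRY.match(line.strip())
    if m is None:
        raise ValueError("line does not match the assumed layout")
    kanji = m.group("kanji")
    return {
        "kana": m.group("kana"),
        "kanji": kanji.split("・") if kanji else [],
        "translit": m.group("translit"),
        "ref": m.group("ref"),
        "body": m.group("body"),
    }
```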

I think the best course is to contact Vitaly and raise the issue with him. As it's a file derived from another project, I'm reluctant to make local changes here. If you can't contact him via his page, I have a recent email address.

Jim

ghost commented 2 years ago

@JMdictProject Thank you for the quick reply. Sure, I can try to contact him. But just to be sure, I'd like to clarify one thing first.

You said you're using the Russian definitions from the EDICT-format file. So you probably have some kind of script to import this file, right?

I inspected this file and compared its data with what we end up with in JMdict. Here it is:

Screen Shot 2021-09-06 at 13 05 18

Here the 私用 line/entry already has all the definitions. The lines 私用する, 私用の and 私用で are just redundant information.

So it's possible that when your 'import script' cannot find an entry for 私用する in JMdict, it just drops the する part and tries to find 私用. If that is found, it inserts the definition from 私用する. But that definition is already there.

And you probably have the same process for 〜の (and maybe for some other particles like 〜な, etc.).

Maybe it will be easier to just ignore those conditions in your import script? 🤔

JMdictProject commented 2 years ago

Yes, you are quite correct. The script that imports the translations from the other dictionaries is fairly aggressive about pulling in translations from Xする, Xの, etc. entries because some of those dictionaries have things like 形容動詞 entries recorded as XXな, and so on. That's what's happening here.

I can turn off that feature selectively, so that it just matches the base JMdict/EDICT forms. I have tested that approach with the Russian import, and it certainly cuts down significantly on duplicated glosses.
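
The matching behaviour described here can be sketched in Python as follows. This is not the actual import script: the function name, the strip_suffixes switch and the suffix list (taken from the forms mentioned in this thread) are all assumptions made for illustration.

```python
# Suffixes mentioned in this thread; the real script may handle others.
SUFFIXES = ("する", "の", "な", "で")


def find_entry(headword, jmdict_forms, strip_suffixes=True):
    """Return the JMdict form a foreign headword attaches to, or None.

    With strip_suffixes=True a headword such as Xする falls back to X,
    which is how glosses from Xする, Xの, ... end up on the X entry.
    With strip_suffixes=False only exact matches are accepted.
    """
    if headword in jmdict_forms:
        return headword
    if strip_suffixes:
        for suffix in SUFFIXES:
            base = headword[: -len(suffix)]
            if headword.endswith(suffix) and base in jmdict_forms:
                return base
    return None
```

With a form set containing only 私用, find_entry("私用する", forms) falls back to 私用, while find_entry("私用する", forms, strip_suffixes=False) finds nothing, so the duplicate glosses are never attached.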

There are still a few issues:

The next JMdict will be generated in about 2 hours from now. Have a look and see what you think of the changes to the Russian parts.

ghost commented 2 years ago

@JMdictProject Wow! This looks so much better now! Thank you 🙇

"about 200 entries don't get imported because their forms don't match the JMdict ones"

It's not for me to decide, but I think losing 200 entries (out of about 65 thousand, right?) matters little compared to the huge improvement in definition quality. And we probably already had a much larger number of words (from this EDICT source file) without a match in JMdict.

ghost commented 2 years ago

@JMdictProject I will compare the new changes more thoroughly later, but I just noticed one interesting thing. For example, for the 作る、造る、創る entry we have the following Russian translations:

<sense>
<gloss xml:lang="rus">изготовлять</gloss>
<gloss xml:lang="rus">делать</gloss>
</sense>
<sense>
<gloss xml:lang="rus">строить</gloss>
<gloss xml:lang="rus">воздвигать</gloss>
<gloss xml:lang="rus">1) ((тж.) 造る) делать, изготовлять; создавать; творить</gloss>
<gloss xml:lang="rus">2) ((тж.) 造る) строить</gloss>
<gloss xml:lang="rus">3) формировать; организовывать; учреждать</gloss>
<gloss xml:lang="rus">4) писать (книгу и т. п.)</gloss>
<gloss xml:lang="rus">5) возделывать; выращивать</gloss>
<gloss xml:lang="rus">6) (перен.) создавать</gloss>
<gloss xml:lang="rus">7) готовить {еду}</gloss>
<gloss xml:lang="rus">8) (связ.) прикрашивать</gloss>
<gloss xml:lang="rus">9) придавать (какой-л.) вид</gloss>
<gloss xml:lang="rus">10) выдумывать</gloss>
</sense>

I wonder where the first four glosses came from. I grepped the Russian EDICT file for воздвигать and couldn't find a match in any of 作る、造る、創る. It looks like it should be a translation of 新築 or 新築する 🤔 The meaning is similar though, so it's not completely incorrect.

The same goes for the word 食らう、喰らう:

<sense>
<gloss xml:lang="rus">есть</gloss>
<gloss xml:lang="rus">пить</gloss>
</sense>
<sense>
<gloss xml:lang="rus">получать</gloss>
<gloss xml:lang="rus">претерпевать</gloss>
<gloss xml:lang="rus">(прост.)</gloss>
<gloss xml:lang="rus">1) есть, жрать; пить; (обр.) жить, существовать</gloss>
<gloss xml:lang="rus">2) (перен.) получить, претерпеть</gloss>
<gloss xml:lang="rus">(ср.) くらわす【食らわす】</gloss>
</sense>

The first four glosses aren't included in the 食らう definition in the Russian EDICT file (the fifth, (прост.), is):

grep 食らう ewarodaiedict.txt 
食らう [くらう] /(прост.)/1) есть, жрать; пить; (обр.) жить, существовать/2) (перен.) получить, претерпеть/(ср.) くらわす【食らわす】/

The meaning of those first four glosses is also OK, but they probably shouldn't be there 🤔

Can you please check where those glosses came from?

PS: Before the changes to the import script, the glosses for these words were the same, so this isn't related to the recent changes.

JMdictProject commented 2 years ago

The EDICT-format version of Warodai has:

作る [つくる] /1) ((тж.) 造る) делать, изготовлять; создавать; творить/2) ((тж.) 造る) строить/3) формировать; организовывать; учреждать/4) писать (книгу и т. п.)/5) возделывать; выращивать/6) (перен.) создавать/7) готовить {еду}/8) (связ.) прикрашивать/9) придавать (какой-л.) вид/10) выдумывать/
造る [つくる] /(1) изготовлять/делать/(2) строить/воздвигать/

and

食らう [くらう] /(прост.)/1) есть, жрать; пить; (обр.) жить, существовать/2) (перен.) получить, претерпеть/(ср.) くらわす【食らわす】/
喰らう [くらう] /(1) есть/пить/(2) получать/претерпевать/

Those pairs of entries map onto single JMdict entries (作る,造る and 食らう,喰らう) so that's where the glosses are coming from. It's certainly not an ideal situation, but the fix should be done in Warodai itself.
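
A minimal Python sketch of how two EDICT lines can land on one JMdict entry is shown below. It only handles lines of the form "forms [reading] /gloss/gloss/", and the entry_forms mapping with its key "kurau" is a made-up stand-in for however JMdict entries are really identified.

```python
import re
from collections import defaultdict

# One EDICT line: written forms (";"-separated), reading, "/"-separated glosses.
EDICT_LINE = re.compile(
    r"^(?P<heads>\S+)\s+\[(?P<reading>[^\]]+)\]\s+/(?P<glosses>.*)/\s*$"
)


def parse_edict_line(line):
    """Return (written forms, reading, glosses) for one EDICT line."""
    m = EDICT_LINE.match(line)
    if m is None:
        raise ValueError("not a 'forms [reading] /glosses/' line")
    heads = m.group("heads").split(";")
    glosses = [g for g in m.group("glosses").split("/") if g]
    return heads, m.group("reading"), glosses


def merge_onto_entries(lines, entry_forms):
    """Collect glosses per entry key; entry_forms maps written form -> key."""
    merged = defaultdict(list)
    for line in lines:
        heads, _reading, glosses = parse_edict_line(line)
        # Several forms on one line may point at the same entry: add once.
        for key in {entry_forms[h] for h in heads if h in entry_forms}:
            merged[key].extend(glosses)
    return dict(merged)
```

Feeding in the 喰らう and 食らう lines with both forms mapped to the same key yields a single gloss list containing the glosses of both lines, which is the duplication seen in the XML.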

ghost commented 2 years ago

@JMdictProject Hmm... 🤔 Are we using the same version of the Warodai EDICT file? I just downloaded it and I cannot find the lines you posted:

$ grep 造る ewarodaiedict.txt 
形作くる;形造くる;形作る;形造る;容作る [かたちづくる] /образовывать; составлять; иметь форму (чего-л.)/
作る [つくる] /1) ((тж.) 造る) делать, изготовлять; создавать; творить/2) ((тж.) 造る) строить/3) формировать; организовывать; учреждать/4) писать (книгу и т. п.)/5) возделывать; выращивать/6) (перен.) создавать/7) готовить [еду]/8) (связ.) прикрашивать/9) придавать (какой-л.) вид/10) выдумывать/

$ grep 喰らう ewarodaiedict.txt
# no matches

I downloaded it from https://warodai.ru/download/ but that page says the EDICT2 file hasn't been updated recently (the last update was in 2016).

Am I downloading it from a wrong/different source than you?

JMdictProject commented 2 years ago

OK, after digging into files and records from 2016 I've worked out what I did back then. The J-R file I use in WWWJDIC and in the JMdict glosses is a combination of the ewarodaiedict file and entries from Oleg Volkov's 2005 JR-EDICT. Where that latter dictionary had entries that were not in ewarodaiedict I included them (there are about 2,400 entries in that category). That's where 喰らう came from. In fact, there are only a few such duplicates from merging the files. There are quite a few odd "glosses" as a result of the way ewarodaiedict is organized, for example, 1000310 has a cross-reference as a gloss, and 1011960 has two similar glosses and two cross-references.

I'll see if I can reduce the duplication. There may be a way to detect and remove the cross-references as they are not needed. Probably best we conduct this discussion directly. Can you contact me at [jimbreen(at)gmail.com]?
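
One possible shape for the cross-reference detection, sketched in Python. It assumes a cross-reference gloss always starts with the label (ср.), as in the 食らう line quoted earlier in this thread; treating (см.) the same way is a guess, and drop_cross_references is a hypothetical name rather than anything in the import script.

```python
import re

# A gloss that begins with "(ср.)" or "(см.)" followed by some target.
# "(см.)" does not appear in this thread and is included as an assumption.
XREF_GLOSS = re.compile(r"^\((?:ср|см)\.\)\s*\S")


def drop_cross_references(glosses):
    """Return the glosses with cross-reference-only items removed."""
    return [g for g in glosses if not XREF_GLOSS.match(g)]
```

Labels that occur in the middle of a gloss, such as (обр.) or (перен.), are left alone because the pattern is anchored at the start.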

JMdictProject commented 2 years ago

This matter has been largely resolved. I think it can be closed as an issue.