JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Balanced Corpus of Contemporary Written Japanese #85

Open stephenmk opened 1 year ago

stephenmk commented 1 year ago

I suspect this has been discussed before, but a search of the mail archive didn't turn up anything. Apologies if I missed it.

The National Institute for Japanese Language and Linguistics published a frequency analysis of its "Balanced Corpus of Contemporary Written Japanese" (BCCWJ) under a Creative Commons BY-NC-ND license. What I find particularly interesting is that its word list includes the dictionary forms of words, their readings, and detailed part-of-speech information.

There was a little bit of discussion recently about making entries in the name dictionary more useful by adding frequency information. I wonder if the BCCWJ list could be used for this purpose.

So for example, the list contains three entries for 大谷: as a surname (おおや), as a place name (おおや), and as a general proper noun (おおたに). In JMnedict, we have six different readings for 大谷 (おおがい;おおたに;おおだに;おおや;おたに;だいたに) tagged as surnames and place names. Perhaps the BCCWJ data could be used to prioritize the more common readings.

I'm not sure if using the data this way would be a violation of the "no-derivatives" aspect of the data's license, though. If not, does this sound like something worthwhile?

JMdictProject commented 1 year ago

I don't think there's been any discussion in JMnedict circles about the possible use of the BCCWJ data for indicating frequencies. It's a great resource and it would be useful to tag JMnedict entries in some way to reflect their inclusion there.

It's always been a challenge to use the mass of (often low-quality) entries in JMnedict in a meaningful way. In WWWJDIC I use a modified format where I aggregate all the readings/translations for a kanji form into a single "entry". As part of this, I use some rather old frequency information to concentrate the more common readings to the front. As an example of this, for the 大谷 entry, I put the おおたに, おおや and おおがい first. This fiddle is done for about 80k forms. It seems to work in that context, both for regular lookups and text glossing.
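
The front-loading described above can be sketched as a stable sort. This is only an illustration, not the WWWJDIC code: the function name `front_load` and the `preferred` list are assumptions, with the order for 大谷 taken from the example above.

```python
def front_load(readings, preferred):
    """Move preferred readings to the front, keeping the rest in their original order."""
    rank = {reading: i for i, reading in enumerate(preferred)}
    # sorted() is stable, so readings without a rank keep their relative order.
    return sorted(readings, key=lambda r: rank.get(r, len(rank)))


# The six JMnedict readings for 大谷, in the order listed earlier in this thread.
readings = ["おおがい", "おおたに", "おおだに", "おおや", "おたに", "だいたに"]
# Assumed priority order, standing in for the frequency data (not shown in this thread).
preferred = ["おおたに", "おおや", "おおがい"]

ordered = front_load(readings, preferred)
print(ordered)
```

Readings that are absent from `preferred` all share the same sort key, so they simply trail the preferred ones in their existing order.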

stephenmk commented 1 year ago

OK, I'm working on correlating the BCCWJ frequency data to JMnedict entries. Should have some data to share soon.

stephenmk commented 1 year ago

@JMdictProject

I found 24 duplicate entries in JMnedict that can be deleted. The below CSV file lists the sequence numbers of the entries to be deleted in the first column. They don't contain any extra information that needs to be merged elsewhere.

https://gist.github.com/stephenmk/751c1c0095752226c7bc037ea1e83e6a

JMdictProject commented 1 year ago

> I found 24 duplicate entries in JMnedict

Thanks. I've merged/deleted as appropriate.

stephenmk commented 1 year ago

Here is a CSV of ~46k JMnedict headwords and sequence numbers that I've identified as proper nouns included in the BCCWJ Word Lists (short unit and long unit v2).

I correlated the BCCWJ terms with JMnedict entries by matching their kanji forms and hiragana-normalized readings. For a word like アフリカ, which has a kanji form in JMnedict but not in the BCCWJ word list, I allowed a match only if JMnedict contained exactly one kanji form for that reading. So アフリカ matched 阿弗利加【アフリカ】, but the name いのくら found no match because JMnedict has seven different kanji forms for it. Of course, if we had an entry for いのくら without a kanji form, it would have matched that.
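
A minimal sketch of this matching rule, under stated assumptions: entries are modeled as `(seq, kanji, reading)` tuples, a BCCWJ term without a kanji form is represented by `kanji=None`, and the sequence numbers and the placeholder forms `K1` and `K2` are hypothetical. It is not the script actually used to build the CSV.

```python
from collections import defaultdict


def to_hiragana(text):
    """Shift katakana (ァ..ヶ) down to hiragana; leave ー and everything else alone."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def build_index(entries):
    """Index entries by (kanji, normalized reading) and by normalized reading alone."""
    by_pair = {}
    by_reading = defaultdict(list)
    for seq, kanji, reading in entries:
        key = to_hiragana(reading)
        by_pair[(kanji, key)] = seq
        by_reading[key].append(seq)
    return by_pair, by_reading


def match(term_kanji, term_reading, by_pair, by_reading):
    """Return the matching sequence number, or None if there is no safe match."""
    key = to_hiragana(term_reading)
    seq = by_pair.get((term_kanji, key))
    if seq is not None:
        return seq
    if term_kanji is None:
        # Fallback: accept only when the reading identifies exactly one entry.
        candidates = by_reading.get(key, [])
        if len(candidates) == 1:
            return candidates[0]
    return None


# Hypothetical sequence numbers; K1 and K2 stand in for two of the seven forms.
entries = [(1, "阿弗利加", "アフリカ"), (2, "K1", "いのくら"), (3, "K2", "いのくら")]
by_pair, by_reading = build_index(entries)

print(match(None, "アフリカ", by_pair, by_reading))
print(match(None, "いのくら", by_pair, by_reading))
```

Because the exact `(kanji, reading)` lookup runs first, a kana-only JMnedict entry (stored with `kanji=None`) would be matched directly, which mirrors the いのくら remark above.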

For names in the word list that found no match with the nakaguro (・) in place, I also tried removing the nakaguro. For example, the ラスパルマス and ラス・パルマス ("Las Palmas") records in the word list both matched JMnedict entry 5089163, ラスパルマス.
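
The nakaguro fallback might look like the sketch below. The dictionary `index` (headword to sequence number) and the function name are assumptions for illustration; only the sequence number 5089163 comes from the example above.

```python
NAKAGURO = "・"


def lookup_with_nakaguro_fallback(name, index):
    """Try the name as written, then retry with every nakaguro removed."""
    if name in index:
        return index[name]
    stripped = name.replace(NAKAGURO, "")
    if stripped != name:
        return index.get(stripped)
    return None


index = {"ラスパルマス": 5089163}

print(lookup_with_nakaguro_fallback("ラスパルマス", index))
print(lookup_with_nakaguro_fallback("ラス・パルマス", index))
```

Trying the unmodified form first means a JMnedict entry that does contain a nakaguro still wins over the stripped variant.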

I only matched on proper nouns, which had part-of-speech tags prefixed with "名詞-固有名詞-" in the word list. I included these tags in the attached data with the prefix removed.
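
As a sketch of that filter: the row layout `(lemma, reading, pos)` and the exact tag strings after the prefix are assumptions about the word list format, and the 大谷 rows follow the three entries described at the top of this thread.

```python
PREFIX = "名詞-固有名詞-"


def proper_noun_rows(rows):
    """Keep proper-noun rows and strip the shared prefix from their tags."""
    kept = []
    for lemma, reading, pos in rows:
        if pos.startswith(PREFIX):
            kept.append((lemma, reading, pos[len(PREFIX):]))
    return kept


# Illustrative rows only; the tag suffixes are assumed, not copied from the list.
rows = [
    ("大谷", "オオヤ", "名詞-固有名詞-人名-姓"),
    ("大谷", "オオヤ", "名詞-固有名詞-地名-一般"),
    ("大谷", "オオタニ", "名詞-固有名詞-一般"),
    ("辞書", "ジショ", "名詞-普通名詞-一般"),
]

for row in proper_noun_rows(rows):
    print(row)
```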

Only a handful of JMnedict headwords appear in more than one entry. E.g. 燕【えん】 has a [surname] entry and a [place] entry. In these cases I used the part-of-speech tags to find the right match.
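
One way to express that tie-break, as a sketch: the `TAG_TO_TYPE` mapping, the tag strings, and the sequence numbers 1 and 2 are all hypothetical, chosen only to mirror the 燕【えん】 example.

```python
# Assumed mapping from a BCCWJ tag (prefix removed) to a JMnedict name type.
TAG_TO_TYPE = {"人名-姓": "surname", "人名-名": "given", "地名-一般": "place"}


def pick_entry(candidates, bccwj_tag):
    """candidates: (seq, name_type) pairs that share one headword and reading."""
    if len(candidates) == 1:
        return candidates[0][0]
    wanted = TAG_TO_TYPE.get(bccwj_tag)
    hits = [seq for seq, name_type in candidates if name_type == wanted]
    # Refuse to guess when the tag does not single out exactly one entry.
    return hits[0] if len(hits) == 1 else None


# Hypothetical sequence numbers for the two 燕【えん】 entries.
candidates = [(1, "surname"), (2, "place")]

print(pick_entry(candidates, "人名-姓"))
print(pick_entry(candidates, "地名-一般"))
```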


Now the question of how to use the data. I think there are two general approaches to tagging these names in JMnedict. We could make one general "BCCWJ" tag to indicate that a form is included in the corpus, or we could also use the corpus frequency information to make a set of frequency-style tags similar to the news-frequency tags used in JMdict (nf01, nf02, etc.).

Right now I'm thinking the simpler approach is the better option. A simple "BCCWJ" tag should be enough to distinguish common name readings from uncommon name readings in most cases. Also, I'm not certain if the relative frequencies in the corpus are reliable enough to provide useful information.
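
To make the two options concrete, here is a rough sketch. The banded tag names (`bccwj01`, `bccwj02`, ...), the default band size of 500, and the sequence numbers and counts are assumptions for illustration, not a proposal for the actual tag set.

```python
def simple_tags(matched_seqs):
    """Option 1: one flat tag for every entry found in the corpus."""
    return {seq: "BCCWJ" for seq in matched_seqs}


def banded_tags(counts, band_size=500):
    """Option 2: rank entries by corpus count and tag them by band."""
    # Higher counts rank first; ties are broken by sequence number.
    ranked = sorted(counts, key=lambda seq: (-counts[seq], seq))
    return {
        seq: f"bccwj{i // band_size + 1:02d}"
        for i, seq in enumerate(ranked)
    }


# Hypothetical sequence numbers and counts.
counts = {101: 69, 102: 3, 103: 3, 104: 3}

print(simple_tags(counts))
print(banded_tags(counts, band_size=2))
```

The banded variant is only as trustworthy as the underlying counts, which is the concern raised above.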

Take for example the given name 「哲」. It is the only given name with four distinct readings in the BCCWJ word list, more than any other given name. Here's the frequency data for 哲 in the long-unit-word (LUW, v2) version of the corpus.

| Reading | BCCWJ Counts |
| --- | --- |
| あきら | 69 |
| さとる | 3 |
| さとし | 3 |
| てつ | 3 |

Judging by this data, we might expect あきら to be far more common than any other reading. However, among the first 40 search results for 「哲」 on Japanese Wikipedia, I counted the following usages.

| Reading | BCCWJ Counts | JA Wiki Counts |
| --- | --- | --- |
| あきら | 69 | 2 |
| さとる | 3 | 3 |
| さとし | 3 | 11 |
| てつ | 3 | 18 |

It's odd that there's such a large discrepancy. I checked some other names and found similar discrepancies that make me question the usefulness of the relative frequency statistics from the BCCWJ corpus. For another example, IEEE has one count for アイトリプルイー and 137 counts for アイイーイーイー in the short-unit-word list. The page for IEEE on Japanese Wikipedia lists the former reading and doesn't even mention the latter.
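
To put numbers on the 哲 discrepancy, the counts can be converted to shares within each source. This is a small sketch using only the figures from the tables above; the Wikipedia sample is tiny (34 counted usages), so the percentages are indicative at best.

```python
bccwj = {"あきら": 69, "さとる": 3, "さとし": 3, "てつ": 3}
ja_wiki = {"あきら": 2, "さとる": 3, "さとし": 11, "てつ": 18}


def shares(counts):
    """Convert raw counts to fractions of the source's total."""
    total = sum(counts.values())
    return {reading: n / total for reading, n in counts.items()}


bccwj_share = shares(bccwj)
wiki_share = shares(ja_wiki)

for reading in bccwj:
    print(
        f"{reading}: BCCWJ {bccwj_share[reading]:.0%}, "
        f"JA Wiki {wiki_share[reading]:.0%}"
    )
```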

Despite that discrepancy, I think the BCCWJ data is still useful. JMnedict currently contains seven other readings for 哲: あき, ちょる, てつじ, とおる, ひろし, まさる, and ゆたか. So by simply adding a "BCCWJ" tag to the four forms listed in the corpus, I think that's already a huge improvement in the quality of the information.
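
As a sketch of what the simple tagging would do for 哲: the readings come from this comment, while the ordering of the JMnedict list and the dictionary layout are assumptions.

```python
bccwj_readings = {"あきら", "さとる", "さとし", "てつ"}

# All eleven JMnedict readings for 哲 (order assumed).
jmnedict_readings = [
    "あきら", "さとる", "さとし", "てつ",
    "あき", "ちょる", "てつじ", "とおる", "ひろし", "まさる", "ゆたか",
]

# Attach the tag only to readings that also occur in the corpus.
tagged = {
    reading: (["BCCWJ"] if reading in bccwj_readings else [])
    for reading in jmnedict_readings
}

for reading, tags in tagged.items():
    print(reading, tags)
```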

To drive the point home, here's a comparison of distinct readings from a sample of names.

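| Proper Noun | BCCWJ Readings | JMnedict Readings |
| --- | --- | --- |
| 豊栄 | 3 | 7 |
|  | 3 | 37 |
| 江南 | 3 | 8 |
|  | 3 | 29 |
|  | 3 | 18 |
| 大平 | 3 | 15 |
| 塩谷 | 3 | 8 |
| 麻里 | 2 | 5 |
| 麻生 | 2 | 24 |
|  | 2 | 15 |
| 鹿野 | 2 | 7 |
|  | 2 | 12 |
| 高麗 | 2 | 11 |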

So those are just my current thoughts about the situation. I'm looking forward to hearing other perspectives.

JMdictProject commented 1 year ago

Thanks for all this. It looks very interesting.

I'm thinking of adding an entry-wide priority element, a bit like the priority elements in JMdict. It could have a value such as "bccwj", and I could possibly have one for the frequency information I use to order readings in WWWJDIC.

I'm afraid none of this is going to happen quickly - there are a lot of other things in the work queue.

stephenmk commented 1 year ago

No problem. I can refresh the list with the latest JMnedict data whenever it's needed.