JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Messy headword sections that can be tidied with [sK] tags #102

Status: Closed (stephenmk closed this 2 months ago)

stephenmk commented 9 months ago

I've been working on a distribution of JMdict that organizes the kanji forms from entries with multiple readings into a table.

When there are many long forms in an entry, this table flies off the screen.

Example: entry for 捕らぬ狸の皮算用 ![hatch](https://github.com/JMdictProject/JMdictIssues/assets/8003332/00056ff4-fd86-4145-8a8c-6aecf7c08802)

Since we're no longer enforcing reading restrictions due to hiragana/katakana differences, and since we now have search-only tags that we can use, all of these entries can be drastically simplified. I think this will be helpful to all apps which use JMdict data.
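
To illustrate how an app might act on search-only tags, here is a minimal Python sketch that splits an entry's kanji forms into those to display and those kept only for search. It assumes the JMdict XML layout (`k_ele`, `keb`, `ke_inf`) and that the `&sK;` entity expands to the text "search-only kanji form". The sample entry, its sequence number, and its tagging are hypothetical and do not reproduce the real entry.

```python
import xml.etree.ElementTree as ET

# Assumed expanded text of the &sK; entity in the JMdict XML.
SEARCH_ONLY = "search-only kanji form"

# Hypothetical, abbreviated entry: sequence number and tagging are illustrative only.
ENTRY_XML = """
<entry>
  <ent_seq>9000000</ent_seq>
  <k_ele><keb>捕らぬ狸の皮算用</keb></k_ele>
  <k_ele><keb>取らぬ狸の皮算用</keb></k_ele>
  <k_ele><keb>とらぬ狸の皮算用</keb><ke_inf>search-only kanji form</ke_inf></k_ele>
</entry>
"""

def split_kanji_forms(entry):
    """Return (visible, search_only) lists of kanji forms for one entry."""
    visible, search_only = [], []
    for k_ele in entry.iter("k_ele"):
        form = k_ele.findtext("keb", default="")
        tags = [inf.text for inf in k_ele.findall("ke_inf")]
        # Forms tagged search-only stay searchable but are left out of the display.
        (search_only if SEARCH_ONLY in tags else visible).append(form)
    return visible, search_only

entry = ET.fromstring(ENTRY_XML)
visible, search_only = split_kanji_forms(entry)
print(visible)
print(search_only)
```

With forms moved to the search-only list, only the visible forms would need a column in a headword table.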

As a simple criterion for finding these sorts of entries, I looked at the sum of characters in the top header row of my tables. Below is a list of entries where the sum was greater than 30. I thought I'd share it here in case JMdict editors would like to help work through them.

| sequence | character count sum |
| --- | --- |
| 2074250 | 34 |
| 2589380 | 34 |
| 2690580 | 34 |
| 2170750 | 33 |
| 2791100 | 33 |
| 2841834 | 33 |
| 1777600 | 32 |
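
The selection criterion described above could be sketched roughly as follows. This is not the script actually used. It assumes the header row of each table holds the entry's kanji forms (`keb` elements), which the post does not state explicitly, and it runs on a tiny hypothetical sample rather than the real JMdict file. The function names and sample sequence numbers are made up for illustration.

```python
import xml.etree.ElementTree as ET

THRESHOLD = 30  # entries whose character sum exceeds this are listed

# Hypothetical sample data; sequence numbers and forms are illustrative only.
SAMPLE_XML = """
<JMdict>
  <entry>
    <ent_seq>9000001</ent_seq>
    <k_ele><keb>取らぬ狸の皮算用</keb></k_ele>
    <k_ele><keb>捕らぬ狸の皮算用</keb></k_ele>
    <k_ele><keb>獲らぬ狸の皮算用</keb></k_ele>
    <k_ele><keb>とらぬ狸の皮算用</keb></k_ele>
  </entry>
  <entry>
    <ent_seq>9000002</ent_seq>
    <k_ele><keb>猫</keb></k_ele>
  </entry>
</JMdict>
"""

def header_char_sum(entry):
    """Sum the lengths of all kanji forms (assumed to form the header row)."""
    return sum(len(keb.text or "") for keb in entry.iter("keb"))

def wide_entries(root, threshold=THRESHOLD):
    """Return (sequence, character sum) pairs above the threshold, widest first."""
    rows = []
    for entry in root.iter("entry"):
        total = header_char_sum(entry)
        if total > threshold:
            rows.append((entry.findtext("ent_seq"), total))
    return sorted(rows, key=lambda row: -row[1])

root = ET.fromstring(SAMPLE_XML)
for seq, total in wide_entries(root):
    print(seq, total)
```

In this sample, the first entry has four forms of eight characters each, so its sum is 32 and it passes the threshold, while the single-character entry does not.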

Reviewed Entries

| sequence | character count sum |
| --- | --- |
| ~2102990~ | 86 |
| ~1514120~ | 72 |
| ~2836087~ | 65 |
| ~1608470~ | 58 |
| ~2784340~ | 57 |
| ~1589980~ | 50 |
| ~1603480~ | 50 |
| ~2159000~ | 50 |
| ~2836339~ | 50 |
| ~2237240~ | 49 |
| ~1595900~ | 48 |
| ~2684060~ | 48 |
| ~1961830~ | 48 |
| ~2850246~ | 46 |
| ~2417440~ | 44 |
| ~2714130~ | 44 |
| ~1855790~ | 42 |
| ~2373400~ | 42 |
| ~2850280~ | 42 |
| ~2251580~ | 41 |
| ~1850370~ | 39 |
| ~2064850~ | 39 |
| ~2018320~ | 39 |
| ~1571720~ | 38 |
| ~2059950~ | 38 |
| ~2848157~ | 38 |
| ~1601770~ | 37 |
| ~2745190~ | 37 |
| ~2779470~ | 37 |
| ~2847494~ | 37 |
| ~1463690~ | 36 |
| ~1872660~ | 36 |
| ~2255200~ | 36 |
| ~2520960~ | 36 |
| ~2578560~ | 36 |
| ~2719710~ | 36 |
| ~2850274~ | 36 |
| ~1342030~ | 34 |
| ~1387250~ | 34 |
| ~1597510~ | 34 |
| ~1640500~ | 34 |
| ~1783420~ | 34 |
| ~2394150~ | 34 |
| ~1399890~ | 33 |
| ~2083740~ | 33 |
| ~2275120~ | 33 |
| ~2420070~ | 33 |
| ~2202560~ | 32 |
| ~2398920~ | 32 |
| ~2664780~ | 32 |
| ~2826504~ | 32 |
| ~2829858~ | 32 |
| ~1457250~ | 31 |
| ~1796330~ | 31 |
| ~2040040~ | 31 |
| ~2083280~ | 31 |
| ~2120210~ | 31 |
| ~2124980~ | 31 |
| ~2420190~ | 31 |
| ~2756410~ | 31 |
| ~2836605~ | 31 |
| ~2845765~ | 31 |

JMdictProject commented 2 months ago

I think this batch has all been cleaned up.