JMdictProject / JMdictIssues

JMdict Japanese dictionary - lexicographic, etc. issues management

Identifying the oldest EDICT entries that have gone unedited for 25+ years #59

Open Marcusjmdict opened 2 years ago

Marcusjmdict commented 2 years ago

Going through the Meikyo vs/vt multi-sense list, I've noticed that quite a few older entries in JMdict have glosses of what should obviously be different senses all jumbled together. In a comment on the entry for 書院, Jim mentioned that the feature of different senses in a single entry only arrived "several years into the project." Would it be possible to somehow identify entries that were added prior to this and haven't been edited even once since then? If we could produce a list of those and put it online, it'd be easy to go through it using Yomichan etc. to find entries that are in need of a sense split.

JMdictProject commented 2 years ago

I'm not sure it's possible to do much in that area that would be very productive. Sense tagging emerged in EDICT in the mid-90s as "(2) xxxx yyy", etc., and these were turned into elements when it was converted to JMdict format in 1999. I added a date to new entries in 2003, and the database system changes that whenever it is edited. We changed to the online database in mid-2010.

I think a better approach would be to reverse the process. Find a good source with senses marked, e.g. Meikyo, and run a match of its multisense entries against JMdict to get a list of entries that only have one sense.
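A minimal sketch of that reverse match, in Python. It assumes the reference dictionary's multi-sense headwords are already available as a plain set of strings (how to extract them from Meikyo is not covered here), and it relies only on JMdict's XML layout of `entry`, `ent_seq`, `k_ele/keb`, `r_ele/reb` and `sense` elements. The two sample entries, their sequence numbers and the function name `single_sense_candidates` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented two-entry sample in JMdict's XML layout (not real dictionary data).
SAMPLE_JMDICT = """<JMdict>
<entry>
  <ent_seq>9000001</ent_seq>
  <k_ele><keb>例一</keb></k_ele>
  <r_ele><reb>れいいち</reb></r_ele>
  <sense><gloss>first meaning</gloss><gloss>unrelated second meaning</gloss></sense>
</entry>
<entry>
  <ent_seq>9000002</ent_seq>
  <k_ele><keb>例二</keb></k_ele>
  <r_ele><reb>れいに</reb></r_ele>
  <sense><gloss>meaning one</gloss></sense>
  <sense><gloss>meaning two</gloss></sense>
</entry>
</JMdict>"""


def single_sense_candidates(jmdict_xml, multisense_headwords):
    """Return (ent_seq, headword) pairs for entries that have exactly one
    <sense> but whose headword is multi-sense in the reference list."""
    root = ET.fromstring(jmdict_xml)
    hits = []
    for entry in root.iter("entry"):
        if len(entry.findall("sense")) != 1:
            continue  # already split into senses, nothing to flag
        # Prefer kanji forms; fall back to readings for kana-only entries.
        forms = [e.text for e in entry.findall("k_ele/keb")]
        if not forms:
            forms = [e.text for e in entry.findall("r_ele/reb")]
        matched = [f for f in forms if f in multisense_headwords]
        if matched:
            hits.append((entry.findtext("ent_seq"), matched[0]))
    return hits


# Hypothetical headwords that the reference dictionary splits into senses.
reference_multisense = {"例一", "例二"}
candidates = single_sense_candidates(SAMPLE_JMDICT, reference_multisense)
print(candidates)
```

Matching on the written form alone will flag some entries where the reference headword is a different word with the same spelling, so the output is only a list of candidates to review by hand.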

Marcusjmdict commented 2 years ago

It would be good to have a list of all entries that haven't been edited for XX years for quality control reasons too, actually.

JMdictProject commented 2 years ago

As you can see at the bottom of the display of each entry in the database system, there is the option of getting an expanded entry which contains the date-stamped log of comments and references. I should be able to use that information to get a list of entries without edits after a specific date. It will probably take me a couple of weeks to get to it as I'm away from home for a while.
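As a rough illustration of that filter, the sketch below assumes the date stamps have already been pulled out of each entry's log into a mapping from sequence number to a list of dates. That mapping, the sample values and the function name `unedited_since` are hypothetical; the actual log format of the database is not shown here.

```python
from datetime import date


def unedited_since(history, cutoff):
    """Return sequence numbers with no recorded edit on or after `cutoff`.

    Entries with an empty history have no date stamp at all, so they are
    treated as older than any cutoff and are always included.
    """
    stale = []
    for seq, stamps in history.items():
        if not stamps or max(stamps) < cutoff:
            stale.append(seq)
    return sorted(stale)


# Invented history: sequence number -> dates found in the entry's log.
history = {
    9000001: [],
    9000002: [date(2004, 5, 1)],
    9000003: [date(2004, 5, 1), date(2019, 8, 20)],
}
stale = unedited_since(history, date(2010, 1, 1))
print(stale)  # [9000001, 9000002]
```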

JMdictProject commented 2 years ago

OK, I've done a quick analysis of date-stamps on entries.

I could create online lists of these but they may not be a lot of use.

Marcusjmdict commented 2 years ago

Might it be possible to add some type of time stamp to the 47k entries without dates to make them easily searchable? Maybe 2003/12/2 (the day before the earliest time-stamped entries in the database). Or could there be a way to make the advanced search function treat any entry without a date stamp as being older than 2003?

JMdictProject commented 2 years ago

I could probably run a bulk update over the 47k entries, adding an identifiable string to the comments, which could then be used as a search key. They would then be regarded as having been edited. The downside is that they'd keep that string even if they are updated.

Another possibility is that I can use the list of the 47k sequence numbers to generate WWW pages with a summary of each entry and a link to the database entry. If I did pages of 100 entries there'd be 470 of them. It's all a bit indigestible.
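The page count is simple arithmetic, sketched here in Python. The sequence numbers are stand-ins, the figure of 47,000 is the rounded "47k" from this thread, and the file naming scheme is hypothetical.

```python
def paginate(seqs, page_size=100):
    """Split a list of sequence numbers into consecutive pages."""
    return [seqs[i:i + page_size] for i in range(0, len(seqs), page_size)]


def page_filename(page_number):
    """Hypothetical naming scheme for the generated pages."""
    return f"undated_{page_number:03d}.html"


# Stand-in values, not real JMdict sequence numbers.
seqs = list(range(47_000))
pages = paginate(seqs)
print(len(pages), page_filename(1), page_filename(len(pages)))
# 470 undated_001.html undated_470.html
```

If the true count is not an exact multiple of 100, the last page simply comes out shorter than the rest.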

Here's a small sample from the 47k.

1033660 : オールドタイマー (n) old-timer
1033690 : オールドファッション (n) old-fashioned [in GG5, probably adj-no]
1033730 : オールナイト (n) all-night [GG5, etc.]
1033750 : オールパーパス (n) all-purpose
2013710 : 道州制 (どうしゅうせい) (n) administrative reform proposal, involving integration of prefectures into 7 or 9 states [GG5, ルミナス]