digitalpalidictionary / dpd-db


Duplicate entries for headwords #11

Closed falko-strenzke closed 1 week ago

falko-strenzke commented 8 months ago

While working on the MDict processing, I realized that the MDict of this release contains a number of duplicate entries for headwords (i.e. entries that contain HTML and not "@@@LINK="):

"1st": "2nd": "3rd": "acc": "act": "agent": "arahant": "base": "care": "evamādi": "family": "go": "hare": "he": "lit": "noun": "prefix": "pulavaka": "sad": "sandhi": "suffix": "sādayati": "sāriputtamoggalānā": "vayabhiññā": "ve": "vekacaraṃ": "verb": "yo": "āvedhita": "√sadh":

bdhrs commented 8 months ago

Yes, the duplicate entries are a known bug in the MDict exporter. Is that a complete list or a sample?

falko-strenzke commented 8 months ago

Yes, that should be a complete list, since it is the output of the readmydict.py tool, which goes through all the entries and in which I implemented the check for duplicates.
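For readers who want to reproduce this kind of check, here is a minimal sketch of the idea. It is not the actual readmydict.py code. It assumes the entries are already available as `(headword, body)` pairs, and the function name `find_duplicate_headwords` and the sample data are made up for illustration. An entry counts only if its body does not start with "@@@LINK=".

```python
from collections import Counter

# Redirect entries start with this marker and are not counted as real entries.
LINK_PREFIX = "@@@LINK="


def find_duplicate_headwords(entries):
    """Return the sorted headwords that have more than one non-redirect entry.

    `entries` is an iterable of (headword, body) pairs. This input shape is
    an assumption for the sketch, not the format used by readmydict.py.
    """
    counts = Counter(
        headword
        for headword, body in entries
        if not body.lstrip().startswith(LINK_PREFIX)
    )
    return sorted(hw for hw, n in counts.items() if n > 1)


# Made-up sample data, for illustration only.
sample = [
    ("family", "<div>help entry</div>"),
    ("family", "<div>English-to-Pāḷi entry</div>"),
    ("go", "<div>only one entry</div>"),
    ("gone", "@@@LINK=go"),
    ("gone", "@@@LINK=gacchati"),
]

print(find_duplicate_headwords(sample))  # ['family']
```

Two redirect entries under the same headword are ignored here, which matches the definition of a duplicate given in the initial report.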

bdhrs commented 8 months ago

I see, those are a different kind of duplicate entry, where the same headword occurs in different sections of DPD. For example, "family" is in Help (green colour) and English-to-Pāḷi (purple).

falko-strenzke commented 8 months ago

But as a reminder: "all the entries" here refers only to the headwords (as explained in the initial report above).