digitalpalidictionary / dpd-db


Spurious keys in MDict data #12

Closed by falko-strenzke 8 months ago

falko-strenzke commented 8 months ago

In the August 2023 MDict release there is a key (headword, i.e. an entry with HTML content):

the form of the prefix <i>adhi</i>- before all vowels except ī

Probably that is not meant to be a key.

bdhrs commented 8 months ago

Strangely it is actually a key, auto-generated for the English to Pāḷi dictionary. See "ajjh".

falko-strenzke commented 8 months ago

Keys that contain HTML tags cause a problem when I use them directly in a URL, because of the "/" character in the closing tag. It may make sense to remove the HTML tags from the export at some point, as they most likely make lookup harder rather than easier.

Until then I will continue to filter them out; I don't think we are losing much here. At some point I could of course also strip the HTML tags during parsing.
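
For illustration, here is a minimal Python sketch of the three options mentioned in this thread: filtering out keys that contain tags, stripping the tags during parsing, and percent-encoding a key before putting it into a URL. The function names are hypothetical and are not part of dpd-db or any MDict tooling; the regex is a simple assumption that works for well-formed inline tags such as `<i>...</i>` and is not a general HTML parser.

```python
import re
from html import unescape
from urllib.parse import quote

# Assumption: tags in keys are simple inline tags like <i>...</i>.
TAG_RE = re.compile(r"<[^>]+>")


def has_html_tags(key: str) -> bool:
    """Return True if the key contains something that looks like an HTML tag."""
    return TAG_RE.search(key) is not None


def strip_html_tags(key: str) -> str:
    """Remove tag-like spans and decode entities such as &amp;."""
    return unescape(TAG_RE.sub("", key))


def key_to_url_segment(key: str) -> str:
    """Percent-encode a key so it contains no literal '/' or other reserved characters."""
    # safe="" makes quote() encode "/" as well (it is left alone by default).
    return quote(key, safe="")


key = "the form of the prefix <i>adhi</i>- before all vowels except ī"

# Option 1: filter such keys out.
keys = [key, "ajjh"]
filtered = [k for k in keys if not has_html_tags(k)]

# Option 2: strip the tags while parsing.
cleaned = strip_html_tags(key)

# Option 3: keep the key but encode it for use as a URL path segment.
segment = key_to_url_segment(key)
```

Note that even with percent-encoding, some web servers and routing layers decode `%2F` back to "/" before matching the path, so stripping or filtering may still be the more robust choice.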