PerseusDL / lexica

Repo for the text files of lexica
Creative Commons Attribution Share Alike 4.0 International

Lewis and Short: carets in many entries; misreading of double diacritic mark #41

Open ids1024 opened 7 years ago

ids1024 commented 7 years ago

For example academia appears as ăcădēmī^a.


I noticed this running a script to test if the key matches the first word of the entry (after stripping accents, etc.). It should be possible to write a script that fixes this; I don't think there should be (m)any false positives.
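A check of this kind could be sketched roughly as follows. This is not the script used above, only a minimal illustration; the function names `normalize` and `key_matches` are hypothetical, and it assumes that "stripping accents, etc." means dropping combining marks, carets, punctuation, and digits before comparing.

```python
import unicodedata


def normalize(word: str) -> str:
    """Reduce a headword to bare lowercase letters.

    NFD decomposition splits accented letters into a base letter plus
    combining marks; keeping only alphabetic characters then drops the
    marks, along with carets, punctuation, and digits.
    """
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if ch.isalpha()).lower()


def key_matches(key: str, entry_text: str) -> bool:
    """Return True if the entry's first word matches its key (assumed layout)."""
    words = entry_text.split()
    if not words:
        return False
    return normalize(key) == normalize(words[0])


if __name__ == "__main__":
    # The headword from this issue: the caret is ignored by the comparison.
    print(key_matches("academia", "ăcădēmī^a, ae, f."))
```

Because the caret is discarded during normalization, the comparison itself passes for such entries; flagging them would need a separate test for `"^"` in the first word.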

lcerrato commented 7 years ago

Hi. The lexica were encoded before Unicode equivalents existed for the various accents and stress marks, so there are idiosyncrasies throughout. The caret was an intentional stopgap. Neither of the large lexica has been converted to Unicode yet.

ids1024 commented 7 years ago

Fair enough. Assuming the caret always represents the same thing, it shouldn't be too hard to script the conversion.
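As a rough sketch of what such a script might look like: the thread does not say which mark the caret stands for, so the replacement is a parameter here. The default, U+0306 COMBINING BREVE, is purely an assumption, and the name `fix_carets` is hypothetical.

```python
import re
import unicodedata

# Assumption: the caret marks a breve on the preceding letter.
# Change this if the caret turns out to encode something else.
ASSUMED_MARK = "\u0306"  # COMBINING BREVE

# A caret that directly follows a letter (not a digit, underscore, or space).
CARET_AFTER_LETTER = re.compile(r"(?<=[^\W\d_])\^")


def fix_carets(text: str, mark: str = ASSUMED_MARK) -> str:
    """Replace each caret that follows a letter with a combining mark.

    The text is composed to NFC first so that an accented letter is a
    single code point and the lookbehind sees it as a letter.
    """
    composed = unicodedata.normalize("NFC", text)
    return CARET_AFTER_LETTER.sub(mark, composed)


if __name__ == "__main__":
    fixed = fix_carets("ăcădēmī^a")
    # Show the code points so the inserted combining mark is visible.
    print([f"U+{ord(ch):04X}" for ch in fixed])
```

Carets that do not follow a letter are left alone, which keeps the change conservative; any remaining carets could then be reviewed by hand.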

Would a PR applying such a change be accepted, or are there issues to address first?

lcerrato commented 7 years ago

All of the entities need conversion to Unicode and there are probably similar encoding gaps, etc. Then the markup itself needs conversion. The primary sources are the priority at this writing. As noted in the wiki, none of the changes made here will be visible in Perseus in the short term, but we have been accepting pull requests on these versions.