PerseusDL / canonical-greekLit

XML Canonical resources for Greek Literature
https://scaife.perseus.org
Creative Commons Attribution Share Alike 4.0 International

Unicode consistency #1412

Open gcelano opened 1 year ago

gcelano commented 1 year ago

A few texts (see below) contain Greek that is not in Unicode Normalization Form C ("NFC"). In many cases, for example, characters that look alike come from either the Unicode block "Greek Extended" or the Unicode block "Greek and Coptic", usually with no way to tell them apart visually. I'm wondering whether (future) texts should be checked for Unicode normalization before they are added to the repository. This discrepancy can harm further automatic processing of the texts. (This is a short script that could handle this issue, if needed.)
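To make the discrepancy concrete, here is a minimal sketch (not the script mentioned above) that uses Python's standard `unicodedata` module. The helper name `non_nfc_lines` and the sample strings are hypothetical and chosen only for illustration; `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata


def non_nfc_lines(text):
    """Return (line_number, line) pairs for lines that are not in NFC."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]


# Two code points that are canonically equivalent and usually render identically.
oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA (block: Greek Extended)
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS (block: Greek and Coptic)

# Hypothetical two-line sample: line 1 is already NFC, line 2 is not.
sample = "\u03ba\u03b1" + tonos + "\n" + "\u03ba\u03b1" + oxia + "\n"

for number, line in non_nfc_lines(sample):
    print(number, [f"U+{ord(ch):04X}" for ch in line])
```

NFC maps the "Greek Extended" oxia form to the "Greek and Coptic" tonos form, so a plain string comparison or search that treats the two as different will miss matches unless both sides are normalized first.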

(partial list based on an older release)

tlg0033.tlg003.perseus-grc2/tlg0033.tlg003.perseus-grc2.xml
tlg3135.tlg001.opp-grc3/tlg3135.tlg001.opp-grc3.xml
tlg0031.tlg027.perseus-grc2/tlg0031.tlg027.perseus-grc2.xml
tlg0011.tlg002.perseus-grc2/tlg0011.tlg002.perseus-grc2.xml
tlg0008.tlg001.perseus-grc4/tlg0008.tlg001.perseus-grc4.xml
tlg0551.tlg009.perseus-grc2/tlg0551.tlg009.perseus-grc2.xml
tlg0086.tlg034.digicorpus-grc1/tlg0086.tlg034.digicorpus-grc1.xml
tlg0551.tlg014.perseus-grc2/tlg0551.tlg014.perseus-grc2.xml
tlg0033.tlg002.perseus-grc2/tlg0033.tlg002.perseus-grc2.xml
tlg0090.tlg001.opp-grc1/tlg0090.tlg001.opp-grc1.xml

gregorycrane commented 1 month ago

I have addressed the particular issue with using precomposed Greek for Aristotle's Poetics. I ran the conversion over the whole file. The results are stored here:

https://github.com/gregorycrane/Poetics2.0/blob/main/grc/tlg0086.tlg034.digicorpus-grc2.xml
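The exact conversion used for this file is not shown in the thread. As a rough sketch only, a whole-file NFC conversion could look like the following, assuming the XML is UTF-8 encoded. The function name `normalize_file` and the temporary file names are hypothetical.

```python
import tempfile
import unicodedata
from pathlib import Path


def normalize_file(source, target):
    """Write an NFC-normalized copy of `source` to `target`.

    Assumes UTF-8 input. Returns True if normalization changed the text.
    Bytes are read and written directly so line endings are left untouched.
    """
    text = Path(source).read_bytes().decode("utf-8")
    normalized = unicodedata.normalize("NFC", text)
    Path(target).write_bytes(normalized.encode("utf-8"))
    return normalized != text


# Demonstration on a throwaway file rather than a real repository path.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input.xml"
    target = Path(tmp) / "output.xml"
    # One "Greek Extended" alpha with oxia, one decomposed alpha + combining acute.
    source.write_bytes("<p>\u1f71 \u03b1\u0301</p>".encode("utf-8"))
    changed = normalize_file(source, target)
    result = target.read_bytes().decode("utf-8")

print(changed, [f"U+{ord(ch):04X}" for ch in result if ord(ch) > 127])
```

Writing to a separate target keeps the original file available for a diff, which makes it easier to confirm that only code points changed and the markup did not.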

I think it would be more efficient if @lcerrato and/or @AlisonBabeu checked this file for any obvious issues (it parses and seems fine to me) and then used it to replace the current file.