PerseusDL / lexica

Repo for the text files of lexica
Creative Commons Attribution Share Alike 4.0 International

Unable to parse due to "entity not defined" errors #31

Closed: ids1024 closed this issue 7 years ago

ids1024 commented 7 years ago

Entities are an aspect of XML I am not very familiar with, so I may be doing something wrong.

Anyway, when I try to run lxml.etree.parse("lat.ls.perseus-eng1.xml") in Python to parse the XML file for Lewis and Short, I get this error:

lxml.etree.XMLSyntaxError: Entity 'dagger' not defined, line 288, column 27

lcerrato commented 7 years ago

Hi. You need to do one of three things: declare the entity within the XML file, point to an external DTD (this file points to http://www.perseus.tufts.edu/DTD/1.0/PersDict.dtd, which itself refers to other DTDs), or convert the entities to Unicode. The last option is what we are doing with other texts, as documented here: https://github.com/PerseusDL/tei-conversion-tools/wiki/HTML-Entities That page also discusses DTDs. For this particular entity, the dagger is U+2020.
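
For anyone who wants to try the convert-to-Unicode route without touching DTDs, here is a minimal sketch using only the Python standard library. It rewrites named entity references into numeric character references before parsing. The function name replace_named_entities and the mapping are hypothetical, and only the dagger value (U+2020) comes from this thread; other entities in the file would need their own entries. It is a plain text substitution, so it would also touch matches inside comments or CDATA sections.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping: only dagger = U+2020 is taken from this thread.
ENTITY_TO_CODEPOINT = {"dagger": 0x2020}

# Matches references such as &dagger; (a name between '&' and ';').
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# The five entities predefined by XML itself must be left alone.
_XML_BUILTINS = {"amp", "lt", "gt", "quot", "apos"}


def replace_named_entities(xml_text, mapping=ENTITY_TO_CODEPOINT):
    """Rewrite known named entities as numeric character references."""

    def substitute(match):
        name = match.group(1)
        if name in _XML_BUILTINS or name not in mapping:
            # Unknown names are left untouched so the parser still reports them.
            return match.group(0)
        return "&#x{:X};".format(mapping[name])

    return _ENTITY_RE.sub(substitute, xml_text)


# Small made-up sample, not an excerpt from the real lexicon file.
sample = "<entry><orth>&dagger; abactor</orth> &amp; more</entry>"
converted = replace_named_entities(sample)
root = ET.fromstring(converted)
```

After this, root.find("orth").text starts with the actual dagger character, and the parser never sees an undeclared entity.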

ids1024 commented 7 years ago

Thanks for your help. Apparently network access is disabled in lxml by default (which is sensible enough, though not very prominent in the documentation). Creating a parser with no_network=False makes it retrieve the needed DTD files and parse without error.
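
For reference, a minimal sketch of that parser setup. The helper names are hypothetical. The load_dtd=True argument is an assumption added here to request DTD loading explicitly; the comment above only mentions no_network=False. Whether entities from the external DTD are resolved may also depend on the lxml version and its resolve_entities default, and the DTD URL has to be reachable.

```python
from lxml import etree


def make_dtd_loading_parser():
    """Build an lxml parser that is allowed to fetch DTDs over the network."""
    # no_network=False lifts lxml's default block on network lookups.
    # load_dtd=True is an assumption of this sketch, not from the thread.
    return etree.XMLParser(no_network=False, load_dtd=True)


def parse_with_remote_dtd(path):
    """Parse an XML file, letting the parser retrieve its external DTD."""
    return etree.parse(path, make_dtd_loading_parser())
```

Usage would be parse_with_remote_dtd("lat.ls.perseus-eng1.xml"). Allowing network access is probably best kept to files from a trusted source.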