arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

A new look on Handrit.is data
https://arbeitsgruppe-digitale-altnordistik.github.io/Sammlung-Toole/
MIT License

loading data causes lots of warnings #78

Closed BalduinLandolt closed 2 years ago

BalduinLandolt commented 2 years ago

There seems to be an issue with loading the XML files from Handrit: parsing fails for some of them, and each failure is logged as an error.

The log output looks roughly like this:

2021-11-01 19:11:44,540 [ util.tamer ] - ERROR:   Faild to load Shelfmark XML:

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="https://handrit.is/schema/handrit-2.0.rnc" type="compact"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
...
</TEI>

Traceback (most recent call last):
  File ".\util\tamer.py", line 50, in _get_shelfmark
    root = etree.fromstring(content.encode())
  File "src\lxml\etree.pyx", line 3237, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1784, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 257
lxml.etree.XMLSyntaxError: ID IB04-0250-NULL already defined, line 257, column 41

steps to reproduce:

BalduinLandolt commented 2 years ago

Note that lxml has a parser option, XMLParser(recover=True), which might help: it tells the parser to try to continue past errors such as the duplicate ID above instead of raising XMLSyntaxError.
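A minimal sketch of how the recover=True option could be applied, assuming the failure comes from a duplicated xml:id as in the traceback. The sample document and the helper name parse_lenient are hypothetical and are not taken from util/tamer.py; only etree.XMLParser, etree.fromstring and the parser's error_log are real lxml API.

```python
from lxml import etree

# Hypothetical sample (not real Handrit data): two elements share one xml:id,
# which is the condition reported in the traceback.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <name xml:id="IB04-0250-NULL">first</name>
  <name xml:id="IB04-0250-NULL">second</name>
</TEI>"""


def parse_lenient(content: str):
    """Parse XML with a recovering parser; return (root, logged messages)."""
    # recover=True asks lxml to keep going after errors instead of raising.
    parser = etree.XMLParser(recover=True)
    # Encode to bytes, as the original code does: lxml rejects str input
    # that carries an XML encoding declaration.
    root = etree.fromstring(content.encode(), parser=parser)
    # Collect whatever the parser complained about, so problems stay visible.
    messages = [entry.message for entry in parser.error_log]
    return root, messages


root, messages = parse_lenient(SAMPLE)
print(root.tag)
for message in messages:
    print("parser message:", message)
```

Returning the logged messages keeps problems visible, so the duplicated IDs could still be reported once per file rather than with a full traceback. Whether recovered trees are good enough for the later processing steps would need to be checked against real Handrit files.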