CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

ETDs: Encoding error has occurred while attempting to parse an ETD zip #1919

elopatin-uc3 closed this issue 4 months ago

elopatin-uc3 commented 4 months ago

Traceback (most recent call last):
  File "/etds/apps/uc3-etds/scripts/getetds.py", line 334, in <module>
    main()
  File "/etds/apps/uc3-etds/scripts/getetds.py", line 322, in main
    (merritt_pm_list, pq_md_list) = extract_metadata(hostenv)
  File "/etds/apps/uc3-etds/scripts/getetds.py", line 191, in extract_metadata
    dom = ET.fromstring(pqstring)
  File "src/lxml/etree.pyx", line 3222, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 90
lxml.etree.XMLSyntaxError: Char 0xFFFE out of allowed range, line 90, column 174

elopatin-uc3 commented 4 months ago

More than 160 ETDs are waiting to be processed in the zipfiles directory.

elopatin-uc3 commented 4 months ago

Found the specific zip that triggers the error by adding a debug statement to getetds.py.
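For anyone hitting the same problem, here is a minimal sketch of one way to find which zip holds the unparseable XML. It is not the debug statement that was actually added to getetds.py. The function name `find_unparseable_xml` is hypothetical, and it uses the standard-library parser as a stand-in for lxml. With lxml, the exception to catch would be `lxml.etree.XMLSyntaxError`, as in the traceback above.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparseable_xml(zip_dir):
    """Return (zip name, member name, error text) for each XML member that fails to parse.

    Hypothetical helper: assumes every zip in zip_dir may contain one or more
    .xml members and that each should be parseable on its own.
    """
    failures = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            for member in zf.namelist():
                # Skip PDFs and other non-XML members.
                if not member.lower().endswith(".xml"):
                    continue
                try:
                    ET.fromstring(zf.read(member))
                except ET.ParseError as exc:
                    # Record the failure and keep scanning, so that one bad
                    # zip does not hide problems in the others.
                    failures.append((zip_path.name, member, str(exc)))
    return failures
```

Calling `find_unparseable_xml("zipfiles")` and printing each tuple should name the zip and member to inspect, along with the parser's own error message.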

elopatin-uc3 commented 4 months ago

Removed the problematic character (0xFFFE) from the abstract in the ETD XML.
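The character here was removed by hand. If this recurs, a sketch like the following could strip such characters before parsing. It assumes the XML text is already decoded to a `str`. The helper name `strip_invalid_xml_chars` is hypothetical and is not part of getetds.py. The pattern follows the XML 1.0 `Char` production, which excludes U+FFFE and U+FFFF.

```python
import re

# Anything outside the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 does not allow removed."""
    return _INVALID_XML_CHARS.sub("", text)
```

Stripping silently changes the submitted metadata, so it may be worth logging which ETDs were altered rather than applying this unconditionally.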

Now running getetds.py to ingest all waiting ETDs in the zipfiles directory.

elopatin-uc3 commented 4 months ago

The script run completed successfully.