OP-TED / ted-rdf-conversion-pipeline

TED Semantic Web Services
Apache License 2.0

Handle parsing issues in METs packages #549

Open cristianvasquez opened 3 weeks ago

cristianvasquez commented 3 weeks ago

Some METS packages have been reported to fail parsing because of problems in their contents. The causes identified so far are:

  1. Character encoding issues

     org.xml.sax.SAXParseException; lineNumber: 24; columnNumber: 48; The entity name must immediately follow the '&' in the entity reference.

  2. It is not allowed to have HTML markup in the title text

cristianvasquez commented 1 week ago

Apparently the fix is to escape the contents in the Jinja XML templates using the escape filter (e):

https://tedboy.github.io/jinja2/templ10.html

For instance, https://github.com/OP-TED/ted-rdf-conversion-pipeline/blob/01673c0b8dc93e05740cf7d989365f0f13a7b9f8/ted_sws/notice_packager/resources/templates/mets_xml_dmd_rdf.jinja2#L24

becomes

        <cdm:work_title xml:lang="{{ lang }}">{{ work.title[lang]| e }}</cdm:work_title>
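
To illustrate why escaping should address both reported causes, here is a minimal sketch using only the Python standard library. It is not the pipeline's code: the sample title, the simplified `work_title` element (without the `cdm:` prefix) and the helpers `render_title` and `parses` are made up for illustration, and `xml.sax.saxutils.escape` stands in for Jinja's `e` filter, which escapes the same `&`, `<` and `>` characters (plus quotes).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical title with a bare ampersand and HTML markup,
# mirroring the two failure causes listed in this issue.
title = "Supplies & services for <b>road works</b>"

# Simplified stand-in for the template line (assumption: no namespace prefix).
TEMPLATE = '<work_title xml:lang="en">{}</work_title>'


def render_title(text, escaped):
    """Fill the element, optionally escaping &, < and > first."""
    return TEMPLATE.format(escape(text) if escaped else text)


def parses(xml_text):
    """Return True if xml_text is well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw = render_title(title, escaped=False)
safe = render_title(title, escaped=True)

print(parses(raw))   # the bare '&' makes the document ill-formed
print(parses(safe))  # the escaped variant parses
print(ET.fromstring(safe).text == title)  # the parser restores the original text
```

With escaping, the markup in the title is carried as literal text instead of being interpreted as child elements, so the title survives a round trip unchanged.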