adsabs / ADSIngestParser

Curation parser library
MIT License

Do not unescape &lt; and &gt; in XML output #126

Closed ehenneken closed 2 weeks ago

ehenneken commented 3 weeks ago

Describe the bug Reference data has been created where &lt; and &gt; in the original data were unescaped into < and >, making the XML invalid.

To Reproduce Example:

grep sref18 /proj/ads/references/sources/ACCR/0014/2023ACCR...14..691W.elsevier.xml 
<sb:reference id="sref18"><sb:contribution langtype="en"><sb:authors><sb:author><ce:given-name>I.G.</ce:given-name><ce:surname>Rigor</ce:surname></sb:author><sb:author><ce:given-name>J.M.</ce:given-name><ce:surname>Wallace</ce:surname></sb:author><sb:author><ce:given-name>R.L.</ce:given-name><ce:surname>Colony</ce:surname></sb:author></sb:authors><sb:title><sb:maintitle>Response of sea ice to the Arctic oscillation</sb:maintitle></sb:title></sb:contribution><sb:host><sb:issue><sb:series><sb:title><sb:maintitle>J. Clim.</sb:maintitle></sb:title><sb:volume-nr>15</sb:volume-nr></sb:series><sb:date>2002</sb:date></sb:issue><sb:pages><sb:first-page>2648</sb:first-page><sb:last-page>2663</sb:last-page></sb:pages><ce:doi>10.1175/1520-0442(2002)015<2648:ROSITT>2.0.CO;2</ce:doi></sb:host></sb:reference>

Additional context The html.unescape() function causes this behavior: it converts the &lt; and &gt; entities in the source back into literal < and > characters.
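A minimal sketch of the failure mode, using only the Python standard library. The un-prefixed `<doi>` tag and the `is_well_formed` helper are illustrative assumptions (the real data uses the namespaced `ce:doi` element); the DOI string is the one from the example above.

```python
import html
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# DOI as it should appear in valid XML: angle brackets are escaped.
escaped = "<doi>10.1175/1520-0442(2002)015&lt;2648:ROSITT&gt;2.0.CO;2</doi>"

# html.unescape() turns the entities back into literal < and >,
# so the parser now sees "<2648:ROSITT>" as a (malformed) tag.
unescaped = html.unescape(escaped)

print(is_well_formed(escaped))    # True
print(is_well_formed(unescaped))  # False
```

Running `html.unescape()` over a string that is itself raw XML is what breaks it; the same call on a plain-text field (for example a title) is harmless.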

seasidesparrow commented 3 weeks ago

I believe we currently only put raw xml from the input document in the references field. One simple fix is to modify base parser's '_entity_convert' method to ignore self.base_metadata["references"].

But if, in the future, we start to use the fulltext field in ingest_data_model, we may have to think about this more.

@ehenneken have you noticed whether we've parsed documents themselves that have <> in the DOI, and if so is it going into the document's "doi" field correctly?

seasidesparrow commented 3 weeks ago

In base.py we define the IngestBase class (https://github.com/adsabs/ADSIngestParser/blob/8e1796d0e6490aff99c6bb8dccbfaf93d29a130d/adsingestp/parsers/base.py#L10) with no instance variables set. We could add an instance variable like self.xml_ref here that is by default set to True. The method _entity_convert could then check this variable, and convert (or not) the key-value pair for "references". If we make this change and set default to True, then references (only) will be output in the ingest_data_model in xml-compliant format.

If and when we use a different system for processing references following parsing, we can simply instantiate parsers relying on IngestBase with IngestBase(xml_ref=False).
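A rough sketch of what that proposal could look like. This is not the code in base.py: the class name `IngestBaseSketch`, the `_unescape` helper, and the way `_entity_convert` walks `base_metadata` are all assumptions made for illustration. Only the `xml_ref` flag, its default of True, and the "references" key come from the discussion above.

```python
import html


class IngestBaseSketch:
    """Hypothetical stand-in for IngestBase, showing only the xml_ref idea."""

    def __init__(self, xml_ref=True):
        # When True, "references" are left as raw, XML-compliant strings.
        self.xml_ref = xml_ref
        self.base_metadata = {}

    def _unescape(self, value):
        # Recursively unescape HTML entities in strings, lists and dicts.
        if isinstance(value, str):
            return html.unescape(value)
        if isinstance(value, list):
            return [self._unescape(v) for v in value]
        if isinstance(value, dict):
            return {k: self._unescape(v) for k, v in value.items()}
        return value

    def _entity_convert(self):
        # Iterate over a copy of the keys so values can be reassigned safely.
        for key in list(self.base_metadata):
            if key == "references" and self.xml_ref:
                continue  # keep &lt; / &gt; escaped so the XML stays valid
            self.base_metadata[key] = self._unescape(self.base_metadata[key])


parser = IngestBaseSketch()  # xml_ref defaults to True
parser.base_metadata = {
    "title": "Sea ice &amp; the Arctic oscillation",
    "references": ["<ce:doi>10.1175/1520-0442(2002)015&lt;2648:ROSITT&gt;2.0.CO;2</ce:doi>"],
}
parser._entity_convert()
print(parser.base_metadata["title"])       # entities converted
print(parser.base_metadata["references"])  # left untouched
```

With `IngestBaseSketch(xml_ref=False)` the references would be unescaped like every other field, which matches the behavior described for a future non-XML reference pipeline.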