ehenneken closed this issue 3 months ago
I believe we currently only put raw xml from the input document in the references field. One simple fix is to modify base parser's '_entity_convert' method to ignore self.base_metadata["references"].
But if, in the future, we start to use the fulltext field in ingest_data_model, we may have to think about this more.
@ehenneken have you noticed whether we've parsed any documents that themselves have <> in the DOI, and if so, is it going into the document's "doi" field correctly?
In base.py we define the IngestBase class (https://github.com/adsabs/ADSIngestParser/blob/8e1796d0e6490aff99c6bb8dccbfaf93d29a130d/adsingestp/parsers/base.py#L10) with no instance variables set. We could add an instance variable like self.xml_ref here that is set to True by default. The method _entity_convert could then check this variable and convert (or not) the key-value pair for "references". If we make this change and set the default to True, then references (only) will be output in the ingest_data_model in XML-compliant format.
If and when we use a different system for processing references following parsing, we can simply instantiate parsers relying on IngestBase with IngestBase(xml_ref=False).
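A rough sketch of how that flag could look follows. It is not the actual IngestBase code: the class name IngestBaseSketch is hypothetical, and html.unescape() stands in for whatever conversion _entity_convert really performs. Only the xml_ref flag and the "references" key come from the discussion above.

```python
import html


class IngestBaseSketch:
    """Hypothetical stand-in for IngestBase; not the real class."""

    def __init__(self, xml_ref=True):
        # When True, the "references" entry is left as XML-compliant text.
        self.xml_ref = xml_ref
        self.base_metadata = {}

    def _entity_convert(self):
        # Assumption: conversion amounts to html.unescape() on string values.
        for key, value in list(self.base_metadata.items()):
            if key == "references" and self.xml_ref:
                continue  # keep raw XML references untouched
            if isinstance(value, str):
                self.base_metadata[key] = html.unescape(value)
            elif isinstance(value, list):
                self.base_metadata[key] = [
                    html.unescape(v) if isinstance(v, str) else v for v in value
                ]


parser = IngestBaseSketch()  # xml_ref defaults to True
parser.base_metadata = {
    "title": "a &lt; b",
    "references": ["<ref>x &lt; y</ref>"],
}
parser._entity_convert()
print(parser.base_metadata["title"])
print(parser.base_metadata["references"])
```

With the default, only the title is unescaped. Passing xml_ref=False would restore the current behavior, where the references are unescaped as well.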
Describe the bug Reference data has been created where &lt; and &gt; in the original data were unescaped into < and >, making the XML invalid.
To Reproduce Example:
Additional context The html.unescape() function causes this behavior.
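The effect can be shown in isolation with the standard library alone. The reference fragment below is made up for illustration and is not taken from the affected data. The is_well_formed helper is likewise hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Made-up reference fragment with escaped angle brackets inside the DOI text.
raw_ref = "<ref><doi>10.1000/&lt;123::AID&gt;3.0.CO;2-1</doi></ref>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# html.unescape() turns &lt; and &gt; back into literal < and >.
unescaped_ref = html.unescape(raw_ref)

print(unescaped_ref)
print(is_well_formed(raw_ref))        # escaped input parses
print(is_well_formed(unescaped_ref))  # unescaped output no longer parses
```

The unescaped string contains a literal `<123::AID>`, which the parser tries to read as a tag and rejects.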