adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+

Verify proper encoding of entities in direct/classic ingest #254

Open aaccomazzi opened 3 years ago

aaccomazzi commented 3 years ago

An issue surfaced with the encoding of the basic "&lt;" XML entity, which may have been caused by direct ingest (or by a bug in classic ingest). One example is the bibcode 2020arXiv201000466H. When properly encoded, the abstract should have the following content:

We use data from the DESI Legacy Survey imaging to probe the galaxy density field in tomographic slices covering the redshift range $0&lt;z&lt;0.8$.

Rather than:

We use data from the DESI Legacy Survey imaging to probe the galaxy density field in tomographic slices covering the redshift range $0<z<0.8$.

aaccomazzi commented 3 years ago

After checking how we encode data in SOLR, I believe this is the list of fields for which we need to escape the basic XML entities ("<", ">", and "&"):

The code for this should be as simple as:

def escape(text):
    # Escape "&" first so the "&" introduced by "&lt;" and "&gt;" is not escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text
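
For comparison, the Python standard library's xml.sax.saxutils.escape performs the same three replacements, so a hand-written helper may not be needed. The sketch below is only an illustration: the abstract fragment is taken from the example above, the name xml_escape is just a local alias, and whether the pipeline would call this at ingest time or elsewhere is an assumption, not something established in this thread.

```python
from xml.sax.saxutils import escape as xml_escape  # stdlib; escapes &, <, >

# Fragment of the abstract for 2020arXiv201000466H, as raw (unescaped) text.
abstract = "covering the redshift range $0<z<0.8$."

escaped = xml_escape(abstract)
print(escaped)  # covering the redshift range $0&lt;z&lt;0.8$.

# Caveat: escaping is not idempotent. Running it on text that is already
# escaped turns "&lt;" into "&amp;lt;", so each field should be escaped
# exactly once on its way into the index.
twice = xml_escape(escaped)
print(twice)  # covering the redshift range $0&amp;lt;z&amp;lt;0.8$.
```

The second call shows why it matters to verify which ingest path (direct or classic) has already escaped a field before applying the function again.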