adsabs / ADSIngestParser

Curation parser library
MIT License

Base parser should include code for entity normalization #106

Open seasidesparrow opened 3 months ago

seasidesparrow commented 3 months ago

Is your feature request related to a problem? Please describe.
The ADS has a policy of converting some HTML entities to their ASCII equivalents. For example, &bdquo; („) should be converted to an ASCII double quote (") as part of record normalization. This can and should happen at parse time, because it applies to all incoming records.
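To make the request concrete, here is a minimal sketch of the kind of conversion described above. The mapping table and the name normalize_entities are hypothetical and for illustration only; they are not the actual ADS policy list, which may cover different entities.

```python
import re

# Hypothetical mapping for illustration; NOT the actual ADS policy table.
ENTITY_TO_ASCII = {
    "&bdquo;": '"',
    "&ldquo;": '"',
    "&rdquo;": '"',
    "&lsquo;": "'",
    "&rsquo;": "'",
}

# Matches named entities such as &bdquo; or &alpha;
_ENTITY_RE = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def normalize_entities(text):
    """Replace entities listed in ENTITY_TO_ASCII; leave all others untouched."""
    return _ENTITY_RE.sub(
        lambda m: ENTITY_TO_ASCII.get(m.group(0), m.group(0)), text
    )


print(normalize_entities("&bdquo;Dark matter&ldquo; in &alpha; Cen"))
# prints: "Dark matter" in &alpha; Cen
```

Entities that are not in the table (here &alpha;) pass through unchanged, so the policy stays an explicit allow-list rather than a blanket unescape.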

The old ingest parser had code to do this work here: https://github.com/adsabs/adsabs-pyingest/blob/master/pyingest/parsers/entity_convert.py

Describe the solution you'd like
A generic entity converter should be implemented in the base parser, so that all fields are subject to normalization at parse time.
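One possible shape for this, as a rough sketch only: the base class walks the parsed record and applies a converter to every string field. BaseParserSketch, normalize, and the converter passed in below are all hypothetical names, not the existing ADSIngestParser API, and the one-entity converter is a stand-in for a real one.

```python
class BaseParserSketch:
    """Hypothetical stand-in for the base parser; not the real ADSIngestParser class."""

    def __init__(self, converter):
        # converter: any callable taking a str and returning a normalized str
        self.converter = converter

    def normalize(self, value):
        """Recursively apply the converter to every string in a parsed record."""
        if isinstance(value, str):
            return self.converter(value)
        if isinstance(value, dict):
            return {key: self.normalize(item) for key, item in value.items()}
        if isinstance(value, list):
            return [self.normalize(item) for item in value]
        if isinstance(value, tuple):
            return tuple(self.normalize(item) for item in value)
        # Numbers, None, etc. are returned unchanged
        return value


# Stand-in converter that handles a single entity, for demonstration only
parser = BaseParserSketch(lambda s: s.replace("&bdquo;", '"'))
record = {
    "title": "&bdquo;Quasars",
    "authors": [{"name": "A. &bdquo;B"}],
    "year": 2024,
}
normalized = parser.normalize(record)
```

Doing the walk in the base class means individual publisher parsers would not need to call the converter field by field.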
