adsabs / ADSIngestParser

Curation parser library
MIT License

Generalize the ArXiv Parser to cover all dublin core objects #43

Closed by seasidesparrow 8 months ago

seasidesparrow commented 1 year ago

Is your feature request related to a problem? Please describe.

ADS gets Dublin Core records from sources other than ArXiv, for example from Proceedings of Science (SISSA.it). The existing arxiv.ArxivParser can parse them, but it adds fields and data handling that are specific to ArXiv records.

Describe the solution you'd like

We need a lower-level Dublin Core parser capable of parsing, but not augmenting, any Dublin Core-type record. The ArXiv parser should inherit from that, and provide additional parsing and formatting options specific to ArXiv records.

Additional context

The existing arxiv.ArxivParser already contains the essentials of a general Dublin Core parser and provides enough detail to create classic records as-is. If you use ADSManualParser to parse the Dublin Core record in /proj/ads_abstracts/sources/PoS/oai/pos.sissa.it/ECRS/002, you get a well-formatted record, but with a Publication field (%J) crafted specifically for ArXiv records:

%R 2023TEST..........W
%T The Memories of the First European Cosmic Ray Symposium: Łódź 1968
%A Watson, Alan
%F AA()
%D 2023/02
%J eprint arXiv:ECRS/002
%K Astroparticle Physics
%B The origins of the series of European Cosmic-Ray Symposia are briefly described. The first meeting in the series, on ‘Hadronic Interactions and Extensive Air Showers’, held in Łódź, Poland in 1968, was attended by the author: some memories are recounted.

The code in parsers.arxiv.ArxivParser could be moved to a general parser (e.g. parsers.dubcore.DublinCoreParser) with the ArXiv-specific augmentations moved to a new ArXivParser that inherits from DublinCoreParser.
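
The proposed split could look roughly like the sketch below. It is a minimal illustration, not the library's actual code: the class bodies, the output dictionary keys, and the sample XML (built from the field values of the record above) are all assumptions, and only the standard library's xml.etree.ElementTree is used. The base class extracts Dublin Core fields without augmenting them; the subclass adds the ArXiv-style publication string.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set, in ElementTree's {uri}tag form.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


class DublinCoreParser:
    """Hypothetical base parser: extracts fields, adds nothing source-specific."""

    def parse(self, xml_text):
        root = ET.fromstring(xml_text)
        return {
            "title": self._first(root, "title"),
            "authors": self._all(root, "creator"),
            "keywords": self._all(root, "subject"),
            "abstract": self._first(root, "description"),
            "identifiers": self._all(root, "identifier"),
        }

    def _all(self, root, tag):
        # Collect the non-empty text of every matching dc:<tag> element.
        return [
            el.text.strip()
            for el in root.iter(DC_NS + tag)
            if el.text and el.text.strip()
        ]

    def _first(self, root, tag):
        values = self._all(root, tag)
        return values[0] if values else None


class ArxivParser(DublinCoreParser):
    """Hypothetical subclass: adds the ArXiv-specific publication string."""

    def parse(self, xml_text):
        record = super().parse(xml_text)
        if record["identifiers"]:
            # Assumed format, mirroring the %J line in the record above.
            record["publication"] = "eprint arXiv:" + record["identifiers"][0]
        return record


# Assumed sample input; the real PoS file layout may differ.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>The Memories of the First European Cosmic Ray Symposium: Łódź 1968</dc:title>
  <dc:creator>Watson, Alan</dc:creator>
  <dc:subject>Astroparticle Physics</dc:subject>
  <dc:identifier>ECRS/002</dc:identifier>
</oai_dc:dc>"""

generic = DublinCoreParser().parse(SAMPLE)
arxiv = ArxivParser().parse(SAMPLE)
```

With this layout, a PoS record parsed by the base class carries no publication field at all, so the caller can supply a source-appropriate one, while ArXiv records keep their current behavior through the subclass.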

mugdhapolimera commented 8 months ago

https://github.com/adsabs/ADSIngestParser/pull/84