adsabs / ADSIngestParser

Curation parser library
MIT License

Some additional email addresses from MDPI/JATS could be captured with special handling #116

Open seasidesparrow opened 4 months ago

seasidesparrow commented 4 months ago

Is your feature request related to a problem? Please describe.

MDPI JATS files occasionally include email addresses in the body of the affiliation text, each tagged in the text with the initials of the author it belongs to. The current parser strips these addresses instead of parsing them out and adding them as author attributes. In the XML they are not tagged with a specific author id, so unless the initials are parsed, each address would have to be assigned to every author who has that affiliation string.

Describe the solution you'd like

We want to capture the email addresses as part of the record. Intelligent parsing (e.g. matching addresses to authors by their initials) is probably too complicated, but at a minimum the email addresses could either be kept as part of the affiliation itself, or be parsed out as one or more email addresses that are assigned to all authors with that affiliation.
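As a rough illustration of the second option, here is a minimal Python sketch that pulls email addresses out of an affiliation string and attaches them to every author carrying that affiliation. It is not the parser's actual code. The function names `split_affiliation_emails` and `attach_emails`, the `aff` and `email` keys, and the assumed input shape (`address (A.B.)` entries separated by semicolons) are all hypothetical and would need to be checked against the real MDPI input and the ADSIngestParser output schema.

```python
import re

# Deliberately simple email pattern; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Same pattern plus an optional trailing initials tag such as "(A.B.)".
# The tag format is an assumption about how MDPI marks the addresses.
TAGGED_EMAIL_RE = re.compile(
    EMAIL_RE.pattern + r"(?:\s*\((?:[A-Z]\.\s*-?)+\))?"
)


def split_affiliation_emails(aff):
    """Return (affiliation without emails, list of emails found)."""
    emails = EMAIL_RE.findall(aff)
    stripped = TAGGED_EMAIL_RE.sub("", aff)
    # Drop separators left dangling where an address was removed.
    stripped = re.sub(r"\s*[;,]\s*(?=[;,]|$)", "", stripped)
    cleaned = re.sub(r"\s+", " ", stripped).strip(" ;,")
    return cleaned, emails


def attach_emails(authors):
    """Assign every email in an affiliation to each author who has it.

    `authors` is assumed to be a list of dicts with an "aff" list of
    strings; this layout is hypothetical.
    """
    for author in authors:
        cleaned_affs = []
        emails = []
        for aff in author.get("aff", []):
            cleaned, found = split_affiliation_emails(aff)
            cleaned_affs.append(cleaned)
            for email in found:
                if email not in emails:
                    emails.append(email)
        author["aff"] = cleaned_affs
        if emails:
            author["email"] = emails
    return authors


if __name__ == "__main__":
    example = (
        "Department of Physics, Example University, 12345 City, Country; "
        "a.author@example.org (A.A.); b.author@example.org (B.B.)"
    )
    print(split_affiliation_emails(example))
```

Because the initials are discarded here rather than matched, every author sharing the affiliation receives all of its addresses, which is the fallback behaviour described above rather than per-author assignment.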

Additional context

For an example of inputs and (current) outputs, see:

https://github.com/seasidesparrow/ADSIngestParser/blob/6411b00f01831df4617f08d394347d11dee63bf3/tests/stubdata/input/mdpi_symmetry-15-00939.xml#L78

https://github.com/seasidesparrow/ADSIngestParser/blob/main/tests/stubdata/output/mdpi_symmetry-15-00939.json