adsabs / ADSIngestParser

Curation parser library
MIT License

ELSEVIER: subscripts/superscripts not parsing #59

Closed by csgrant00 5 months ago

csgrant00 commented 1 year ago

Subscripts and superscripts are not parsed in the title, keywords, and abstract:

python run.py -p "/proj/ads/abstracts/data/ELS/CONSYN.AST/ELS.080723/0016-7037/S0016703723X00155/S0016703723003332/S0016703723003332.xml" -t elsevier -f elsevier.test

Fractions are affected as well:

python run.py -p "/proj/ads/abstracts/data/ELS/CONSYN.AST/ELS.080723/0012-821X/S0012821X23X00181/S0012821X23003242/S0012821X23003242.xml" -t elsevier -f elsevier.test

seasidesparrow commented 1 year ago

This is an ADSIngestParser issue. Text is extracted from the tagged XML with the .get_text() function, which returns only the text contained within the tags and discards the tags themselves, so subscript, superscript, and fraction markup is lost. The Elsevier parser needs something similar to _detag, where we can select which tags are allowed.
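
To illustrate the difference, here is a minimal sketch using only the standard library. It is not the ADSIngestParser implementation: `strip_all_tags` and `detag_keep` are hypothetical helpers, the first approximating what a text-only extraction like .get_text() returns and the second showing an allow-list approach in the spirit of _detag. The `sub` and `sup` tag names and the sample string are assumptions for illustration; real Elsevier markup may use different element names.

```python
import re

# Matches an opening or closing tag and captures its name (namespace prefix included).
TAG_RE = re.compile(r"</?([A-Za-z][\w:.-]*)[^>]*>")


def strip_all_tags(markup):
    """Drop every tag and keep only the text, roughly what a text-only extraction yields."""
    return TAG_RE.sub("", markup)


def detag_keep(markup, allowed):
    """Drop every tag except those whose name is in `allowed` (hypothetical helper)."""
    def keep_or_drop(match):
        return match.group(0) if match.group(1) in allowed else ""
    return TAG_RE.sub(keep_or_drop, markup)


# Illustrative input, not taken from the files referenced in this issue.
sample = "<title>Solubility of CO<sub>2</sub> at 10<sup>5</sup> Pa</title>"

print(strip_all_tags(sample))              # Solubility of CO2 at 105 Pa
print(detag_keep(sample, {"sub", "sup"}))  # Solubility of CO<sub>2</sub> at 10<sup>5</sup> Pa
```

With every tag stripped, "CO2" and "105" can no longer be told apart from plain digits, which is the symptom reported here. The allow-list variant keeps the formatting tags and removes the rest.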