MrTango / rispy

Python RIS files parser, provides RIS files as dictionary via generator.
MIT License

Handling tags with empty values #62

Open holub008 opened 3 months ago

holub008 commented 3 months ago

Back with another spec corner case: the truncated example below comes from our friends at Embase:

import rispy

test_ris_str = """TY  - JOUR
ID  - 2006713348
T1  - Outcome Measures After Shoulder Stabilization in the Athletic Population: A Systematic Review of Clinical and Patient-Reported Metrics
A1  - Fanning E.
Y1  - 2020//
N2  - Background: Athletic endeavor can require the "athletic shoulder" to tolerate significant load through supraphysiological range and often under considerable repetition. 
Outcome measures are valuable when determining an athlete's safe return to sport...
KW  - *athlete
KW  - biomechanics
KW  - bone remodeling
JF  - Orthopaedic Journal of Sports Medicine
JA  - Orthop. J. Sports Med.
VL  - 8
IS  - 9
SP  -
PB  - SAGE Publications Ltd (E-mail: info@sagepub.co.uk)
SN  - 2325-9671 (electronic)
DO  - http://dx.doi.org/10.1177/2325967120950040
ER  -"""

out = rispy.loads(test_ris_str)
out[0]['number']  # '9 SP  -'

As you can see, the empty SP tag is treated as a wrapped continuation of the IS value, which is not what the RIS writer intended.

Any thoughts on recognizing (and most probably discarding) empty tags like SP here?

It's difficult because detecting and keeping line wraps is extremely useful (see the abstract in N2, which is wrapped in this same record), and it's possible, though relatively unlikely, that a legitimate wrapped line could conflict with the RIS tag format.
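To make the failure mode concrete, here is a minimal sketch using only Python's re module. The STRICT pattern is an assumption (a tag pattern that requires a space after the hyphen), not something copied from rispy's source, and the names STRICT, lines, and classify are purely illustrative.

```python
import re

# Assumed tag pattern: two-character tag, two spaces, hyphen, then a
# mandatory space. The end-of-record tag may be followed by whitespace only.
STRICT = re.compile(r"^[A-Z][A-Z0-9]  - |^ER  -\s*$")

def classify(line):
    """Return 'tag' if the line starts a new field, else 'continuation'."""
    return "tag" if STRICT.match(line) else "continuation"

lines = ["IS  - 9", "SP  -", "PB  - SAGE Publications Ltd", "ER  -"]
for line in lines:
    print(f"{line!r}: {classify(line)}")
# 'SP  -' has no trailing space, so it is classified as a continuation
# and would be appended to the preceding IS value.
```

Under this assumption, any tag whose value is empty and whose trailing space was stripped looks exactly like wrapped text to the parser.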

holub008 commented 3 months ago

This is trivially fixed by making the space following the - after each tag optional, as follows:

class Issue62Override(rispy.RisParser):
    PATTERN = r"^[A-Z][A-Z0-9]  - ?|^ER  -\s*$"

out = rispy.loads(test_ris_str, implementation=Issue62Override)
out[0]['number']  # '9'

So the real question is whether there's interest in making this the default behavior in RisParser. For what it's worth, we maintain an internal test suite of ~30 files from various providers, and this change broke none of our assertions while correcting the issue. I haven't run it against rispy's own test suite, though.
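A rough way to compare the two patterns without rispy installed is a toy line folder like the one below. It is only a simplified model of tag detection and line-wrap folding, not rispy's actual implementation; STRICT is an assumed pre-change pattern, RELAXED is the pattern from the override above, and fold and sample are hypothetical names.

```python
import re

STRICT = r"^[A-Z][A-Z0-9]  - |^ER  -\s*$"    # assumed pattern before the change
RELAXED = r"^[A-Z][A-Z0-9]  - ?|^ER  -\s*$"  # pattern from the override above

def fold(text, pattern):
    """Toy model: split text into (tag, value) pairs, appending any
    line that does not match the tag pattern to the previous value."""
    tag_re = re.compile(pattern)
    entries = []
    for line in text.splitlines():
        if tag_re.match(line):
            # Tag is the first two characters; the value starts at column 6.
            entries.append([line[:2], line[6:].strip()])
        elif entries:
            entries[-1][1] = f"{entries[-1][1]} {line.strip()}".strip()
    return [tuple(entry) for entry in entries]

sample = "N2  - First line of abstract \nsecond line\nIS  - 9\nSP  -\nER  -"

print(dict(fold(sample, STRICT))["IS"])   # 9 SP  -
print(dict(fold(sample, RELAXED))["IS"])  # 9
print(dict(fold(sample, RELAXED))["N2"])  # First line of abstract second line
```

In this model the relaxed pattern keeps the wrapped N2 abstract intact while recognizing the empty SP tag. The remaining risk is the one noted earlier in the thread: a wrapped line that happens to begin with something shaped like a tag would be split off as a new field.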

shapiromatron commented 3 months ago

I like this idea and would accept a PR if you have time, @holub008!