eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

Fails If Depth4 Marker is not uppercase alphabetic #390

Open Efferon opened 6 years ago

Efferon commented 6 years ago

The parser's depth finding doesn't account for different outlining schemes.

See 21 CFR 113.40(a)(1)(i)(a) [https://www.gpo.gov/fdsys/pkg/CFR-2001-title21-vol2/xml/CFR-2001-title21-vol2-part113.xml] as an example of this issue.

The process that identifies the depth of each marker in the citation outline assumes that Level_1 is lowercase letters, Level_2 is numerals, Level_3 is lowercase roman numerals, and Level_4 is uppercase letters. Many of the older FDA regulations instead use italicized lowercase letters for Level_4, rather than the current standard of non-italicized uppercase letters. As a result, the parser errors out on these regulations.
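To make the failure concrete, here is a minimal sketch of scheme-based depth checking. It is not the regulations-parser code: the names `fits`, `find_scheme`, `CURRENT_SCHEME`, and `OLDER_SCHEME` are hypothetical, and it assumes the caller already knows whether each marker was italicized in the source XML. It only illustrates why a single fixed scheme rejects (a)(1)(i)(*a*) while a second, older scheme would accept it.

```python
import re

# Lowercase roman numerals (non-empty), used for the Level_3 check.
ROMAN = re.compile(
    r"(?=[ivxlcdm])m*(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})"
)

# Hypothetical scheme tables: one marker kind per depth, Level_1 first.
CURRENT_SCHEME = ["lower", "digit", "roman", "upper"]
OLDER_SCHEME = ["lower", "digit", "roman", "italic_lower"]
SCHEMES = {"current": CURRENT_SCHEME, "older_italic": OLDER_SCHEME}


def fits(marker, italic, kind):
    """Return True if a bare marker (no parentheses) is of the given kind."""
    if kind == "italic_lower":
        return italic and marker.isalpha() and marker.islower()
    if italic:
        return False  # every other kind in this sketch is non-italic
    if kind == "lower":
        return marker.isalpha() and marker.islower()
    if kind == "digit":
        return marker.isdigit()
    if kind == "roman":
        return ROMAN.fullmatch(marker) is not None
    if kind == "upper":
        return marker.isalpha() and marker.isupper()
    return False


def find_scheme(path, schemes):
    """Return the name of the first scheme that fits every marker in path.

    path is a list of (marker, italic) pairs, one per depth.
    Raises ValueError if no scheme fits, mirroring the reported failure.
    """
    for name, scheme in schemes.items():
        if len(path) <= len(scheme) and all(
            fits(marker, italic, kind)
            for (marker, italic), kind in zip(path, scheme)
        ):
            return name
    raise ValueError(f"no outline scheme matches {path!r}")


# 21 CFR 113.40(a)(1)(i)(a), with the last marker italicized.
older_path = [("a", False), ("1", False), ("i", False), ("a", True)]
```

With only `CURRENT_SCHEME` available, `find_scheme(older_path, {"current": CURRENT_SCHEME})` raises `ValueError`; with both schemes registered, the same path resolves to `"older_italic"` and the marker text stays as printed.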

A workaround is to convert the lowercase markers to uppercase in the input XML file, but technically the citation 21 CFR 113.40(a)(1)(i)(a) is not the same as 21 CFR 113.40(a)(1)(i)(A). Further, changing the citation is likely to confuse people, as it's not what was printed in the Federal Register.
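For reference, the workaround can be sketched as a small preprocessing step. This assumes the italic markers appear in the XML as `(<E T="03">a</E>)`, which should be checked against the actual file. The function name `uppercase_italic_markers` is hypothetical. It carries the drawback described above: the output no longer matches the printed citation. It is also only reasonable for parts known to use the older scheme, since it would rewrite any italic lowercase marker, including italic roman numerals at deeper levels.

```python
import re

# Assumed markup: an italic lowercase marker wrapped in <E T="03">…</E>
# inside the parentheses, e.g. (<E T="03">a</E>).
ITALIC_MARKER = re.compile(r'\(<E T="03">([a-z]+)</E>\)')


def uppercase_italic_markers(xml_text):
    """Replace italic lowercase markers with plain uppercase ones."""
    return ITALIC_MARKER.sub(
        lambda match: "(" + match.group(1).upper() + ")", xml_text
    )


sample = '<P>(<E T="03">a</E>) Example paragraph text.</P>'
converted = uppercase_italic_markers(sample)
# converted == '<P>(A) Example paragraph text.</P>'
```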