eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

37 CFR 1 - FR correction notice interpreted as modifying non-existent appendix #380

Open gregoryfoster opened 7 years ago

gregoryfoster commented 7 years ago

Dev environment: current master [ b2a4c07 ] + PR #378

To reproduce the warning:

eregs clear
eregs preprocess_notice E7-19326
eregs write_to output

This results in the output:

... regparser.notice.amendments.appendix     Could not find Appendix E7 to part 1
... regparser.notice.xml                     Unable to fetch amendments for docket E7-19326

This warning occurs when processing 72 FR 55055 amendment 1 at regparser/notice/amendments/appendix.py:31. From what I can tell, the notice is interpreted as amending a non-existent E7 appendix in 37 CFR 1. The parser appears to be deriving the appendix identifier from the FR document ID (E7-19326). Can you confirm and recommend an approach here?

cmc333333 commented 7 years ago

Hey @gregoryfoster, looks like this is being triggered by the first AMDPAR in that notice (which isn't rendering properly on the Federal Register's site). If you look at the XML, you'll see:

<AMDPAR>
In rule FR Doc. E7-16574, August 22, 2007 (72 FR 46899), make the following corrections:
</AMDPAR>

We can pop that into the amdparser to see how it's read:

In [1]: from lxml import etree

In [2]: from regparser.notice.amdparser import parse_amdpar

In [3]: parse_amdpar(etree.fromstring('<AMDPAR>In rule FR Doc. E7-16574, August 22, 2007 (72 FR 46899), make the following corrections:</AMDPAR>'), [])
Out[3]:
(<Element EREGS_INSTRUCTIONS at 0x7fc67d62af88>,
 [None, 'Appendix:E7', '16574'])

You'll notice that the second value (the resulting "context") points to section 16574 of appendix E7, which is clearly not correct. If we dig into the amdparser (specifically regparser.grammar.amdpar:appendix_section -> regparser.grammar.unified:appendix_with_section -> regparser.grammar.atomic:appendix_digit) we can see why. Under the current rules, E7 could be an appendix, and 16574 could be a section within an appendix.
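To see the ambiguity outside the parser, here's a rough sketch using Python's standard re module. The pattern and the names (LOOSE_APPENDIX_SECTION, read_as_appendix) are hypothetical stand-ins, not the actual pyparsing grammar in regparser.grammar; they only mimic a rule along the lines of "a letter with optional digits, a hyphen, then any number of digits".

```python
import re

# Hypothetical stand-in for the current rules: an appendix label
# (a capital letter plus optional digits), a hyphen, then a section
# number of unlimited length.
LOOSE_APPENDIX_SECTION = re.compile(r"\b([A-Z]\d*)-(\d+)\b")


def read_as_appendix(text):
    """Return (appendix, section) for the first match in text, or None."""
    match = LOOSE_APPENDIX_SECTION.search(text)
    return (match.group(1), match.group(2)) if match else None


# The FR document number satisfies the loose pattern.
print(read_as_appendix("In rule FR Doc. E7-16574, August 22, 2007"))
```

With a rule this permissive, the document number E7-16574 is indistinguishable from an appendix reference, which is consistent with the context shown in Out[3].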

I think the section number is the bit that makes the most sense to twiddle here; let's modify appendix_digit to only accept 1-4 character sections. While I've seen 2-digit appendix sections and can imagine 4-digit ones, 5 seems excessive. We can probably get away with a Regex parser that respects word-boundaries but only allows 1-4 characters. Let us know if that's not enough to get you started!
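
As a starting point, here's a minimal sketch of that kind of pattern using the standard re module. BOUNDED_SECTION and find_bounded_section are made-up names for illustration, and I'm assuming sections are purely numeric; the real change would go into appendix_digit in regparser.grammar.atomic, which isn't shown here.

```python
import re

# Hypothetical pattern: a section number of 1-4 digits with a word
# boundary on each side, so a longer run of digits cannot match in part.
BOUNDED_SECTION = re.compile(r"\b\d{1,4}\b")


def find_bounded_section(text):
    """Return the first standalone 1-4 digit number in text, or None."""
    match = BOUNDED_SECTION.search(text)
    return match.group(0) if match else None


print(find_bounded_section("A-30"))      # a short section still matches
print(find_bounded_section("E7-16574"))  # five digits: no match
```

The trailing word boundary is what stops the pattern from accepting just the first four digits of 16574. pyparsing's Regex class takes a pattern string, so something along these lines could presumably be dropped in, but treat that as an assumption until it's been run against the existing tests.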