eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal
36 stars 40 forks source link

37 CFR 1 - notice amendment generates unexpected instruction label #369

Open gregoryfoster opened 7 years ago

gregoryfoster commented 7 years ago

Dev environment: current Master [ 89535d0 ] + fr-notices PRs 7 & 8.

Running eregs --debug fill_with_rules 37 1 results in a failure when processing 68 FR 48286 amendment 19 at regparser/notice/amendments/appendix.py:16. This failure was previously encountered by @meriouma and a workaround implemented here.

Here's the code which fails when trying to access an invalid index in label_parts.

def content_for_appendix(instruction_xml):
    label_parts, amdpar = label_amdpar_from(instruction_xml)
    if len(label_parts) > 0 and 'Appendix' in label_parts[1]:

ipdb gives the following context:

etree.tostring(amdpar):

'<AMDPAR>19. Add a new center heading and &#167;&#167;2.200 and 2.201 to read as follows:<EREGS_INSTRUCTIONS final_context="2-?-201"><POST label="2[title]"/><POST label="2-?-200"/><POST label="2-?-201"/></EREGS_INSTRUCTIONS></AMDPAR>\n

I am not familiar enough with the FR Notice parsing code to recognize what this might be, any ideas?

cmc333333 commented 7 years ago

Hey @gregoryfoster, I just submitted #370 to fix the explosion here -- @meriouma was definitely right.

A quick note for context: even though you're parsing part 1, we use a single structure to represent notices (which may include amendments to multiple parts.) This is leading to a seemingly unrelated amendment messing up your parse -- sorry that that's not obvious!

Second, regarding the specific amendment: it's adding a "heading" in front of two sections, which sounds similar to a subpart or a subject group, but doesn't match any concept we have; we have headings within sections, but not between them. In any event, something to keep an eye on if you're planning to parse part 2 later.

gregoryfoster commented 7 years ago

Thanks for your quick fix, @cmc333333. There's an identical problem for interpretations which I've fixed in PR #371, though please see my comment about my accidental inclusion of your commit from #370.

Unfortunately, I'm continuing to have trouble with this particular amendment for the reason you anticipated in your comment: headings between sections. Aside, this notice (68 FR 48286) repeats that same instruction pattern in amendment 20.

@meriouma identified this error and coded a workaround here. However, I've begun to feel like we're kicking the can down the road as the last three errors in 37 CFR 1 are caused by this particular notice amendment pattern.

After applying PRs #370 & #371, the parser stops on create_xml_changes(amended_labels, section, notice_changes) in regparser/notice/amendments/fetch.py:121

ipdb> print amended_labels
[(POST, ['2']), (POST, ['2', '200']), (POST, ['2', '201'])]
ipdb> print section
None
ipdb> print notice_changes
<regparser.notice.changes.NoticeChanges object at 0x10a74add0>

section being None is a problem, and that traces up to fetch_amendments(notice_xml) in regparser/notice/amendments/fetch.py:76, where the cache presents the following content object for the amendment:

ipdb> print content
Content(struct=None, amends=[(POST, ['2']), (POST, ['2', '200']), (POST, ['2', '201'])])

My interpretation is the parser isn't sure where to place this first instruction (the title between sections). What's your read, and where do you think we should go from here?

cmc333333 commented 7 years ago

Hey @gregoryfoster, sorry for the delay!

As a quick reproduction recipe, we can run

eregs clear
eregs preprocess_notice 03-20489
eregs write_to somewhere

You'll see the error in the last step; when we added the notice-and-comment feature, we began including the list of changes in the output JSON for each notice. Unfortunately, this increased the chance that misparses of an unrelated regulation part might cause yours to explode. I think the most immediate path forward is to wrap that step in a try-catch (with a warning); this field isn't used unless we're displaying a notice during the N&C period.

Accounting for the specific error's a bit more tricky as that header is causing two issues. When we try to piece together what changes amendments are making, we effectively take two passes; the header causes an issue with each.

In the first phase (a preprocessing plugin), the AMDParser looks at the amendment instructions (AMDPARs) and tries to derive a more machine-readable list of changes. It then inserts its analysis into the XML metadata (so that we aren't modifying the actual amendment text). This process is trying to derive meaning from natural language, so it's rather error prone. In this case, it tokenized amendment 20 (of 03-20489, "Add a new center heading and §§2.206 through 2.209 to read as follows:") into

Combined with contextual information those tokens lead to instructions to create a new heading for all of part 2 and create new sections 2.206 and 2.209. This is clearly not what we wanted -- not only should the heading not apply to the whole of 37 CFR 2, we're not accounting for the "through", which would imply the creation of 2.207 and 2.209 as well. So, we'd want to change the AMDParser's logic (and potentially grammar) to account for a situation like this. I'll note here that the preprocessing step will skip any amendments which already have EREGS_INSTRUCTIONS -- it's therefore possible that one could manually add the XML to get the right effect.

The second phase involves finding XML for specific instructions (i.e. the "contents" of the change rather than the instruction). The error you're seeing indicates that the parser's not finding the appropriate XML for the 2, 2.206, nor 2.209 instructions. As noted above, the first is understandable (and should probably be fixed in the AMDParsing stage) but the latter two are likely a bug with the logic that determines which XML section to grab. It looks like this logic wouldn't find the sections due to that extra header. That should be a pretty small change, I'd think.

Hope all the context helps! Let us know if we can help.