eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

37 CFR 1 § 1.445 - 2012 annual edition introduces unusual tables #368

Open gregoryfoster opened 7 years ago

gregoryfoster commented 7 years ago

Dev environment: current Master [ 0c650cda ] + PR #367 [ a389c4b7 ].

Running: eregs --debug annual_editions 37 1 results in an error when processing § 1.445 of the 2012 annual edition: https://www.gpo.gov/fdsys/pkg/CFR-2012-title37-vol1/xml/CFR-2012-title37-vol1-part1.xml#seqnum1.445

a.1.i is parsed as:

Node(text=u'||\n||\n|(i) A basic portion|$240.00|', children=[], label=['MARKERLESS'], title=None, node_type=u'regtext')

and the section formerly titled a.1.iii is parsed as:

Node(text=u'||\n||\n|By a small entity (\xa7 1.27(a))|$200.00|\n|By other than a small entity|$400.00|', children=[], label=['MARKERLESS'], title=None, node_type=u'regtext')

In both cases, the table formatting appears to be confusing some aspect of the parser, resulting in labeling both sections MARKERLESS. The absence of the a.1.i marker raises an error at regparser/tree/depth/heuristics.py:47.

2017-04-05 17:13:54 regparser.tree.xml_parser.paragraph_processor Could not derive paragraph depths. Retrying with relaxed constraints.
2017-04-05 17:13:54 regparser.tree.xml_parser.paragraph_processor Could not determine paragraph depths (<SECTION /> 1-445):
a
    1
        MARKERLESS
?? ii
Remaining markers: ['MARKERLESS', '2', '3', '4', 'b']
> /Volumes/O-RANGE/organizations/markup/eregulations/regulations-parser/regparser/tree/depth/heuristics.py(47)prefer_shallow_depths()
     46     # Smallest maximum depth across solutions
---> 47     min_max_depth = min(max(p.depth for p in s.assignment) for s in solutions)
     48     max_max_depth = max(p.depth for s in solutions for p in s.assignment)

cmc333333 commented 7 years ago

@gregoryfoster noted that this gets even more hairy in the 2013 version: https://www.gpo.gov/fdsys/pkg/CFR-2013-title37-vol1/xml/CFR-2013-title37-vol1-part1.xml#seqnum1.445

I think the ideal outcome in the 2013 version would be depths like:

(a)
    (1)
        (i)
            (A)
                Table
            (B)
        (ii)
            Table
    (2)
        (i)
            Table
        (ii)
    (3)
        (i)
            Table
        (ii)
    (4)
(b)

That's possible to implement now by modifying the XML, but it'll be a slog.

One potential alternative. Right now, we:

  1. try to derive depths;
  2. if that fails, loosen the requirements and try again;
  3. if that also fails, print the situation and explode.

Instead, we could:

  1. try to derive depths;
  2. if that fails, loosen the requirements and try again;
  3. if that also fails, print the situation and complain loudly;
  4. set each node's marker to MARKERLESS (to avoid conflicting paragraphs);
  5. derive depths again (they'll all be on the same level now).

The output would be a non-indented section. The major downside is that a non-indented section will likely confuse diffs and will certainly cause issues when applying changes (final rules). Flattening the paragraphs could get a user over a hump, but may leave them more confused when things don't work correctly later.
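A minimal sketch of that fallback flow, for illustration only. The `derive_depths` callable is a hypothetical stand-in for the parser's real depth derivation (assumed here to return a list of depths, or `None` on failure); it is not the regulations-parser API.

```python
import logging

MARKERLESS = "MARKERLESS"
log = logging.getLogger(__name__)


def derive_with_fallback(markers, derive_depths):
    """Return (markers, depths), flattening the section as a last resort.

    `derive_depths(markers, relaxed=bool)` is a hypothetical callable that
    returns a list of depths, or None when no assignment can be found.
    """
    # Steps 1 and 2: strict constraints first, then relaxed ones.
    for relaxed in (False, True):
        depths = derive_depths(markers, relaxed=relaxed)
        if depths is not None:
            return list(markers), depths

    # Step 3: complain loudly instead of raising.
    log.error("Could not determine paragraph depths for %s; flattening", markers)

    # Steps 4 and 5: with every marker set to MARKERLESS nothing can
    # conflict, so all paragraphs end up on a single level (depth 0 here).
    flattened = [MARKERLESS] * len(markers)
    return flattened, [0] * len(flattened)
```

Note that the flattened result discards the original markers, which is exactly why later diffs and final-rule amendments would struggle to locate paragraphs.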

gregoryfoster commented 7 years ago

Thanks for documenting this, @cmc333333. Technically speaking, would you say the XML is accurate but the placement of markers within GPOTABLE objects is unexpected? For example, here's the 2012 § 1.445 a.1 table which includes a row with marker i:

<P>(1) A transmittal fee (see 35 U.S.C. 361(d) and PCT Rule 14) consisting of:</P>
<GPOTABLE CDEF="s30,8" COLS="2" OPTS="L0,tp0,p1,8/9,g1,t1">
  <ROW>
    <ENT I="01">(i) A basic portion</ENT>
    <ENT>$240.00</ENT>
  </ROW>
</GPOTABLE>

The 2013 annual edition expands on this precedent of tables containing markers (but not on every row!). Assuming that we need to modify the regulations-parser infrastructure to handle these circumstances, how would you approach this challenge?

cmc333333 commented 7 years ago

I'd argue that the XML isn't accurate -- in this case, I think it should be:

<P>(1) A transmittal fee (see 35 U.S.C. 361(d) and PCT Rule 14) consisting of:</P>
<P>(i)</P>
<GPOTABLE CDEF="s30,8" COLS="2" OPTS="L0,tp0,p1,8/9,g1,t1">
  <ROW>
    <ENT I="01">A basic portion</ENT>
    <ENT>$240.00</ENT>
  </ROW>
</GPOTABLE>

I see a few options for how to proceed:

  1. Fix the XML manually. This may not be sustainable, but is easiest.
  2. Fix the XML automatically. If we can define rules to account for these types of edits, we can apply them as a preprocessing step to the XML. This might not be feasible, particularly when considering the 2013 edition (which would appear to have legitimate errors in which § 1.445(a)(1)(i)(B), (2)(ii), and (3)(ii) have dollar amounts).
  3. Adjust the parser to have that fallback mode. I'm leaning against this one because it'll lead to confusing errors later down the road (when compiling regulation versions).

So, 1) will work, but is pretty ham-fisted. 2) may work, but would require some very careful consideration. 3) may work, but seems like the wrong direction to me.
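As a rough illustration of what option 2 might look like for the simple 2012 case, here is a sketch using only the standard library. The function name `hoist_table_markers` and the marker regex are assumptions made for this example; it handles only a marker at the start of a table's first cell and would not cope with the 2013 edition, where markers appear on some later rows too.

```python
import re
import xml.etree.ElementTree as ET

# Assumed marker shape: "(i)", "(A)", "(1)" at the very start of the cell.
MARKER_RE = re.compile(r"^\(([a-z]+|[A-Z]+|[0-9]+)\)\s*")


def hoist_table_markers(root):
    """Move a leading marker out of each GPOTABLE into a preceding <P>."""
    # Snapshot the tree first so inserted elements are not revisited.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "GPOTABLE":
                continue
            first_ent = child.find("ROW/ENT")
            if first_ent is None or not first_ent.text:
                continue
            match = MARKER_RE.match(first_ent.text)
            if not match:
                continue
            # Emit <P>(i)</P> immediately before the table...
            marker_p = ET.Element("P")
            marker_p.text = "({})".format(match.group(1))
            parent.insert(list(parent).index(child), marker_p)
            # ...and strip the marker from the cell text.
            first_ent.text = first_ent.text[match.end():]
    return root
```

Applied to the 2012 snippet above, this would yield the `<P>(i)</P>` followed by a table whose first cell reads "A basic portion".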

gregoryfoster commented 7 years ago

I'd be up for modifying the source XML, but further research makes me think that may not be the right path.

There are earlier precedents in 37 CFR 1 of usage of GPOTABLES which contain regulation markers. Here's § 1.19 from the 2001 annual edition: https://www.gpo.gov/fdsys/pkg/CFR-2001-title37-vol1/xml/CFR-2001-title37-vol1-part1.xml#seqnum1.19

I'm guessing that no earlier tables caused an issue for the parser because their rows contain regulation markers at the same depth or deeper. These nodes would be flagged as MARKERLESS and therefore not cause a problem in relation to surrounding definable regulation marker nodes. It was only the circumstance of § 1.445 (a)(1)(i) inside a table vs. (a)(1)(ii) outside that table which raised a depth traversal issue.

I'm tentatively interpreting this to mean that other examples of tables containing regulation markers are not being captured by the parser, but instead interpreted as flattened MARKERLESS nodes. There are a lot of tables like this in 37 CFR 1, so I'm not sure overriding the source XML is going to be a viable approach.
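To get a feel for how many tables are affected, something like the following could scan a part's XML and list the tables whose cells start with a marker. This is a sketch under assumptions: the function name is made up, the marker regex is a guess at the common shapes, and it assumes each `SECTION` carries its number in a `SECTNO` child.

```python
import re
import xml.etree.ElementTree as ET

# Assumed marker shape: "(i)", "(A)", "(1)" at the start of a cell.
MARKER_RE = re.compile(r"^\(([a-z]+|[A-Z]+|[0-9]+)\)")


def tables_with_markers(root):
    """Return (section number, markers) for each GPOTABLE containing markers."""
    report = []
    for section in root.iter("SECTION"):
        # Assumption: the section number lives in a SECTNO child element.
        sectno = section.findtext("SECTNO", default="?")
        for table in section.iter("GPOTABLE"):
            markers = []
            for ent in table.iter("ENT"):
                match = MARKER_RE.match((ent.text or "").lstrip())
                if match:
                    markers.append(match.group(1))
            if markers:
                report.append((sectno, markers))
    return report
```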

Is there another path? One which accepts that GPOTABLES can contain regulation markers in practice, and therefore attempts to parse regulation markers from within those rows?

cmc333333 commented 7 years ago

You can think of the nodes as rendering like a bunch of lis within ols. Using the XML I suggested, we'd see a tree like:

1-445
    1-445-a
        1-445-a-1
            1-445-a-1-i
                1-445-a-1-i-p1 (markerless table)
            1-445-a-1-ii

rendered as:

<ol>
    <li><p>(a) ....</p><ol>
        <li><p>(1) ....</p><ol>
            <li><p>(i) ....</p><ol>
                <li><table /></li>
            </ol></li>
            <li><p>(ii) ....</p></li>
        </ol></li>
    </ol></li>
</ol>

I think the first step is to think about what the ideal markup would be and try to work back to what that'd require in terms of parsing. In this scenario, I think the ideal markup looks like the above, which leads me to support modifying the XML to match.

Here are some more thoughts off the top of my head:

Option 4: Hypothetically, we could allow some special logic around tables so that nodes within a table rendered as rows (or something), though that'd be a pretty heavy rework. Consider how the current XML would need to be parsed:

1-445
    1-445-a
        1-445-a-1
            1-445-a-1-p1 (markerless table)
                1-445-a-1-p1-i (the first row)
                1-445-a-1-p1-p2 (the second row)
            1-445-a-1-ii

to become

<ol>
    <li><p>(a) ....</p><ol>
        <li><p>(1) ....</p><ol>
            <li><table>
                <tr><td>(i) ... </td><td>...</td></tr>
                <tr><td> ... </td><td>...</td></tr>
            </table></li>
            <li><p>(ii) ....</p></li>
        </ol></li>
    </ol></li>
</ol>

I'm not sure how we'd handle additional paragraph depths -- we'd want (a)(1)(i)(A) to be within (a)(1)(i), but if it's a separate row in the table, that won't be possible. Unfortunately, this approach doesn't directly resolve the depth derivation issue, either, as we'd be mixing paragraphs with markers and without on the same level and using (i) at one depth, but (ii) at a different depth. I think this approach matches the XML pretty well, but doesn't match the structure of the regulation.

Option 5: When we go to derive the depths, we could do some sort of deep inspection of the contents of the table and virtually "expand" the table to include all of the markers it contains.

1-445
    1-445-a
        1-445-a-1
            1-445-a-1-p1 (virtually containing 1-445-a-1-i, potentially others)
            1-445-a-1-ii

This would get over the depth-derivation hump, but wouldn't be a complete solution. Consider what happens when we want to reference 1-445-a-1-i (e.g. in a citation): that node doesn't exist in the tree. We could add the same "virtual" searching logic, but we'll quickly be re-implementing node trees in this "virtual" space, which seems to defeat the point.
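A sketch of the "virtual expansion" idea, only for the marker list that would be fed to depth derivation. The names `virtual_markers` and `table_markers` are hypothetical, and the marker regex is an assumption; the real parser's paragraph handling is more involved than this.

```python
import re
import xml.etree.ElementTree as ET

MARKERLESS = "MARKERLESS"
# Assumed marker shape: "(i)", "(A)", "(1)" at the start of the text.
MARKER_RE = re.compile(r"^\(([a-z]+|[A-Z]+|[0-9]+)\)")


def table_markers(table):
    """Collect the markers that open any cell of the table, in order."""
    found = []
    for ent in table.iter("ENT"):
        match = MARKER_RE.match((ent.text or "").lstrip())
        if match:
            found.append(match.group(1))
    return found


def virtual_markers(section):
    """Marker sequence for depth derivation, with tables expanded in place."""
    markers = []
    for child in section:
        if child.tag == "P":
            match = MARKER_RE.match((child.text or "").lstrip())
            markers.append(match.group(1) if match else MARKERLESS)
        elif child.tag == "GPOTABLE":
            # A table stands in for the markers it contains; a table with
            # no markers stays a single MARKERLESS entry.
            markers.extend(table_markers(child) or [MARKERLESS])
    return markers
```

This only changes what the depth solver sees. The resulting tree would still hold a single table node, so the citation problem described above remains.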

Option 6: We could try to split the single table into multiple nodes. This is very similar to the preprocessing logic proposed in option 2 and would carry the same risks -- the difference is where the logic is placed (is it a preprocessing step, or is it just part of how GPOTABLEs are parsed?). This (like option 2) has the benefit of being automated, but I suspect the logic will be rather complex.
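To make option 6 concrete, here is a hedged sketch that splits one table into several, starting a new table at each row whose first cell opens with a marker. `split_table` is a hypothetical helper, the regex is an assumption, and any non-`ROW` children of the table (such as headers) are ignored for brevity.

```python
import copy
import re
import xml.etree.ElementTree as ET

# Assumed marker shape: "(i)", "(A)", "(1)" at the start of the first cell.
MARKER_RE = re.compile(r"^\(([a-z]+|[A-Z]+|[0-9]+)\)")


def split_table(table):
    """Split a GPOTABLE into (marker, table) chunks at marker-bearing rows.

    Rows before the first marker are grouped under a marker of None.
    """
    chunks = []
    for row in table.findall("ROW"):
        first_ent = row.find("ENT")
        text = (first_ent.text or "") if first_ent is not None else ""
        match = MARKER_RE.match(text.lstrip())
        if match or not chunks:
            # Start a new table that keeps the original table's attributes.
            new_table = ET.Element("GPOTABLE", dict(table.attrib))
            chunks.append((match.group(1) if match else None, new_table))
        # Copy the row so the source table is left untouched.
        chunks[-1][1].append(copy.deepcopy(row))
    return chunks
```

Each chunk could then become its own node labelled with the marker. Deciding the depth of those nodes, and what to do with rows like (a)(1)(i)(B) that carry dollar amounts, is where the real complexity would sit.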

gregoryfoster commented 7 years ago

Thanks for engaging on this one, @cmc333333. After some thought this week, I think we've surfaced two separate issues:

  1. eregs/regulations-parser does not currently process regulatory markers within GPOTABLE objects. Fixing this will be a bear.
  2. There is a rare anti-pattern of regulatory markers within and outside of GPOTABLE objects which alerted us to the first issue.

I decided to try fixing the second issue to see how many instances we'd encounter. I fixed the first identified anti-pattern (2012 annual edition 37 CFR 1 § 1.445) by locally overriding the file with the ideal XML you outlined in an earlier comment on this issue. The parser then identified a similar issue in the 2013 annual edition 37 CFR 1 § 1.17 (see marker (j)) which was easier to fix by pushing the marker into an adjoining GPOTABLE. I've bundled those changes into a PR for the fr-notices project. Since we're already overriding those annual editions, this is an expeditious choice.

I suggest we keep this issue open, but change its focus to the big issue of handling GPOTABLE objects with regulatory markers in them.

gregoryfoster commented 7 years ago

I've opened dialogue with the USPTO's Office of Patent Legal Administration to get the errors we've been surfacing in the GPO's XML versions of 37 CFR 1 fixed. I've been directed to the USPTO's online version of their regulations, the Manual of Patent Examining Procedure (MPEP), Appendix R, which appears to be the primary source for the agency (only current revision in HTML and PDF).

It appears that some errors like the current issue are cropping up during a transformation of the MPEP source documents into the XML format deployed by GPO. If you look at 37 CFR 1 § 1.445 in the MPEP, you'll see the tables appear offset and do not include the regulation markers. Here's the GPO's current version (you'll have to find/scroll as anchors are useless in the current version - unless you know how to URL encode a thin space character?).

Does 18F have any insight into the transformation of source documents between the USPTO and GPO?

cmc333333 commented 7 years ago

Hey @gregoryfoster, I'd doubt that MPEP is the source for the GPO XML. From what we've seen, agencies send over Word docs to indicate amendments to regulations -- they don't send over whole regulations unless the whole part is being replaced.

It's possible that the transform runs the other way (where the MPEP is downstream from the GPO) and the result is then modified further by hand. This is the route CFPB's taken -- the "original" content from the GPO has been transformed (with a mix of automation and manual work) into a new document, which is then maintained separately. Of course, having this separate document means that their regulations may not always match the GPO (which carries more legal weight, if I understand correctly).