eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal
37 stars 39 forks source link

[WIP] Merge spans longer than 3 reserved sections #300

Closed vrajmohan closed 8 years ago

vrajmohan commented 8 years ago

Identify these spans by the starting section number. Update the section title to indicate the sections in the span.

vrajmohan commented 8 years ago

Questions:

  1. Is there ever a span of sections that are not RESERVED?
  2. Should the span indicator be explicit rather than implicit in the title?
cmc333333 commented 8 years ago

This looks good!

Looks like test failures are integration tests for FEC and EPA, which probably both have runs of > 3 reserved sections. Would you mind verifying the data on FEC and I'll take a look at EPA?

  1. Is there ever a span of sections that are not RESERVED?

Good question -- we haven't seen any that weren't reserved across the four eRegs agencies, so I think that's a safe bet now.

  1. Should the span indicator be explicit rather than implicit in the title?

I don't follow; would you mind giving an example?

vrajmohan commented 8 years ago

I'll verify the integration test failures for FEC.

vrajmohan commented 8 years ago

Should the span indicator be explicit rather than implicit in the title?

In https://github.com/eregs/regulations-parser/blob/767aa18e7f5d8297c6ddc268ab079d8f37dcc09a/regparser/tree/xml_parser/reg_text.py#L256, we already know that we are dealing with sections and/or section spans. We don't indicate this information explicitly in the persisted output. It is implicit in the label and the title. In https://github.com/eregs/regulations-site/blob/23fd2ab001782a9654e26496aa96870d16b71614/regulations/generator/title_parsing.py#L40, we extract this information again using similar logic.

cmc333333 commented 8 years ago

@vrajmohan gotcha. I think they're working slightly different angles. The build_from_section logic is meant to only concern itself with the creation of Node objects, so it cares about spans of sections only to the extent that it needs to create multiple Nodes or not. We're doing a bit of reformatting there, but I think that's more out of convenience than cleanliness.

I don't think that section-spanning data is very important to Nodes; but it is important to the table-of-contents layer. I'd argue that it should be explicit there. Unfortunately the relevant logic is actually within -site :(. We run into a similar problem in the N&C code -- it'd be ideal if the parser was doing the parsing rather than the UI.

cmc333333 commented 8 years ago

I can confirm that the only changes to EPA are expected -- several runs of sections are merged into a single "reserved" section.

vrajmohan commented 8 years ago

Likewise, I can confirm that the only changes to FEC are expected -- several runs of sections are merged into a single "reserved" section.

cmc333333 commented 8 years ago

Looks good, test failures make sense, and @vrajmohan assures this is no longer a WIP