Does regulations-parser need to validate document numbers via the FR?

konklone commented 10 years ago

As documented in this gist, I'm trying to use regulations-parser on a piece of the FAR, 48 CFR 301. The "gist" of the issue is that FederalRegister.gov hasn't correctly parsed the CFR references out of 48 CFR 301's controlling rule, E9-26948, so build_from.py only sees the previous, since-replaced rule from 2006.

I reported this to the FederalRegister.gov devs, but does regulations-parser need to validate the document number I provide over the command line to proceed?

cmc333333 commented 10 years ago

That's an interesting issue; unfortunately the FR data isn't always correct.

Right now, the parser requires a valid document number (valid in the sense that it can be retrieved from the FR's API) so that it can build up a history of notices. One hacky solution would be to test it out using the only notice that can be found (E6-21505), just to get you over the hump. A solution that we might try in code is to load the provided document number directly, rather than searching through the notices returned by a title/part search.
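To illustrate the "load the document number directly" idea, here is a minimal sketch, not the parser's actual code. It assumes the FederalRegister.gov v1 endpoint for a single document and that the returned JSON carries a cfr_references list of title/part pairs; the helper names (notice_url, fetch_notice, mentions_part) are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed base URL for single-document lookups on the FR API.
FR_API = "https://www.federalregister.gov/api/v1/documents/"


def notice_url(document_number):
    """Build the URL for one notice, skipping the title/part search."""
    return "{}{}.json".format(FR_API, document_number)


def fetch_notice(document_number):
    """Fetch the notice JSON directly by its document number."""
    with urlopen(notice_url(document_number)) as response:
        return json.load(response)


def mentions_part(notice, title, part):
    """Check whether the notice's CFR references include title/part.

    Assumes a 'cfr_references' key holding dicts with 'title' and
    'part'; a missing or empty list simply yields False.
    """
    refs = notice.get("cfr_references") or []
    return any(
        str(ref.get("title")) == str(title) and str(ref.get("part")) == str(part)
        for ref in refs
    )
```

With something like this, a notice such as E9-26948 could be fetched even when its CFR references are missing, and mentions_part would only be used to warn rather than to refuse to proceed.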

As a personal aside, I'm super glad you're trying this out! We haven't tried it on any HHS regs; please let us know if (read when) you run into other errors ;)

konklone commented 10 years ago

I can work around it with the 2006 version in the interim; hopefully the FR fixes their parsing issue.

This is probably a separate ticket, but the main issue for me is that my real goal, the FAR, is an entire Title, not a Part. Each agency has a whole Chapter, which has many Parts. So the manual introductory step of copy/pasting from the eCFR isn't really workable. A scraping step that automatically fetches the eCFR and does that work would help a lot. Unfortunately, that might require some initial scraping of the tables of contents to establish the hierarchy, because the eCFR URLs don't seem predictable by absolute numbers - the URL for 48 CFR 301 is http://www.ecfr.gov/cgi-bin/text-idx?node=48:4.0.1.1.1&rgn=div5. :/

konklone commented 10 years ago

Proceeding with the workaround, providing the E6 document number, I get this:

Traceback (most recent call last):
  File "build_from.py", line 80, in <module>
    for v in reader.regversions(cfr_part)['versions']
  File "/home/eric/sunlight/regulations-parser/regparser/diff/api_reader.py", line 22, in regversions
    return self._get("regulation/%s" % label)
  File "/home/eric/sunlight/regulations-parser/regparser/diff/api_reader.py", line 46, in _get
    f = open(self.base_url + suffix)
IOError: [Errno 2] No such file or directory: 'regulation/301/index.html'

In the generated regulation/301 directory, there is a file named E6-21505 that contains some metadata about that document, but no index.html.

Also, there's a separate bug where api_reader.py isn't taking the OUTPUT_DIR into account. If I try overriding the OUTPUT_DIR in local_settings.py with OUTPUT_DIR = 'far/', I get this:

Traceback (most recent call last):
  File "build_from.py", line 80, in <module>
    for v in reader.regversions(cfr_part)['versions']
  File "/home/eric/sunlight/regulations-parser/regparser/diff/api_reader.py", line 22, in regversions
    return self._get("regulation/%s" % label)
  File "/home/eric/sunlight/regulations-parser/regparser/diff/api_reader.py", line 46, in _get
    f = open(self.base_url + suffix)
IOError: [Errno 2] No such file or directory: 'regulation/301'

When I add some debug output, self.base_url is still an empty string (I haven't overridden API_BASE), and suffix is "regulation/301".
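A rough sketch of why the path comes out relative, and one way the output directory could be taken into account. This is an assumption about a possible fix, not what api_reader.py does; resolve_path and its output_dir parameter are hypothetical.

```python
import os


def resolve_path(base_url, suffix, output_dir=""):
    """Return the location to read for a given API suffix.

    With an empty base_url, base_url + suffix is just the bare
    suffix, so the open() call looks in the current directory.
    Joining with output_dir (hypothetical parameter) points it at
    the files the parser wrote instead.
    """
    if base_url:
        return base_url + suffix
    return os.path.join(output_dir, suffix)


# Mirrors the traceback: empty base_url and no output directory.
print(resolve_path("", "regulation/301"))
# With OUTPUT_DIR = 'far/' taken into account.
print(resolve_path("", "regulation/301", "far/"))
```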

cmc333333 commented 10 years ago

Just made a pull request (#118) that fixes this... sort of. The parser is trying to calculate diffs between the version parsed and any previous versions of the regulation; to do that, it's trying to read from an API. Worse, calculating diffs couldn't be turned off without that pull request. The former problem, trying to read a listing of regulations from a newly populated, static API, definitely warrants a ticket.

If you apply #118 (and use false as the final parameter to build_from.py), you will get rid of the warning, but it won't change the data at all. The parsed regulation (as JSON) is in regulation/301/E6-21505 (or far/regulation/301/E6-21505), with additional directories for layers and read notices.

Generally, the parser flips out when there is an error (rather than generating a tree with missing data), but if the regulation JSON isn't what you expected, would you mind posting 301-hhs-far.txt somewhere that we can take a look?

konklone commented 10 years ago

Yeah, I applied your PR and re-ran it, output looks identical either way. I put the .txt file and some resulting JSON output in this gist, so you can see what I'm getting. I'm not personally sure if it looks right.

cmc333333 commented 10 years ago

Hoo boy. That has all sorts of things the parser doesn't expect! Luckily, it doesn't fall completely on its face if you tweak the reg text a wee bit. This gist removes all of the "return arrow" lines and adds section markers.

Using the tweaked reg, the parser builds an okay-looking tree for me. It's got some issues, but may work in your use case. The biggest issue I see is that section markers with dashes are quite unexpected, so you'll see multiple nodes with the same "label". This wouldn't work for the API (it assumes uniqueness), but may work if you're just using the JSON. We also make the assumption that subparts are lettered, so the subpart nodes aren't being generated correctly.
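For checking how many labels collide in the generated JSON, something like the following sketch might help. It assumes each node is a dict with a "label" list and a "children" list; that shape, and the function names, are assumptions for illustration rather than a documented format.

```python
from collections import Counter


def iter_labels(node):
    """Yield the dash-joined label of a node and of all its descendants."""
    yield "-".join(node.get("label", []))
    for child in node.get("children", []):
        yield from iter_labels(child)


def duplicate_labels(tree):
    """Return the sorted labels that appear on more than one node."""
    counts = Counter(iter_labels(tree))
    return sorted(label for label, seen in counts.items() if seen > 1)


# Made-up tree: two sibling sections ended up with the same label.
tree = {
    "label": ["301"],
    "children": [
        {"label": ["301", "101"], "children": []},
        {"label": ["301", "101"], "children": []},
        {"label": ["301", "103"], "children": []},
    ],
}
print(duplicate_labels(tree))
```

An empty result would suggest the labels are unique enough for the API; anything listed points at the sections whose markers confused the parser.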

konklone commented 10 years ago

It feels like a good first step to making this parser more broadly applicable is to add a step where you provide the Title (and/or Chapter, and/or Part) of the CFR, and it extracts the appropriate text from the eCFR, removes the "return arrow" stuff, etc. Would that be a useful contribution?

cfpb / regulations-parser

Does regulations-parser need to validate document numbers via the FR? #117