eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

Error when meta['full_text_xml_url'] is missing #355

Closed meriouma closed 7 years ago

meriouma commented 7 years ago

I tried running eregs clear && eregs pipeline 37 1 http://localhost:8080 and ended up with this stack trace:

  File "regulations-parser/regparser/commands/preprocess_notice.py", line 61, in preprocess_notice
    notice_xmls = list(notice_xmls_for_url(meta['full_text_xml_url']))
  File "regulations-parser/regparser/notice/xml.py", line 446, in notice_xmls_for_url
    local_notices = local_copies(notice_url)
  File "regulations-parser/regparser/notice/xml.py", line 423, in local_copies
    parsed_url = urlparse(url)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 182, in urlsplit
    i = url.find(':')
AttributeError: 'NoneType' object has no attribute 'find'

I've tracked it down to meta['full_text_xml_url'] being None. The log entry just before the failure is "Attempting to resolve dependency: .eregs_index/notice_xml/95-6377".

Should I try to skip that notice?

cmc333333 commented 7 years ago

Unfortunately, you're running into #186 -- the Federal Register hasn't converted many of their pre-2000 documents into XML yet. We should probably be issuing a warning and skipping these.
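
A minimal sketch of the warn-and-skip idea, assuming the notice metadata is a plain dict as in the traceback above. The helper name xml_url_for is hypothetical and is not part of regparser; it only shows the shape of the guard that would sit in front of notice_xmls_for_url.

```python
import logging

logger = logging.getLogger(__name__)


def xml_url_for(meta):
    """Return the notice's XML URL, or None (with a warning) if it has none.

    Hypothetical helper: `meta` is assumed to be the dict of Federal Register
    metadata that preprocess_notice already reads 'full_text_xml_url' from.
    """
    url = meta.get('full_text_xml_url')
    if not url:
        # Warn and let the caller skip, instead of passing None to urlparse.
        logger.warning("No XML available for notice %s; skipping",
                       meta.get('document_number', 'unknown'))
        return None
    return url
```

The caller would then check the result before fetching, for example `url = xml_url_for(meta)` followed by `if url is not None: ...`, so a notice such as 95-6377 is logged and skipped rather than raising AttributeError.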

If you're up for it, I think the best place to do this would be when we generate the initial list of versions from the Federal Register's API. We can probably request the full_text_xml_url field (if we don't already) and filter the results to include only notices where that's present. I think that logic would be somewhere inside this function: https://github.com/eregs/regulations-parser/blob/47c8648fb0ea1f19bba40dcab327b7c98db78275/regparser/commands/versions.py#L15
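
Roughly, the filtering described above could look like the sketch below. The function names (build_query, only_with_xml) are made up for illustration and do not come from versions.py. The query parameter names reflect my understanding of the Federal Register API and should be checked against its documentation. The second sample record and its URL are invented placeholders.

```python
from urllib.parse import urlencode


def build_query(cfr_title, cfr_part):
    """Query parameters asking the API to include full_text_xml_url.

    Assumption: the API accepts repeated 'fields[]' parameters and
    'conditions[cfr][...]' filters; verify before relying on this.
    """
    return [
        ('conditions[cfr][title]', str(cfr_title)),
        ('conditions[cfr][part]', str(cfr_part)),
        ('fields[]', 'document_number'),
        ('fields[]', 'full_text_xml_url'),
    ]


def only_with_xml(results):
    """Keep only the notices that actually have an XML URL."""
    return [r for r in results if r.get('full_text_xml_url')]


# Sample data: 95-6377 is the notice from this issue; the other is invented.
sample_results = [
    {'document_number': '95-6377', 'full_text_xml_url': None},
    {'document_number': '2015-00001',
     'full_text_xml_url': 'https://example.invalid/2015-00001.xml'},
]

query_string = urlencode(build_query(37, 1))
usable = only_with_xml(sample_results)
```

Filtering at this stage means later steps never see a notice without XML, so the None never reaches urlparse.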

meriouma commented 7 years ago

Ah thanks for the pointer, I didn't look for related issues. I'll try to see what I can achieve!

meriouma commented 7 years ago

Here's what I tried. I numbered my changes 1 through 13 in comments in the commit, in the order I made them. It no longer raises errors, but it seems to end up in an infinite loop. I guess my changes produced inconsistent data. https://github.com/navigo/regulations-parser/commit/881df6fff5d561831bd5ce815fbd41ceb04323a1

gregoryfoster commented 7 years ago

It looks like @cmc333333 addressed the initial subject of this issue in commit [7bb4092], which was merged into eregs/regulations-parser master as part of #186. Shall we close this issue and create new issues to track the additional challenges @meriouma has identified in attempting to parse 37 CFR Part 1? Is there a protocol eregs likes to follow in this circumstance?

cmc333333 commented 7 years ago

I'd be for closing this issue and creating another for the other 37 CFR 1 issues. That said, there's not a whole lot of traffic in this repo given that the 18F team is focused on OMB; having an extra issue doesn't hurt :)

cmc333333 commented 7 years ago

@gregoryfoster's been using other issues to track 37 CFR 1 errors, so I'm going to close this one.