eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

10CFR50 parsing failures #185

Closed sixlettervariables closed 7 years ago

sixlettervariables commented 8 years ago

I'm working on getting 10 CFR 50 into the eRegs format and ran into failures during parsing (not unheard of, I'm sure). However, I'm stuck as to where the problem may lie based on the errors I've received.

I began with the pipeline command:

X:\regulations-parser>py eregs.py pipeline 10 50 outdir
Attempting to resolve dependency: .eregs_index\notice_xml\95-17723
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
X:\regulations-parser\eregs.py in <module>()
     53
     54 if __name__ == '__main__':
---> 55     main()

X:\regulations-parser\eregs.py in main(prev_dependency)
     48         else:
     49             click.echo("Attempting to resolve dependency: " + e.dependency)
---> 50             resolvers[0].resolution()
     51             main(e.dependency)
     52

X:\regulations-parser\regparser\commands\preprocess_notice.pyc in resolution(self)
     49     def resolution(self):
     50         args = [self.match.group('doc_number')]
---> 51         return preprocess_notice.main(args, standalone_mode=False)

C:\Python27\lib\site-packages\click\core.pyc in main(self, args, prog_name, complete_var, standalone_mode, **extra)
    694             try:
    695                 with self.make_context(prog_name, args, **extra) as ctx:
--> 696                     rv = self.invoke(ctx)
    697                     if not standalone_mode:
    698                         return rv

C:\Python27\lib\site-packages\click\core.pyc in invoke(self, ctx)
    887         """
    888         if self.callback is not None:
--> 889             return ctx.invoke(self.callback, **ctx.params)
    890
    891

C:\Python27\lib\site-packages\click\core.pyc in invoke(*args, **kwargs)
    532         with augment_usage_errors(self):
    533             with ctx:
--> 534                 return callback(*args, **kwargs)
    535
    536     def forward(*args, **kwargs):

X:\regulations-parser\regparser\commands\preprocess_notice.pyc in preprocess_notice(document_number)
     19         ["effective_on", "full_text_xml_url", "publication_date", "volume"])
     20     notice_xmls = list(notice_xmls_for_url(document_number,
---> 21                                            meta['full_text_xml_url']))
     22     deps = dependency.Graph()
     23     for notice_xml in notice_xmls:

X:\regulations-parser\regparser\notice\xml.pyc in notice_xmls_for_url(doc_num, notice_url)
    151     """Find, preprocess, and return the XML(s) associated with a particular FR
    152     notice url"""
--> 153     local_notices = local_copies(notice_url)
    154     if local_notices:
    155         logging.info("using local xml for %s", notice_url)

X:\regulations-parser\regparser\notice\xml.pyc in local_copies(url)
    128 def local_copies(url):
    129     """Use any local copies (potentially with modifications of the FR XML)"""
--> 130     parsed_url = urlparse(url)
    131     path = parsed_url.path.replace('/', os.sep)
    132     notice_dir_suffix, file_name = os.path.split(path)

C:\Python27\lib\urlparse.pyc in urlparse(url, scheme, allow_fragments)
    141     Note that we don't break the components up in smaller bits
    142     (e.g. netloc is a single string) and we don't expand % escapes."""
--> 143     tuple = urlsplit(url, scheme, allow_fragments)
    144     scheme, netloc, url, query, fragment = tuple
    145     if scheme in uses_params and ';' in url:

C:\Python27\lib\urlparse.pyc in urlsplit(url, scheme, allow_fragments)
    180         clear_cache()
    181     netloc = query = fragment = ''
--> 182     i = url.find(':')
    183     if i > 0:
    184         if url[:i] == 'http': # optimize the common case

AttributeError: 'NoneType' object has no attribute 'find'

X:\regulations-parser>

I checked whether 95-17723 might not exist, but I found it on the Federal Register website, so I'm not sure whether this is a parsing failure or a failure of a given file to properly reference 95-17723.
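The bottom of that first traceback suggests `urlparse` was handed `None`, i.e. the notice metadata came back without a `full_text_xml_url`. As a quick illustration (a minimal sketch; `local_copies_guarded` is a hypothetical helper, not part of regulations-parser, and the guard is my own, not the project's actual fix):

```python
from urllib.parse import urlparse  # `from urlparse import urlparse` on Python 2


def local_copies_guarded(notice_url):
    """Sketch of a defensive version of notice.xml.local_copies.

    The traceback shows urlparse() receiving None, which means the notice
    metadata had no full_text_xml_url. Guarding here turns the crash into
    an empty result.
    """
    if not notice_url:
        return []  # no XML URL published for this notice; nothing local to find
    parsed_url = urlparse(notice_url)
    return [parsed_url.path]
```

With a real URL this returns the path component used to look up local copies; with `None` it simply returns an empty list instead of raising `AttributeError`.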

As suggested, I then ran notice_order to see if I could better pinpoint the source of the error:

X:\regulations-parser>py eregs.py notice_order 10 50
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/001/374/5.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/001/374/6.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/001/554/4.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/001/825/0.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
WARNING Multiple subpart contexts in amendment: [Verb( 'PUT', active=True, and_prefix=False), Context([ '50', 'Appendix:L' , certain=True ]), Paragraph([  ], field = 'title' ), Context([ None, 'Appendix:L' , certain=True ]), AndToken(), Verb( 'MOVE', active=True, and_prefix=False), Verb( 'POST', active=True, and_prefix=False)]
WARNING Bad format for whole appendix
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/002/335/6.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/002/728/3.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
WARNING Could not derive paragraph depths. Retrying with relaxed constraints.
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/003/173/5.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/011/156.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/017/352.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/013/083/2.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
INFO fetching notice xml for https://www.federalregister.gov/articles/xml/022/188/9.xml
INFO Starting new HTTPS connection (1): www.federalregister.gov
WARNING Could not find Appendix D to part 20
WARNING Could not derive paragraph depths. Retrying with relaxed constraints.
ERROR Could not determine paragraph depths (<SECTION /> 32-52):
STARS
?? a
Remaining markers: []
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
X:\regulations-parser\eregs.py in <module>()
     53
     54 if __name__ == '__main__':
---> 55     main()

X:\regulations-parser\eregs.py in main(prev_dependency)
     39     to the dependency changing"""
     40     try:
---> 41         cli()
     42     except dependency.Missing, e:
     43         resolvers = [resolver(e.dependency)

C:\Python27\lib\site-packages\click\core.pyc in __call__(self, *args, **kwargs)
    714     def __call__(self, *args, **kwargs):
    715         """Alias for :meth:`main`."""
--> 716         return self.main(*args, **kwargs)
    717
    718

C:\Python27\lib\site-packages\click\core.pyc in main(self, args, prog_name, complete_var, standalone_mode, **extra)
    694             try:
    695                 with self.make_context(prog_name, args, **extra) as ctx:
--> 696                     rv = self.invoke(ctx)
    697                     if not standalone_mode:
    698                         return rv

C:\Python27\lib\site-packages\click\core.pyc in invoke(self, ctx)
   1058                 sub_ctx = cmd.make_context(cmd_name, args, parent=ctx)
   1059                 with sub_ctx:
-> 1060                     return _process_result(sub_ctx.command.invoke(sub_ctx))
   1061
   1062         # In chain mode we create the contexts step by step, but after the

C:\Python27\lib\site-packages\click\core.pyc in invoke(self, ctx)
    887         """
    888         if self.callback is not None:
--> 889             return ctx.invoke(self.callback, **ctx.params)
    890
    891

C:\Python27\lib\site-packages\click\core.pyc in invoke(*args, **kwargs)
    532         with augment_usage_errors(self):
    533             with ctx:
--> 534                 return callback(*args, **kwargs)
    535
    536     def forward(*args, **kwargs):

X:\regulations-parser\regparser\commands\notice_order.pyc in notice_order(cfr_title, cfr_part, include_notices_without_changes)
     11 def notice_order(cfr_title, cfr_part, include_notices_without_changes):
     12     """Order notices associated with a reg."""
---> 13     notices_by_date = notices_for_cfr_part(str(cfr_title), str(cfr_part))
     14     for date in sorted(notices_by_date.keys()):
     15         click.echo(date)

X:\regulations-parser\regparser\builder.pyc in notices_for_cfr_part(title, part)
    109     """Retrieves all final notices for a title-part pair, orders them, and
    110     returns them as a dict[effective_date_str] -> list(notices)"""
--> 111     notices = fetch_notices(title, part, only_final=True)
    112     modify_effective_dates(notices)
    113     return group_by_eff_date(notices)

X:\regulations-parser\regparser\federalregister.pyc in fetch_notices(cfr_title, cfr_part, only_final)
     37     notices = []
     38     for result in fetch_notice_json(cfr_title, cfr_part, only_final):
---> 39         notices.extend(build_notice(cfr_title, cfr_part, result))
     40     return notices
     41

X:\regulations-parser\regparser\notice\build.pyc in build_notice(cfr_title, cfr_part, fr_notice, fetch_xml, xml_to_process)
     53     elif fr_notice['full_text_xml_url'] and fetch_xml:
     54         xmls = xmls_for_url(fr_notice['full_text_xml_url'])
---> 55         notices = [process_xml(notice, xml) for xml in xmls]
     56         set_document_numbers(notices)
     57         return notices

X:\regulations-parser\regparser\notice\build.pyc in process_xml(notice, notice_xml)
    228
    229     process_sxs(notice, notice_xml)
--> 230     process_amendments(notice, notice_xml)
    231     add_footnotes(notice, notice_xml)
    232

X:\regulations-parser\regparser\notice\build.pyc in process_amendments(notice, notice_xml)
    178         # Process amendments relating to a specific section in batches, too
    179         for section_xml, related_amends in amendments_by_section.items():
--> 180             for section in reg_text.build_from_section(cfr_part, section_xml):
    181                 create_xml_changes(related_amends, section, notice_changes)
    182

X:\regulations-parser\regparser\tree\xml_parser\reg_text.pyc in build_from_section(reg_part, section_xml)
    284
    285         section_nodes.append(
--> 286             RegtextParagraphProcessor().process(section_xml, sect_node)
    287         )
    288     return section_nodes

X:\regulations-parser\regparser\tree\xml_parser\paragraph_processor.pyc in process(self, xml, root)
    139                                   constraints)[0].pretty_str(),
    140                     markers[fails_at], markers[fails_at + 1:])
--> 141             depths = self.select_depth(depths)
    142             return self.build_hierarchy(root, nodes, depths)
    143         else:

X:\regulations-parser\regparser\tree\xml_parser\paragraph_processor.pyc in select_depth(self, depths)
     45         depths = heuristics.prefer_diff_types_diff_levels(depths, 0.8)
     46         depths = heuristics.prefer_multiple_children(depths, 0.4)
---> 47         depths = heuristics.prefer_shallow_depths(depths, 0.2)
     48         depths = heuristics.prefer_no_markerless_sandwich(depths, 0.2)
     49         depths = sorted(depths, key=lambda d: d.weight, reverse=True)

X:\regulations-parser\regparser\tree\depth\heuristics.pyc in prefer_shallow_depths(solutions, weight)
     43     """Dock solutions which have a higher maximum depth"""
     44     # Smallest maximum depth across solutions
---> 45     min_max_depth = min(max(p.depth for p in s.assignment) for s in solutions)
     46     max_max_depth = max(p.depth for s in solutions for p in s.assignment)
     47     variance = max_max_depth - min_max_depth

ValueError: min() arg is an empty sequence

This points to a possibly different culprit, so I'm not exactly sure how to proceed.
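For what it's worth, the second traceback dies because `prefer_shallow_depths` calls `min()` over an empty sequence of candidate solutions: every depth assignment was apparently filtered out before this heuristic ran. A simplified model of that first line (depths modeled as plain lists of ints rather than the parser's solution objects; the empty-input guard is my own illustration, not the project's fix):

```python
def smallest_max_depth(solutions):
    """Simplified model of the first step of heuristics.prefer_shallow_depths.

    `solutions` is a list of candidate depth assignments, each modeled here
    as a plain list of ints. The real code does
        min(max(p.depth for p in s.assignment) for s in solutions)
    which raises ValueError when `solutions` is empty -- exactly the crash
    in the traceback above.
    """
    if not solutions:  # illustrative guard; the real fix may differ
        return None
    return min(max(depths) for depths in solutions)
```

So the underlying question is why the section at `<SECTION /> 32-52` (the `STARS` / `?? a` output) yields no viable depth assignments at all.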

cmc333333 commented 8 years ago

Hey @sixlettervariables, we're so glad to see you trying this out! I've got some good news and some bad news. The bad news is that this regulation has several concepts and fiddly bits we haven't encountered before; I'll create issues for the ones I've found. The good news is that some minor tweaks to the input XML will massage it into a format we do understand; even better, I have those tweaks as a patch for you ;)

First, run:

$ eregs clear
$ eregs pipeline 10 50 outdir --only-latest

This will

  1. Clear out some of the work that the parser's tried to perform for you
  2. Create a single version of the regulation (rather than trying to look back through time)
  3. Crash :)

The crashing is okay in this case; it'll have downloaded and preprocessed a version of the regulation as XML. Apply this patch to it and you'll get something the parser can process. For the most part, that patch just moves around text, giving better hints to the parser around subparagraphs and the like; however, I did delete a few tables to kick the parser a little harder.

Once that patch has been applied to .eregs_index/annual/10/50/2015, re-run the pipeline command. With any luck, it'll pick up the modifications and spit out a working set of JSON files.
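If you haven't applied a unified diff to a single file before, the mechanics look roughly like this (a throwaway-directory demo; the real target would be `.eregs_index/annual/10/50/2015`, and the patch filename is whatever you saved the attachment as):

```shell
# Demo of the patch workflow in a scratch directory; the file "2015" here
# stands in for the preprocessed annual XML at .eregs_index/annual/10/50/2015.
set -e
workdir=$(mktemp -d)
cd "$workdir"
printf 'old text\nunchanged\n' > 2015
printf 'new text\nunchanged\n' > wanted
diff -u 2015 wanted > fix.patch || true  # diff exits 1 when files differ
patch 2015 < fix.patch                   # apply the unified diff in place
head -n 1 2015                           # now reads "new text"
```

The same `patch target-file < some.patch` invocation works against the real index file.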

Let us know how this goes and if we can help further!

sixlettervariables commented 8 years ago

Outstanding, thank you! I'll try this when I get home tonight.

The bad news is that this regulation has several concepts and fiddly bits we haven't encountered before...

Welcome to the Nuclear industry!

cmc333333 commented 8 years ago

Hey @sixlettervariables, did you make any headway here?