eregs / regulations-parser

Parser for U.S. federal regulations and other regulatory information
Creative Commons Zero v1.0 Universal

Unify XML parsing schemes #350

Open cmc333333 opened 7 years ago

cmc333333 commented 7 years ago

We currently use (at least) three separate but very similar schemes for parsing XML tags.

For appendices, we have a single class, AppendixProcessor, which steps through the tags within an APPENDIX and calls a method for each. At the end of each "group", it runs the depth-derivation code we use elsewhere.

For regulation sections, we have ParagraphProcessor (and its subclasses), each of which has a list of tag-matching-and-processing children. It steps through the tags, finds the first applicable tag matcher for each, and runs depth derivation at the end of the process.

At a higher level, we select between SECTION, SUBPART, etc. processors through a system of plugins, where the top-level PART's children are compared to the plugin tag matchers.
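As a rough illustration of the pattern these three schemes share (step through child tags, dispatch each one to the first handler that matches, then run depth derivation over the group), here is a minimal Python sketch. It is not the existing code: TagMatcher, SimpleTagMatcher, derive_depths, and process_children are hypothetical names invented for this example, the depth-derivation step is a placeholder, and the standard library's xml.etree.ElementTree is used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET


class TagMatcher:
    """Hypothetical interface: decide whether a tag applies, then process it."""

    def matches(self, xml_node):
        raise NotImplementedError

    def process(self, xml_node):
        raise NotImplementedError


class SimpleTagMatcher(TagMatcher):
    """Hypothetical matcher that handles any of a fixed set of tag names."""

    def __init__(self, *tags):
        self.tags = tags

    def matches(self, xml_node):
        return xml_node.tag in self.tags

    def process(self, xml_node):
        # Return the tag name and its stripped text as a stand-in for a real node
        return (xml_node.tag, (xml_node.text or "").strip())


def derive_depths(nodes):
    """Placeholder for depth derivation: assigns depth 0 to every node."""
    return [(0, node) for node in nodes]


def process_children(parent, matchers):
    """Run the first applicable matcher on each child, then derive depths."""
    results = []
    for child in parent:
        for matcher in matchers:
            if matcher.matches(child):
                results.append(matcher.process(child))
                break  # first applicable matcher wins; unmatched tags are skipped
    return derive_depths(results)


xml = ET.fromstring(
    "<SECTION><P>(a) First</P><FP>Flush</FP><PRTPAGE/><P>(b) Second</P></SECTION>"
)
out = process_children(xml, [SimpleTagMatcher("P", "FP")])
```

In this sketch the same process_children function could in principle serve an APPENDIX, a SECTION, or a PART, with only the list of matchers changing. Whether that is the right shape for the unified scheme is an open question.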

Each of these styles has slightly different mechanics. We should unify the best ideas from this slowly-evolving system: