ambirdsall / lawfetcher

parses US legal citations and builds URLs to source documents
http://ambirdsall.com/lawfetcher/

Properly handling multiple citations requires an Actual Parser #2

Open ambirdsall opened 5 years ago

ambirdsall commented 5 years ago

The shorthand syntax for combining multiple related citations is intuitive to humans, but resolving the full, independent form of citations after the first requires context: specifically, figuring out which part of the preceding string should be prepended to any specific comma-delimited citation fragment takes detailed knowledge of the structure of the citation(s) that came before. This is fundamentally beyond the limits of regular expressions, for the same reason it's impossible to write a regular expression which only matches correctly nested pairs of parentheses.
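To make the context problem concrete, here is a minimal sketch. The citation string `42 U.S.C. §§ 1983, 1985` and the helper name `naive_split` are illustrative assumptions, not taken from lawfetcher's code. Splitting on commas is easy; the hard part is knowing what the second fragment is missing.

```python
import re

def naive_split(text):
    """Split a combined citation on commas, as a regex-only approach might."""
    return [part.strip() for part in re.split(r",", text)]

fragments = naive_split("42 U.S.C. §§ 1983, 1985")
# The first fragment is a complete citation. The second is a bare section
# number: nothing in the fragment itself says which title or code it
# belongs to, so that context has to come from the citation before it.
```

Resolving the second fragment to `42 U.S.C. § 1985` means knowing that `42 U.S.C.` is the reusable prefix and `1983` is the part being swapped out, which depends on the structure of the first citation.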

More precisely, or at least with more jargon: in terms of the Chomsky hierarchy, the grammar of the US legal citation system is either a context-free grammar or a context-sensitive grammar (I think it's context-free, but I'm new enough at this stuff that I need to really delve into the production rules to figure it out), and regular expressions (which, naturally, describe regular grammars) simply aren't powerful enough to describe it. (n.b. the Wikipedia entries are not very approachable; I found this helpful, though it still assumes some familiarity with the concepts.)
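The nested-parentheses comparison can be sketched in a few lines. The names `FLAT_ONLY` and `is_balanced` are hypothetical, chosen for this illustration. A regex can be written for any fixed nesting depth, but arbitrary depth needs a counter that can grow without bound, which is exactly the memory a regular grammar lacks.

```python
import re

# Accepts text whose parentheses nest at most one level deep.
# Supporting depth 2, 3, ... means writing an ever-larger pattern by hand.
FLAT_ONLY = re.compile(r"^(?:[^()]|\([^()]*\))*$")

def is_balanced(text):
    """Check nesting of any depth by tracking how many parens are open."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closer with no matching opener
                return False
    return depth == 0
```

Here `FLAT_ONLY` accepts `(a)(b)` but rejects the perfectly valid `((a))`, while `is_balanced` handles both.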

It's possible to directly adapt the current approach to handle multiple citations by hacking together a bunch of complicated regexes to tackle sub-parts of the task, orchestrated by some ad-hoc "glue" code, but that would be hard to understand and easy to break. It would be much easier in the end to adapt some of the techniques of programming language parsing. A legitimate parser which splits out the distinct steps of

would be more easily able to handle complicated edge cases and typos gracefully. Two steps back, five steps forwards.

cf. http://www.craftinginterpreters.com/parsing-expressions.html for a nice hand-holding walkthrough of a basic recursive descent parser. Recursive descent seems like the best fit: it's conceptually simpler than most of the competition, relatively easy to translate into normal code, and popular and well documented.
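As a rough sketch of what that could look like, here is a tiny recursive descent parser in Python. Everything in it is an assumption made for illustration: the toy grammar (one title, one code abbreviation, then a comma-separated list of section numbers), the token names, and the `expand` helper. It is not lawfetcher's code and covers only a sliver of real citation syntax.

```python
import re

# Toy token set: section sign(s), comma, number, abbreviation like "U.S.C."
TOKEN_RE = re.compile(
    r"\s*(?:(?P<SECTION>§+)|(?P<COMMA>,)|(?P<NUMBER>\d+)|(?P<WORD>[A-Za-z.]+))"
)

def tokenize(text):
    """Turn the raw string into a list of (kind, value) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens

class Parser:
    """One method per grammar rule, as in recursive descent."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        if self.peek() != kind:
            raise ValueError(f"expected {kind}, found {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    # section_list := NUMBER ("," section_list)?
    def section_list(self):
        sections = [self.expect("NUMBER")]
        if self.peek() == "COMMA":
            self.expect("COMMA")
            sections += self.section_list()  # the recursive step
        return sections

    # citation_list := NUMBER WORD SECTION section_list EOF
    def citation_list(self):
        title = self.expect("NUMBER")
        code = self.expect("WORD")
        self.expect("SECTION")
        sections = self.section_list()
        if self.peek() != "EOF":
            raise ValueError(f"unexpected trailing {self.peek()}")
        # Each later section reuses the prefix parsed from the first citation.
        return [f"{title} {code} § {s}" for s in sections]

def expand(text):
    """Expand a combined citation into full, independent citations."""
    return Parser(tokenize(text)).citation_list()
```

With this toy grammar, `expand("42 U.S.C. §§ 1983, 1985")` returns `["42 U.S.C. § 1983", "42 U.S.C. § 1985"]`, and a malformed input such as a trailing comma raises a `ValueError` that names the token the parser expected.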

ambirdsall-gogo commented 3 years ago

ast