The shorthand syntax for combining multiple related citations is intuitive to humans, but resolving each citation after the first into its full, independent form requires context: working out which part of the preceding string should be prepended to a given comma-delimited fragment takes detailed knowledge of the structure of the citation(s) that came before. This is fundamentally beyond the limits of regular expressions, for the same reason it's impossible to write a regular expression that matches only correctly nested pairs of parentheses.
More precisely, or at least with more jargon, the grammar of the US legal citation system is either context-free or context-sensitive in terms of the Chomsky hierarchy (I think it's context-free, but I'm new enough at this stuff that I need to really delve into the production rules to figure it out), and regular expressions (which, naturally, describe regular grammars) simply aren't powerful enough to describe it. (N.B. the Wikipedia entries are not very approachable; I found this helpful, though it still assumes some familiarity with the concepts.)
It's possible to directly adapt the current approach to handle multiple citations by hacking together a bunch of complicated regexes to tackle sub-parts of the task, orchestrated by some ad-hoc "glue" code, but that would be hard to understand and easy to break. It would be much easier in the end to adapt some of the techniques of programming language parsing. A legitimate parser which splits out the distinct steps of
splitting an input string into its constituent tokens or "words"; and
combining those tokens into a logical structure (i.e., an abstract syntax tree for citation expressions)
would be more easily able to handle complicated edge cases and typos gracefully. Two steps back, five steps forwards.
See http://www.craftinginterpreters.com/parsing-expressions.html for a nice hand-holding walkthrough of a basic recursive descent parser. Recursive descent seems like the best fit: it's conceptually simpler than most of the competition, relatively easy to translate into normal code, and popular and well documented.
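To make the two steps concrete, here is a minimal sketch of a tokenizer feeding a recursive descent parser. The grammar is a toy one I made up for illustration (a title number, a code abbreviation, then comma-separated sections with optional parenthesized subsections); it is not the real citation grammar, and every name in it (`tokenize`, `Parser`, `Citation`, `Section`, `expand`) is hypothetical. Note that regular expressions still do the low-level work of recognizing individual tokens; it's only the structure above the token level that moves into ordinary code.

```python
import re
from dataclasses import dataclass, field

# Step 1: tokens. These token kinds are assumptions for a toy grammar.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+)                 |  # title or section number
    (?P<CODE>[A-Z][A-Za-z.]*\.)     |  # code abbreviation ending in a dot
    (?P<COMMA>,)                    |
    (?P<LPAREN>\()                  |
    (?P<RPAREN>\))                  |
    (?P<WORD>[A-Za-z]+)             |  # lettered subsection such as "a"
    (?P<WS>\s+)
""", re.VERBOSE)


def tokenize(text):
    """Split the input into (kind, text) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


# Step 2: a small syntax tree, built by one method per grammar rule.
@dataclass
class Section:
    number: str
    subsections: list = field(default_factory=list)


@dataclass
class Citation:
    title: str
    code: str
    sections: list


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def expect(self, kind):
        if self.peek() != kind:
            raise ValueError(f"expected {kind}, found {self.peek()}")
        text = self.tokens[self.pos][1]
        self.pos += 1
        return text

    # citation := NUMBER CODE section ("," section)*
    def citation(self):
        title = self.expect("NUMBER")
        code = self.expect("CODE")
        sections = [self.section()]
        while self.peek() == "COMMA":
            self.expect("COMMA")
            sections.append(self.section())
        if self.peek() != "EOF":
            raise ValueError(f"unexpected trailing {self.peek()}")
        return Citation(title, code, sections)

    # section := NUMBER ("(" (NUMBER | WORD) ")")*
    def section(self):
        sec = Section(self.expect("NUMBER"))
        while self.peek() == "LPAREN":
            self.expect("LPAREN")
            kind = "NUMBER" if self.peek() == "NUMBER" else "WORD"
            sec.subsections.append(self.expect(kind))
            self.expect("RPAREN")
        return sec


def expand(citation):
    """Prepend the shared title and code to every section in the tree."""
    return [
        f"{citation.title} {citation.code} {s.number}"
        + "".join(f"({sub})" for sub in s.subsections)
        for s in citation.sections
    ]


print(expand(Parser(tokenize("42 U.S.C. 1983, 1985(3)")).citation()))
# → ['42 U.S.C. 1983', '42 U.S.C. 1985(3)']
```

The "context" problem goes away because the shared prefix lives in the tree rather than in the string: by the time `expand` runs, the parser already knows which tokens were the title and code, so each fragment after a comma can be completed without re-inspecting the text before it. A malformed input such as a missing closing parenthesis fails at a specific token with a specific expectation, which is the hook for handling typos gracefully.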