freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License

Citation parser fails for statutes with letters in the section number #146

Open jmesserschmidt1 opened 1 year ago

jmesserschmidt1 commented 1 year ago

U.S. Code statutes with letters in the section number appear to go unrecognized. So "18 U.S.C. § 1028" and "18 U.S.C. § 1028(a)" are parsed, but "18 U.S.C. § 1028A" is not. I've tried some variations, and the behavior seems to be consistent.

mlissner commented 1 year ago

Thanks for sending this along. I think this would be pretty easy to fix, but our code parsers aren't particularly advanced compared to our opinions parsers.

Do you want to take a stab at it?

jmesserschmidt1 commented 1 year ago

Sure. I'm not super familiar with the code, but I suspect it might need a variation on the law_section regex, similar to the ones that exist for page or volume. This comes up with CFR cites as well (e.g., 17 CFR § 240.10b-5 is currently parsed as 17 CFR 240). So something like (?P<section>\\d+(?:[\\-.:]\\d+){,3}[a-zA-Z]{0,4})
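To make the idea concrete, here is a rough sketch using only Python's standard `re` module. It is not eyecite's actual `law_section` regex or matching code. The names `SECTION`, `CITE`, and `find_sections` are made up for illustration, and the reporter alternation covers only the two examples in this thread. One thing worth noting: a pattern that allows letters only at the very end of the section would capture "240.10b" from "240.10b-5", so this sketch allows a short letter suffix after each numeric group instead.

```python
import re

# Hypothetical stand-in for a section pattern; NOT eyecite's real regex.
# Each numeric group may carry a short letter suffix (e.g. "1028A", "10b"),
# and up to three further groups may follow, joined by "-", "." or ":".
SECTION = r"(?P<section>\d+[a-zA-Z]{0,4}(?:[\-.:]\d+[a-zA-Z]{0,4}){0,3})"

# Hypothetical minimal citation pattern: title, reporter, "§", section.
# The trailing negative lookahead rejects matches that would stop in the
# middle of a longer alphanumeric run.
CITE = re.compile(
    r"(?P<title>\d+)\s+(?P<reporter>U\.S\.C\.|C\.?F\.?R\.?)\s+§\s*"
    + SECTION
    + r"(?![a-zA-Z0-9])"
)


def find_sections(text):
    """Return (title, reporter, section) tuples for each match in text."""
    return [(m["title"], m["reporter"], m["section"]) for m in CITE.finditer(text)]


if __name__ == "__main__":
    for sample in (
        "18 U.S.C. § 1028",
        "18 U.S.C. § 1028(a)",
        "18 U.S.C. § 1028A",
        "17 CFR § 240.10b-5",
    ):
        print(sample, "->", find_sections(sample))
```

With this sketch, "1028A" and "240.10b-5" are captured whole, while the parenthetical in "1028(a)" is left outside the section group. Whether a pattern like this fits with the rest of eyecite's regexes would still need to be checked against the existing tests.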

mlissner commented 1 year ago

I don't know that part of the code very well either, but if you want to do a PR with tests that fixes this, I think we'd probably merge it (and release a new version, if desired).