freelawproject / reporters-db

A database of court reporters, tests and other experiments
BSD 2-Clause "Simplified" License
93 stars 34 forks source link

Validate Regex patterns #94

Closed flooie closed 2 years ago

flooie commented 2 years ago

Add test to validate all regexes (save one) has a valid volume, reporter and page group.

mlissner commented 2 years ago

Hm, check out line 129 in tests. It looks like we already have a test for this, but somehow it didn't work?

jcushman commented 2 years ago

FWIW I don't think it makes sense to check for volume at all -- anything with full_cite_paragraph or full_cite_single_volume can come back without a volume, as well as the ones with "cite_format": "{reporter} {page}".

flooie commented 2 years ago

@jcushman couldn't we just append (?:(?P<volume>) )? to any regex pattern without a volume and move on?

flooie commented 2 years ago

@jcushman we do (?:(?P<volume>1) )? for full_cite_single_volume

flooie commented 2 years ago

Hm, check out line 129 in tests. It looks like we already have a test for this, but somehow it didn't work?

Screen Shot 2022-05-12 at 10 36 45 AM

@mlissner looks like @mattdahl added it after he discovered the missing reporter originally.

jcushman commented 2 years ago

couldn't we just append (?:(?P) )? to any regex pattern without a volume and move on?

Well, the regex should match the citations you want to find; the tests should match the regex. If the regexes without P<volume> are wrong and those reporters actually have volumes sometimes, sure, change the regexes to match the citations. Otherwise I'd just delete the volume check, because tests shouldn't assert something that isn't true. (And then you can also delete the if "RIA Federal Tax Handbook" exception, which is showing up in your code as a signal that you're testing for something that isn't accurate.)

flooie commented 2 years ago

I think this is all moot because @mattdahl fixed it for us and we wouldn't have even noticed if we had just published the latest version earlier. I think his check for reporter and page is probably right. I'm going to close this PR. Thanks @jcushman @mlissner and @mattdahl