Relaxed regex in getCaseNumberFromUrls matches too many links

mlissner commented 6 years ago

The regex we just pulled in PR #34 relaxes a regex from:

url.match(/\?(\d+)$/);

To:

url.match(/\?(\d+)/))

Alas, this now matches URLs like:

https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1

And reports that the case ID is 589627576618924. That's no good, and you can see the (somewhat) predictable results in the pacer_case_id field here (if you have permission):

https://www.courtlistener.com/api/rest/v3/recap/59908/

This regex needs to be tightened a bit. I think we just need to allow it to match, but only if the digits are followed by a ?. Not sure, and pretty beat from a long day. Thoughts/fixes welcome.

mlissner commented 6 years ago

Note that I caught this while QA'ing #174. I think there's a good chance that fixing this will fix that too.

johnhawkinson commented 6 years ago

Where by #34 we mean https://github.com/freelawproject/recap-chrome/pull/34 not recap#34

So, just so we're all in one place, in https://github.com/freelawproject/recap-chrome/pull/34/files/2f12b9b0843f2020c0ed6231113f9cd2fa3656ce#r155904390 you suggested:

I just noticed, but I think this still needs a tweak to remove the $ on the end (like you did above). Otherwise, a URL like this won't work:

https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&

Which is why the $ got dropped. Unfortunately it looks like the regexp has to get more complicated. My replacement:

url.match(/\?(\d+)(?:&.*)?$/)

That is, a ? followed by a bare case number, optionally followed by an & with its parameters (as a non-capturing group).

Tests follow. Original, too strict:

/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> null
/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> null

What's currently in-tree, too relaxed:

/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> Array [ "?38115", "38115" ]
/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> Array [ "?589627576618924", "589627576618924" ]

And as proposed:

/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> Array [ "?38115&", "38115" ]
/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> null

Note that this deliberately does not handle the https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?190597;190598 form discussed at https://github.com/freelawproject/recap/issues/168#issuecomment-350550639 et supra. In part because I don't know what to do with it.

Some choices:

Add a comment noting this uncertain case, maybe a diagnostic log.
Add ; to the caseid part of the regexp, i.e. /\?([\d;]+)(?:&.*)?$/ and then split the result on ;
Maybe just take the lowest caseid or the first caseid
Maybe allow adjacent caseids only

I guess part of this depends on how the docket parser handles these.

Which reminds me of another question: when the docket parser runs and identifies new doc1 IDs, is there a reason it cannot trigger a search of the processing queue for those documents to see if they've already been uploaded? Maybe this is pointless if #61 is resolved.

mlissner commented 6 years ago

That regex works for me. I'll put it in place and mark this guy as closed.

Which reminds me of another question: when the docket parser runs and identifies new doc1 IDs, is there a reason it cannot trigger a search of the processing queue for those documents to see if they've already been uploaded?

I think that, or something similar, is the solution to #61, yes.

freelawproject / recap

Relaxed regex in getCaseNumberFromUrls matches too many links #220