Closed mlissner closed 6 years ago
Note that I caught this while QA'ing #174. I think there's a good chance that fixing this will fix that too.
Where by #34
we mean https://github.com/freelawproject/recap-chrome/pull/34 not recap#34
So, just so we're all in one place, in https://github.com/freelawproject/recap-chrome/pull/34/files/2f12b9b0843f2020c0ed6231113f9cd2fa3656ce#r155904390 you suggested:
I just noticed, but I think this still needs a tweak to remove the $ on the end (like you did above). Otherwise, a URL like this won't work:
Which is why the $
got dropped.
Unfortunately it looks like the regexp has to get more complicated. My replacement:
url.match(/\?(\d+)(?:&.*)?$/)
That is, a ?
followed by a bare case number, optionally followed by an &
with its parameters (as a non-capturing group).
Tests follow. Original, too strict:
/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> null
/\?(\d+)$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> null
What's currently in-tree, too relaxed:
/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> Array [ "?38115", "38115" ]
/\?(\d+)/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> Array [ "?589627576618924", "589627576618924" ]
And as proposed:
/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115")
=> Array [ "?38115", "38115" ]
/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/DktRpt.pl?38115&")
=> Array [ "?38115&", "38115" ]
/\?(\d+)(?:&.*)?$/.exec("https://ecf.almd.uscourts.gov/cgi-bin/WrtOpRpt.pl?589627576618924-L_1_0-1")
=> null
Note that this deliberately does not handle the https://ecf.dcd.uscourts.gov/cgi-bin/DktRpt.pl?190597;190598
form discussed at https://github.com/freelawproject/recap/issues/168#issuecomment-350550639 et supra. In part because I don't know what to do with it.
Some choices:
;
to the caseid part of the regexp, i.e. /\?([\d;]+)(?:&.*)?$/
and then split the result on ;
I guess part of this depends on how the docket parser handles these.
Which reminds me of another question: when the docket parser runs and identifies new doc1 IDs, is there a reason it cannot trigger a search of the processing queue for those documents to see if they've already been uploaded? Maybe this is pointless if #61 is resolved.
That regex works for me. I'll put it in place and mark this guy as closed.
Which reminds me of another question: when the docket parser runs and identifies new doc1 IDs, is there a reason it cannot trigger a search of the processing queue for those documents to see if they've already been uploaded?
I think that, or something similar, is the solution to #61, yes.
The regex we just pulled in PR #34 relaxes a regex from:
To:
Alas, this now matches URLs like:
And reports that the case ID is
589627576618924
. That's no good, and you can see the (somewhat) predictable results in thepacer_case_id
field here (if you have permission):This regex needs to be tightened a bit. I think we just need to allow it to match, but only if the digits are followed by a
?
. Not sure, and pretty beat from a long day. Thoughts/fixes welcome.