freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

DocketReport ERROR_STRINGS "This case was administratively closed" too aggressive. #1025

Closed nadahlberg closed 1 month ago

nadahlberg commented 1 month ago

I think the ERROR_STRING for 'This case was administratively closed' may be causing some docket HTML files to be incorrectly flagged as invalid.

Are there any other heuristics that might be used to identify whatever error page this is being used to detect?

Here are some cases that include the phrase on the docket:

lamd 3:17-cv-01352, nysd 1:16-cv-01697, nynd 9:17-cv-01287, nynd 9:16-cv-00251, nynd 9:17-cv-01331, flmd 8:17-cv-02878, pawd 2:17-cv-01035, pawd 2:17-cv-00020, nynd 9:17-cv-00627, cod 1:17-cv-02042, nynd 9:16-cv-01369, nynd 9:17-cv-00938, ctd 3:17-cv-00518, nynd 9:17-cv-00861, flmd 8:16-cv-00088, ctd 3:16-cv-00890, nynd 9:16-cv-00757, flsd 1:17-cv-21992, nynd 9:17-cv-01051, cacd 2:16-cv-05381, nynd 9:17-cv-00661, cod 1:16-cv-02648, nynd 9:16-cv-00101, nynd 9:17-cv-00710, nynd 9:17-cv-00939, gand 1:18-cv-03187, nysd 1:17-cv-08171, pawd 2:16-cv-01470, nynd 9:17-cv-01216, txnd 4:18-cv-00245, txwd 5:20-cv-01380, txnd 3:16-cv-00643, ilcd 4:16-cv-04127, gand 1:17-cv-00361, pawd 1:16-cv-00050, nynd 9:17-cv-01295, txed 4:20-cv-00168, nynd 9:16-cv-00087, cod 1:17-cv-02174, pawd 2:17-cv-00614, nynd 9:17-cv-00058, flmd 6:16-cv-00486

It looks like this phrase occasionally appears in the body of the entry descriptions:

[Image: error_string_example]

mlissner commented 1 month ago

Interesting, @nadahlberg, thanks for reporting this. Let me see if I can figure out who added it and why.

mlissner commented 1 month ago

Looks like the answer is...me! I added it back in 2017 in 8de4670aab9242b2d2c2b514e7c17f65f75ee990.

I think we could make this more precise by tweaking the ERROR_STRING to have asterisks or perhaps to remove the error string altogether.

The string in the test case is:

*** This case was administratively closed.***

Want to try that, Nathan? If not, I can put it on a backlog to tweak this.
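To illustrate why the asterisk-wrapped string is more precise, here is a minimal sketch. It assumes the check is a plain substring match; `is_flagged`, `BROAD`, `NARROW`, and both sample HTML snippets are hypothetical stand-ins, not juriscraper's actual code or real docket text.

```python
import re

# Hypothetical stand-ins for the current and proposed error strings.
BROAD = "This case was administratively closed"
NARROW = "*** This case was administratively closed.***"


def is_flagged(html, error_strings):
    """Return True if any error string occurs verbatim in the page text.

    Assumption: plain substring matching. The real check may differ.
    """
    return any(s in html for s in error_strings)


# Made-up samples: an error page, and a valid docket whose entry
# description happens to contain the phrase.
error_page = "<p>*** This case was administratively closed.***</p>"
valid_docket = "<td>ORDER: This case was administratively closed pending appeal.</td>"

print(is_flagged(error_page, [BROAD]))     # True
print(is_flagged(valid_docket, [BROAD]))   # True (false positive)
print(is_flagged(error_page, [NARROW]))    # True
print(is_flagged(valid_docket, [NARROW]))  # False

# If the strings were used as regular expressions instead, the asterisks
# would need escaping, because an unescaped leading "*" is not a valid pattern.
print(re.search(re.escape(NARROW), error_page) is not None)  # True
```

The narrow string still catches the error page from the test case but no longer matches a docket entry that merely mentions the phrase.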

nadahlberg commented 1 month ago

Thanks @mlissner! I didn't get any hits for the asterisk version, so I'm comfortable using it for my purposes. For context, I tested on ~40M entries.

Is a one-line PR helpful, or did you want to DIY / do more due diligence?

mlissner commented 1 month ago

Yeah, a one-liner should do it! Maybe worth adding a test with a docket that should be parsed properly, so we don't regress?
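A regression test along those lines might look roughly like the sketch below. It assumes a plain substring check; `ERROR_STRINGS`, `is_valid`, and the sample HTML are hypothetical and self-contained here, so a real test would instead run the project's parser against a saved docket file.

```python
import unittest

# Hypothetical: the tightened error string, matched as a plain substring.
ERROR_STRINGS = ["*** This case was administratively closed.***"]


def is_valid(html):
    """Return False if the page contains any known error string."""
    return not any(s in html for s in ERROR_STRINGS)


class ErrorStringRegressionTest(unittest.TestCase):
    def test_phrase_in_entry_description_is_valid(self):
        # Made-up docket entry that mentions the phrase without asterisks.
        html = "<td>ORDER: This case was administratively closed pending appeal.</td>"
        self.assertTrue(is_valid(html))

    def test_error_page_is_still_invalid(self):
        # The asterisk-wrapped string from the existing test case.
        html = "<p>*** This case was administratively closed.***</p>"
        self.assertFalse(is_valid(html))
```

Saved as a module, this can be run with `python -m unittest`.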