freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

Doctor not identifying PACER headers very well #139

Closed: mlissner closed this issue 1 year ago

mlissner commented 2 years ago

I noticed a couple of cases today where OCR should have been triggered but wasn't:

https://www.courtlistener.com/docket/63348437/1/navarro-v-pelosi/

https://www.courtlistener.com/docket/5319662/1/unicorn-investment-bank-v-kuruvilla/

In both cases, the text that's extracted looks like:

Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 1 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 2 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 3 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 4 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 5 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 6 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 7 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 8 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 9 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 10 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 11 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 12 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 13 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 14 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 15 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 16 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 17 of 18
Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 18 of 18

We used to have a kinda flaky regex for detecting these headers, but perhaps it's not working, or it needs a tweak?
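For illustration, here is a minimal sketch of the kind of check I have in mind. This is not Doctor's actual regex or code: the pattern, the function names (`strip_pacer_headers`, `needs_ocr`), and the `min_chars` threshold are all assumptions made up for this example. The idea is to drop every line that looks like a PACER header and, if almost nothing is left, treat the document as needing OCR.

```python
import re

# Hypothetical pattern (an assumption, not Doctor's real regex). It matches a
# whole line shaped like:
#   Case 3:05-cv-00281-GCM Document 1 Filed 06/17/05 Page 1 of 18
PACER_HEADER_RE = re.compile(
    r"^\s*Case:?\s+\S+"                         # case number, e.g. 3:05-cv-00281-GCM
    r"\s+Document:?\s+\S+"                      # document number, e.g. 1 or 1-2
    r"\s+Filed:?\s+\d{1,2}/\d{1,2}/\d{2,4}"     # filing date, e.g. 06/17/05
    r"\s+Page:?\s+\d+\s+of\s+\d+\s*$",          # page counter, e.g. Page 1 of 18
    re.IGNORECASE,
)


def strip_pacer_headers(text: str) -> str:
    """Return the text with every PACER-header-looking line removed."""
    kept = [line for line in text.splitlines() if not PACER_HEADER_RE.match(line)]
    return "\n".join(kept)


def needs_ocr(text: str, min_chars: int = 50) -> bool:
    """Guess whether the extracted text is only PACER headers.

    min_chars is an arbitrary illustrative threshold: if fewer than that many
    characters remain after removing header lines, assume the PDF is a scan
    whose only text layer is the header stamp.
    """
    remaining = strip_pacer_headers(text).strip()
    return len(remaining) < min_chars
```

Run against the extracted text above, every line would be stripped and `needs_ocr` would return `True`. Headers in other formats (appellate courts, for example) may not match this pattern, so it would likely need tuning against real documents.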