freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

159 Fixes index error when a document number is not found #160

Closed albertisfu closed 1 year ago

albertisfu commented 1 year ago

Well, 4 months ago, when this new service was released, we thought it was a good idea to fail loudly if a document number was not found in the PDF header, so we could check those PDFs and update the regex in case we had missed a document number string.

I checked the errors on Sentry, and most of them are related to weird PDF headers like:

CCaassee 2211--22009955,, DDooccuummeenntt 19090, ,0 011/0/044/2/2002233, ,3 3444466266138, ,P Paaggee11 o of f2 2

And there is one where the header doesn't contain a document number at all:

Appellate Case: 22-1801 Page: 1 Date Filed: 01/19/2023 Entry ID: 5237514

We currently recognize the following ones, which we have seen so far: Document:, Document, Doc:, DktEntry:

So, since we haven't found any new document number strings to parse, I changed the logic so that when no document number is found, we just return an empty string instead of failing loudly.
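To illustrate the behavior described above, here is a minimal sketch, not the actual doctor code. The pattern and the function name `get_document_number` are assumptions made for illustration; only the four labels (Document:, Document, Doc:, DktEntry:) come from this thread.

```python
import re

# Hypothetical pattern covering the labels listed in this thread:
# "Document:", "Document", "Doc:", "DktEntry:".
# The real regex in doctor may differ.
DOC_NUMBER_RE = re.compile(r"(?:Document|Doc|DktEntry):?\s*(\d[\d-]*)")


def get_document_number(header_text: str) -> str:
    """Return the document number found in a PDF header, or "" if none.

    Returning an empty string (rather than indexing into an empty list
    of matches) avoids an IndexError when no label is present.
    """
    match = DOC_NUMBER_RE.search(header_text)
    return match.group(1) if match else ""
```

With this sketch, a header such as "Appellate Case: 22-1801 Page: 1 Date Filed: 01/19/2023 Entry ID: 5237514" contains none of the known labels, so the function returns "" and the caller decides what to do next.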

Let me know what you think.

mlissner commented 1 year ago

Yeah, OK. Seems like we're doing as much as we can without going crazy.

mlissner commented 1 year ago

Does CourtListener fail elegantly when we can't extract a number and return ""?

albertisfu commented 1 year ago

> Does CourtListener fail elegantly when we can't extract a number and return ""?

Yeah. If a document number is not found in a PDF, we fall back to the download confirmation page to get it. If the number can't be found on the download confirmation page either, the docket entry is added without a number.
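The fallback order described here could be sketched roughly as follows. This is not CourtListener's actual code: `resolve_document_number` and the `fetch_from_confirmation_page` callback are hypothetical names used only to show the order of the checks.

```python
from typing import Callable, Optional


def resolve_document_number(
    pdf_number: str,
    fetch_from_confirmation_page: Callable[[], str],
) -> Optional[str]:
    """Pick a document number using the fallback order from the thread.

    1. Use the number extracted from the PDF header, if any.
    2. Otherwise ask the (hypothetical) confirmation-page lookup.
    3. If both are empty, return None so the docket entry can be
       added without a number.
    """
    if pdf_number:
        return pdf_number
    page_number = fetch_from_confirmation_page()
    return page_number or None
```

Passing the lookup as a callable means the confirmation page is only consulted when the PDF header yields nothing.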