freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

Get PACER document number from PDF #147

Closed by albertisfu 2 years ago

albertisfu commented 2 years ago

This is a new service to extract the document number from appellate PDFs.

It supports different header stamp versions from the following appellate courts:

ca1, ca2, ca3, ca4, ca5, ca6, ca7, ca9, ca10 and cafc

As described in https://github.com/freelawproject/courtlistener/pull/2257, it is not possible to get the document number from the PDF for ca8, ca11 and cadc. For those courts, we're going to get it directly from the receipt page.

The task is designed to fail loudly when we cannot get the document number from the PDF, rather than returning None, so that we can see the error on Sentry and add new header stamp patterns.
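To illustrate the fail-loud behavior, here is a minimal sketch, not the actual code in this PR. It assumes the header text has already been extracted from the PDF. The names `HEADER_STAMP_PATTERNS`, `DocumentNumberNotFound` and `get_document_number`, as well as the two regexes, are hypothetical and only stand in for the real header stamp patterns.

```python
import re

# Hypothetical header stamp patterns, for illustration only.
# The real service supports several stamp versions per court.
HEADER_STAMP_PATTERNS = [
    re.compile(r"Document:\s*(\d+)"),
    re.compile(r"Doc:\s*(\d+)"),
]


class DocumentNumberNotFound(ValueError):
    """Raised when no known header stamp pattern matches the text."""


def get_document_number(header_text: str) -> str:
    """Return the document number found in the header stamp text.

    Raises DocumentNumberNotFound instead of returning None, so the
    failure reaches error tracking and a new pattern can be added.
    """
    for pattern in HEADER_STAMP_PATTERNS:
        match = pattern.search(header_text)
        if match:
            return match.group(1)
    raise DocumentNumberNotFound(
        f"No header stamp pattern matched: {header_text[:80]!r}"
    )
```

With this shape, a caller that receives an unknown stamp gets an exception rather than a silent None, which is what makes the missing pattern visible in error reports.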

If this looks good, I'll add the version bump so the new image is available for CourtListener to pass the tests.

albertisfu commented 2 years ago

Thank you! I've applied these changes and also added the version bump and documentation for the new service.

mlissner commented 2 years ago

Looks good. Just the one question above remaining. If it's all good, please merge.