Add X-Ray to CourtListener ingestion pipeline

mlissner commented 2 years ago

MVP is to:

Add X-Ray as a dependency
Check PACER docs for bad redactions during ingestion
Send an email to admins about the problem, including the text and bbox.
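
A minimal sketch of the email step in the MVP list above. It assumes the redaction check returns a dict mapping page numbers to lists of findings, each with "bbox" and "text" keys, which is the shape X-Ray's README shows for xray.inspect. The function name, addresses, and subject line are hypothetical, and the sketch only builds the message; it does not send it.

```python
from email.message import EmailMessage


def build_redaction_alert(doc_url, findings, to_addr, from_addr):
    """Build (but do not send) an admin email describing bad redactions.

    `findings` is assumed to look like:
    {page_number: [{"bbox": (x0, y0, x1, y1), "text": "..."}]}
    """
    lines = [f"Bad redactions found in {doc_url}:", ""]
    for page in sorted(findings):
        for hit in findings[page]:
            x0, y0, x1, y1 = hit["bbox"]
            lines.append(
                f"Page {page}, bbox ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}): "
                f"{hit['text']}"
            )

    msg = EmailMessage()
    msg["Subject"] = f"Bad redactions in {doc_url}"
    msg["To"] = to_addr
    msg["From"] = from_addr
    msg.set_content("\n".join(lines))
    return msg
```

In CourtListener the message would presumably go out through whatever mail machinery the project already uses; that part is left out here on purpose.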

Eventually:

Add a boolean about the problem to the DB?
Automatically notify the filer?
Other stuff too.

mlissner commented 1 year ago

This could become a major selling point for recap.email. "And if you use the email system, we'll tell you when we have redaction failures, mitigating risk for your organization."

Alberto and I chatted about the ability to figure out the filer earlier today. It seems doable from the notification email, but we might want to put a human in the loop at least at first. Notifications from the court don't show the filer, but the other ones do. The catch is that it's just their name, not their email, and certainly not their CourtListener email nor their @recap.email email.

troglodite2 commented 1 year ago

@mlissner I'll need some sample data for testing and will look into what this means. It should take me more than a day.

mlissner commented 1 year ago

Check out the automated tests for X-Ray. There are a bunch of docs in there. The homepage for X-Ray might bring you up to speed a bit too, if you haven't made it that far already: http://free.law/projects/x-ray/

troglodite2 commented 1 year ago

Got it. I had read about your X-Ray process earlier. I had not made the connection. And this makes sense now.

maxwell-bland commented 1 year ago

Nice, thanks for looking into this Christopher!

It looked like a good route might be a microservice next to the page-count one in cl/recap/tasks.py line 354

page-count is implemented via the doctor repo:

https://github.com/freelawproject/doctor/blob/770343c7ec7f95c5efffe9a7df26372ab2b3d006/doctor/urls.py#L22
https://github.com/freelawproject/doctor/blob/770343c7ec7f95c5efffe9a7df26372ab2b3d006/doctor/views.py#L172
https://github.com/freelawproject/doctor/blob/770343c7ec7f95c5efffe9a7df26372ab2b3d006/doctor/tasks.py#L129

Example tests for the endpoint are at: https://github.com/freelawproject/doctor/blob/b0afcf01695893e45ba53d3c588013d10f18d1df/doctor/tests.py#L231

My thought was to have the "redaction-check" microservice kick off a timeout-limited x-ray call in the background and return immediately from the microservice call, since checking bboxes on larger documents (>100 pages) can take a while. (Mike may have some input.)
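
A rough sketch of that idea, under stated assumptions: the check runs in a background thread so the caller gets a job id right away, and the time limit is enforced by checking a deadline between pages. Every name here (JOBS, check_pages, submit_check, the per-page check_page callable) is hypothetical. X-Ray's documented entry point inspects a whole document at once, so a per-page loop like this would need to be built on top of it.

```python
import threading
import time
import uuid

JOBS = {}  # job_id -> result dict; an in-memory stand-in for a real job store


def check_pages(pages, check_page, timeout_s):
    """Check pages one at a time, giving up once `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    findings = {}
    for number, page in enumerate(pages, start=1):
        if time.monotonic() >= deadline:
            return {
                "status": "timed_out",
                "pages_checked": number - 1,
                "findings": findings,
            }
        hits = check_page(page)
        if hits:
            findings[number] = hits
    return {"status": "done", "pages_checked": len(pages), "findings": findings}


def submit_check(pages, check_page, timeout_s):
    """Start the check in a background thread and return a job id right away."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running"}

    def run():
        JOBS[job_id] = check_pages(pages, check_page, timeout_s)

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return job_id, worker
```

The deadline is only checked between pages, so a single slow page can still overrun the limit. A hard limit would need the work to run in a separate process that can be killed.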

Of course, do whatever you feel is best, and I'd be open to collaborating. This has been on my backlog for some time. 😔

troglodite2 commented 1 year ago

Current plan of attack:

1) Add x-ray to doctor as an "xray_pdf" service.
2) Have recap/tasks.py use the microservice to get the x-ray results for the document.
3) Locate other places where PDFs need to be processed and add a call to the microservice there.
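
One detail worth keeping in mind for step 2, sketched below with hypothetical helper names: if the service returns findings as JSON, the integer page numbers come back as strings and the bbox tuples come back as lists, so the caller may want to normalize them. This assumes findings shaped like X-Ray's README output ({page: [{"bbox": ..., "text": ...}]}); it says nothing about how doctor actually formats its responses.

```python
import json


def encode_findings(findings):
    """Serialize findings for an HTTP response body (service side)."""
    return json.dumps(findings)


def decode_findings(body):
    """Restore int page numbers and tuple bboxes after a JSON round trip."""
    raw = json.loads(body)
    return {
        int(page): [
            {"bbox": tuple(hit["bbox"]), "text": hit["text"]} for hit in hits
        ]
        for page, hits in raw.items()
    }
```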

mlissner commented 1 year ago

That sounds pretty good, @troglodite2, and thanks so much for the pointers, @maxwell-bland. (@troglodite2, @maxwell-bland is a deep expert in this field, doing PhD-level work on hidden redactions and how to un-redact in very clever ways.)

@troglodite2 I think adding x-ray to doctor would be a perfectly reasonable way to do this, but I've also thought about providing x-ray as a service as well as integrating it into RECAP. The service could be something that folks build into their CMSs for a fee, and the RECAP integration could help you identify bad redactions before you upload things, or it could even be a standalone extension that just has that functionality.

In any case, for any of those fun projects we eventually have in our pipeline, having x-ray in an infinitely scalable public API would be key. My long-term goal is to do that via AWS API Gateway, and I think I have issues about that somewhere. If you have experience in that world, that might be a different approach to doing this, though the doctor approach works too, and x-ray would feel right at home in there. Then, someday, when we get our public API for this up and running, we'd just swap over to it.

Sorry, I forgot all this longer-term stuff when I said this would be an easy one!

troglodite2 commented 1 year ago

@mlissner I've done a little bit of work with AWS API Gateway.

What I see you doing is pointing me at "little" issues that are making me learn more about CL as an overall project. This is a good thing. Very much a guided tour of "here, learn this so you can be more useful".

Thank you.

mlissner commented 1 year ago

Yeah, better PM'ing on my behalf would provide tiny, well-organized stuff (like I do for the bigcases2 repo), but I haven't had time to do a good job for you, sorry. But I'm glad you're taking it in stride!

mlissner commented 1 year ago

For the API gateway, do you think it's something you could spin up? I'm not even sure what the architecture of that would be. Probably an API in front of a lambda?
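
For what "an API in front of a lambda" could look like, here is a minimal sketch of a Lambda handler using the API Gateway proxy-integration event shape ("body", "isBase64Encoded" in; "statusCode", "headers", "body" out). The make_handler wrapper and the inspect_pdf callable are hypothetical; inspect_pdf stands in for whatever actually runs X-Ray on the PDF bytes. This is one possible architecture, not a settled design.

```python
import base64
import json


def make_handler(inspect_pdf):
    """Return a Lambda-style handler that runs `inspect_pdf` on the request body."""

    def handler(event, context):
        # Binary payloads arrive base64-encoded through the proxy integration.
        if not event.get("isBase64Encoded"):
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "send the PDF as a binary body"}),
            }
        pdf_bytes = base64.b64decode(event.get("body") or "")
        if not pdf_bytes.startswith(b"%PDF"):
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "expected a PDF"}),
            }
        findings = inspect_pdf(pdf_bytes)
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(findings),
        }

    return handler
```

Passing inspect_pdf in as an argument keeps the handler testable without X-Ray or AWS installed.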

troglodite2 commented 1 year ago

The AWS side of it I can do. I'm not ready to do it for CL yet. I'm still learning the pieces of CL so can't do a reasonable analysis of what would be required.

freelawproject / courtlistener

Add X-Ray to CourtListener ingestion pipeline #2098