freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

feat(recap.mergers): Update PACER attachment processing #4665

Open flooie opened 2 weeks ago

flooie commented 2 weeks ago

@ERosendo

This change attempts to address the black hole that is doppelgänger criminal attachments.

In cases where we have "doppelgänger" dockets, it's possible that filtering by pacer_case_id excludes the docket that is actually available for the attachment.

flooie commented 2 weeks ago

Should fix issue #4664

mlissner commented 1 week ago

If there is a reliable way to detect which types of dockets can have the doppelgänger issue, we could use that condition to only apply that approach there.

Not that I know of, unfortunately.

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

albertisfu commented 1 week ago

I just put this onto your backlog for the current sprint. Can you give a size estimate for it, please?

Sure, just regarding this comment so we can get the right size:

Also, a similar approach would be required for PDF uploads, which seem to have the same issue, since the PDF receipt page appears to use the same incorrect pacer_case_id that can be found in the attachment page for the upload.

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

mlissner commented 1 week ago

Also I see you put this in the progress column. Does that mean this is the task I should work on next, even though it has a P2 priority while there are other P1 tasks in the TO DO column?

I think PRs always take priority, except over P0s, which I see as "something is burning".

Should we also fix the issue with PDF uploads in this PR, or should we open a different PR to address that issue after this one for attachments is completed?

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.
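
A minimal sketch of that idea, assuming a simplified in-memory stand-in for the document rows. The `Doc` dataclass, the `filepath` field, and `copy_pdf_across_dockets` are hypothetical names for illustration, not CourtListener's actual code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    # Hypothetical stand-in for a RECAPDocument row.
    court_id: str
    pacer_case_id: str
    pacer_doc_id: str
    filepath: Optional[str] = None  # None means no PDF has been stored yet


def copy_pdf_across_dockets(docs, court_id, pacer_doc_id, filepath):
    """Attach an uploaded PDF to every row matching court and document ID.

    pacer_case_id is deliberately left out of the match, so doppelgänger
    rows that carry a different case ID still receive the copy.
    """
    updated = []
    for doc in docs:
        if doc.court_id == court_id and doc.pacer_doc_id == pacer_doc_id:
            doc.filepath = filepath
            updated.append(doc)
    return updated
```

Matching on court_id as well keeps the copy from leaking into another court that happens to reuse the same pacer_doc_id.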

mlissner commented 1 week ago

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

Would you have space for both solutions in this sprint?

albertisfu commented 1 week ago

I think PRs always take priority, except over P0s, which I see as "something is burning".

Correct. It's just that in this case the solution we'd need to implement would be completely different from the one in the PR.

I don't know. The idea of that one is to duplicate PDFs across dockets when we get them? So that if we get a pacer_case_id, pacer_doc_id, and court_id, we ignore the pacer_case_id and just make the copy? Seems easy enough, I suppose, and I think it's a good step for doppelgänger.

Yes, that's correct. The idea I have, though I still need to analyze it further to ensure it doesn’t interfere with other uploads, is that as a first step, we always ignore the pacer_case_id if we encounter a RECAPDocument.MultipleObjectsReturned. Then, we check if the RDs belong to different dockets and merge the attachment pages and PDFs across all of them.

If we find duplicate RDs within the same docket, we apply the current logic to select the best RD and clean up duplicates.
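
Roughly, the two steps above could look like the sketch below. It assumes the lookup has already been retried without pacer_case_id and returned several rows. The `RD` dataclass, `pick_best`, and `resolve_targets` are hypothetical, and `pick_best` is only a placeholder for the existing selection logic:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RD:
    # Hypothetical stand-in for a RECAPDocument row.
    id: int
    docket_id: int
    pacer_doc_id: str
    is_available: bool = False


def pick_best(rds):
    # Placeholder rule (an assumption): prefer a row that already has a PDF,
    # then the lowest id.
    return min(rds, key=lambda rd: (not rd.is_available, rd.id))


def resolve_targets(matches):
    """Split rows matched by pacer_doc_id into merge targets and duplicates.

    One target is kept per docket, so the attachment page or PDF is merged
    into every doppelgänger docket. Extra rows within the same docket are
    returned separately for cleanup.
    """
    by_docket = defaultdict(list)
    for rd in matches:
        by_docket[rd.docket_id].append(rd)

    targets, duplicates = [], []
    for rds in by_docket.values():
        best = pick_best(rds)
        targets.append(best)
        duplicates.extend(rd for rd in rds if rd is not best)
    return targets, duplicates
```

With rows 1 and 2 in docket 10 and row 3 in docket 20, this keeps one row from docket 10 plus row 3 as merge targets and flags the other docket 10 row as a duplicate.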

This approach would essentially make the pacer_case_id parameter useless. That’s why I was concerned about finding a way to detect potential doppelgängers and only apply the logic in those cases. So now my question is: Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

What's the size difference between the two? Maybe we do a doppelgänger sprint and fix the damned thing, and this waits, or maybe it's easy and we get it done.

The first one, which belongs to this PR, will focus on merging the attachment page for a document across all related doppelgänger cases.

The other one, related to PDF uploads, will focus on merging PDFs for both main documents and attachments across all related doppelgänger cases.

So I think it’s best to have a separate issue and PR for the PDFs.

I’d say both tasks are medium-sized, as we need to analyze the best way to adjust the current merge code to ensure it works for all possible sources from which we receive attachments and PDF uploads.

Would you have space for both solutions in this sprint?

I think we have space for at least the attachments one. And if it's not too problematic, we could complete the PDF one as well. However, I think it would be better to leave them for the end of the sprint, after we finish the API-related issues, to ensure the API priority is completed.

mlissner commented 1 week ago

Is it possible to have a document with the same pacer_doc_id in a court where the documents aren’t the same and the cases aren’t of the doppelgänger type? I think we haven’t seen anything like that, but I just want to be sure.

I wondered the same thing. I think it's OK; we haven't seen anything like that, as far as I know.

Sounds great about prioritization. If this makes it in, cool. If not, it's tricky stuff and not a priority, so that's fine too!

Thank you.