freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Appellate attachments are getting duplicated #2877

Open ERosendo opened 1 year ago

ERosendo commented 1 year ago

While debugging an issue reported by a user of the extension, I noticed that the attachment documents from docket 23-1457 were duplicated.

Here's an example:

Screenshot 2023-07-10 at 6 38 40 PM

The duplicated document appeared after the extension uploaded the attachment page.

I used the API to compare those documents and found that one entry has "00804758070" as the document_number and the other shows "814758070". The first one seems to be using the normalized pacer_doc_id as its document number. Here are the responses:

"resource_uri": "https://www.courtlistener.com/api/rest/v3/recap-documents/362825982/",
"id": 362825982,
"tags": [],
"absolute_url": "/docket/66980776/814758070/2/united-states-v-state-of-missouri/",
"date_created": "2023-07-07T14:46:11.107290-07:00",
"date_modified": "2023-07-10T13:57:47.884053-07:00",
"sha1": "",
"page_count": 1,
"file_size": null,
"filepath_local": null,
"filepath_ia": "",
"ia_upload_failure_count": null,
"thumbnail": null,
"thumbnail_status": 0,
"plain_text": "",
"ocr_status": null,
"date_upload": null,
"document_number": "814758070",
"attachment_number": 2,
"pacer_doc_id": "00804758071",
"is_available": false,
"is_free_on_pacer": null,
"is_sealed": null,
"document_type": 2,
"description": "CovLtrAmBrFiled"
ERosendo commented 1 year ago

@albertisfu suggested we could compare the stored HTML files for this entry. I followed his suggestion and found that the only significant difference between the first HTML file uploaded (uploaded on May 19) and the latest one (uploaded today) was the naming style of the HTML data attributes created by the extension, which makes sense because we introduced that change in version 2.0.2 and it was released on Jun 8. Here is the output of the diff command:

--- first_attachment_page_uploaded.html 
+++ latest_attachment_page_uploaded.html
@@ -28,9 +28,9 @@
                 <td align="center">1</td>
                 <td align="center"><a
                     href="https://ecf.ca8.uscourts.gov/docs1/00814759939?caseId=105396&amp;recapAttNum=1"
-                    onauxclick="return doDocPostURL('00814759939')" target="_self" data-pacer_doc_id="00804759939"
-                    data-pacer_dls_id="00814759939" data-pacer_case_id="105396" data-pacer_tab_id="68648495"
-                    data-document_number="00804759939" data-attachment_number="1"><img title="Open Document" width="13"
+                    onauxclick="return doDocPostURL('00814759939')" target="_self" data-pacer-doc-id="00804759939"
+                    data-pacer-dls-id="00814759939" data-pacer-case-id="105396" data-pacer-tab-id="715188880"
+                    data-document-number="00804759939" data-attachment-number="1"><img title="Open Document" width="13"
                       height="15" border="2" src="TransportRoom?servlet=document.gif" alt="Open document"></a>&nbsp;
                 </td>
                 <td>Amicus Brief of Gun Owners of America, Inc., et al.</td>
\ No newline at end of file
@@ -42,9 +42,9 @@
                 <td align="center">2</td>
                 <td align="center"><a
                     href="https://ecf.ca8.uscourts.gov/docs1/00814759950?caseId=105396&amp;recapAttNum=2"
-                    onauxclick="return doDocPostURL('00814759950')" target="_self" data-pacer_doc_id="00804759950"
-                    data-pacer_dls_id="00814759950" data-pacer_case_id="105396" data-pacer_tab_id="68648495"
-                    data-document_number="00804759950" data-attachment_number="2"><img title="Open Document" width="13"
+                    onauxclick="return doDocPostURL('00814759950')" target="_self" data-pacer-doc-id="00804759950"
+                    data-pacer-dls-id="00814759950" data-pacer-case-id="105396" data-pacer-tab-id="715188880"
+                    data-document-number="00804759950" data-attachment-number="2"><img title="Open Document" width="13"
                       height="15" border="2" src="TransportRoom?servlet=document.gif" alt="Open form"></a>&nbsp;</td>
                 <td>CovLtrAmBrFiled</td>
                 <td align="center">2</td>
\ No newline at end of file
@@ -87,7 +87,7 @@
          // link was done so params aren't in copied doc hyperlinks.
          // This allows user to right click & get the URL for copying doc
          // links, but still gets params back to the server as needed
-         var aWin = window.open('TransportRoom?servlet=ShowDoc&caseId=105396&dls_id='+dls+'&caseId=105396',winTarget,winOptions,false);
+         var aWin = window.open('TransportRoom?servlet=ShowDoc&pacer=i&caseId=105396&dls_id='+dls+'&caseId=105396',winTarget,winOptions,false);
          return false;
       }

\ No newline at end of file
@@ -175,7 +175,7 @@
            if (document.dktEntry.incPdfFooter.checked) {
              incFoot = 'y';
            }
-           window.location='TransportRoom?servlet=ShowDocMulti&caseId=105396&outputType=doc&d=5278888&outputForm=view&incPdfFooter='+incFoot+'&dls='+dlsIdArr.join();
+           window.location='TransportRoom?servlet=ShowDocMulti&pacer=i&caseId=105396&outputType=doc&d=5278888&outputForm=view&incPdfFooter='+incFoot+'&dls='+dlsIdArr.join();
          }
          return false;
       }
\ No newline at end of file

Here's a zip with the HTML pages: attachment pages.zip

johnhawkinson commented 1 year ago

> @albertisfu suggested we could compare the stored HTML files for this entry. I followed his suggestion and found that the only significant difference between the first HTML file uploaded (uploaded on May 19) and the latest one (uploaded today) was the name styling of the HTML data attributes created by the extension, which makes sense because we introduced that change in version 2.0.2 and it was released on Jun 8.

At first I thought you were saying that the change in the naming of the data attributes changed whether the leading zeros were preserved in what is now the data-pacer-doc-id (formerly data-pacer_doc_id) field, but that is not what you are saying (not sure why wdiff did not produce the clear output I had hoped for...). Here's the massaged wdiff, showing that no attribute value changes other than data-pacer-tab-id:

jhawk@lrr /tmp % wdiff da[12]
[-data-attachment_number-]{+data-attachment-number+}        "1"
[-data-document_number-]
{+data-document-number+}        "00804759939"
[-data-pacer_case_id-]
{+data-pacer-case-id+}      "105396"
[-data-pacer_dls_id-]
{+data-pacer-dls-id+}       "00814759939"
[-data-pacer_doc_id-]
{+data-pacer-doc-id+}       "00804759939"
[-data-pacer_tab_id     "68648495"-]
{+data-pacer-tab-id     "715188880"+}
onauxclick          "return doDocPostURL('00814759939')"
target              "_self"
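
The same comparison can be approximated without wdiff. The following is a minimal sketch using only the Python standard library; `DataAttrCollector`, `data_attrs`, and the two abridged anchor tags are hypothetical illustrations built from the diff above, not code from the extension or from CourtListener. It treats the old underscore names and the new hyphenated names as the same key and reports which values differ.

```python
from html.parser import HTMLParser


class DataAttrCollector(HTMLParser):
    """Collect the data-* attributes of the first <a> tag in a fragment."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.found:
            for name, value in attrs:
                if name.startswith("data-"):
                    # data-pacer_doc_id and data-pacer-doc-id map to one key
                    self.found[name.replace("_", "-")] = value


def data_attrs(html):
    parser = DataAttrCollector()
    parser.feed(html)
    return parser.found


# Abridged from the first and the latest uploaded attachment pages
old = ('<a data-pacer_doc_id="00804759939" data-pacer_dls_id="00814759939" '
       'data-pacer_case_id="105396" data-pacer_tab_id="68648495" '
       'data-document_number="00804759939" data-attachment_number="1">')
new = ('<a data-pacer-doc-id="00804759939" data-pacer-dls-id="00814759939" '
       'data-pacer-case-id="105396" data-pacer-tab-id="715188880" '
       'data-document-number="00804759939" data-attachment-number="1">')

old_attrs, new_attrs = data_attrs(old), data_attrs(new)
changed = sorted(k for k in old_attrs if old_attrs[k] != new_attrs.get(k))
print(changed)  # ['data-pacer-tab-id']
```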

I think there are multiple contexts in which these dls/doc ids are used and they are not consistently normalized.

"Obviously" the server should be normalizing them, but probably the client should too?

mlissner commented 1 year ago

I think we usually expect the client to normalize these (though, yeah, the server should do so too, though that was less important when we were the only ones uploading things).

Are both of these uploads from the extension or could one be from a different source? I think we can check if they both were uploaded by the recap user?

mlissner commented 1 year ago

@ERosendo I'm guessing this is an extension issue, so I'm putting it on your backlog at first. Let's make this a priority since it could be impacting data until we have a fix.

albertisfu commented 1 year ago

Eduardo told me about this issue, and I reviewed it briefly.

The problem appears to be related to the document_number for appellate attachment documents.

In RECAP Documents (both district and appellate), the document number in attachments is taken from the main document.

This is not a problem in district and appellate courts that use "normal numbers." The problem arises in appellate courts that do not yet use document numbers.

The first time we parse an attachment page for an appellate document, we convert the main RD to an attachment. If there are additional attachments, they are created with the document_number assigned to that of the "parent document."

This document_number is the pacer_doc_id in courts that do not use numbers.

According to Juriscraper, when parsing an attachment page, the document_number for the parent document is the one with the lower value, as the parent document is not always preserved as attachment 1.

For example, in this case, the pacer_doc_id 006014794684 that is shown in the docket_entry on the docket sheet belongs to attachment 2 when parsing the attachment page.

{
  "attachments": [
    {
      "attachment_number": 1,
      "description": "Cover Letter",
      "pacer_doc_id": "006014794690",
      "page_count": 3
    },
    {
      "attachment_number": 2,
      "description": "opinion",
      "pacer_doc_id": "006014794684",
      "page_count": 48
    }
  ],
  "pacer_case_id": null,
  "pacer_doc_id": "006014794684",
  "pacer_seq_no": "6868069"
}
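
To illustrate the "lowest value" rule with the data above, here is a minimal sketch. It is not Juriscraper's actual code; `lowest_pacer_doc_id` is a hypothetical helper that only mirrors the rule as described in this comment.

```python
parsed = {
    "attachments": [
        {"attachment_number": 1, "description": "Cover Letter",
         "pacer_doc_id": "006014794690", "page_count": 3},
        {"attachment_number": 2, "description": "opinion",
         "pacer_doc_id": "006014794684", "page_count": 48},
    ],
    "pacer_doc_id": "006014794684",
}


def lowest_pacer_doc_id(attachments):
    """Return the pacer_doc_id with the lowest numeric value (hypothetical helper)."""
    return min((a["pacer_doc_id"] for a in attachments), key=int)


main_id = lowest_pacer_doc_id(parsed["attachments"])
main_attachment = next(
    a for a in parsed["attachments"] if a["pacer_doc_id"] == main_id
)

# The parent document ends up as attachment 2, not attachment 1.
print(main_id, main_attachment["attachment_number"])
```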

There are other cases where the document with the main pacer_doc_id remains as attachment 1.

But the problem is as follows:

When parsing the attachment page, the document_number in RECAP documents for attachments (in courts without numbers) is populated with the main pacer_doc_id.

However, when a PDF is uploaded for an attachment whose pacer_doc_id differs from that of the "main" document, the document_number received in the request is derived from the pacer_doc_id of the document being uploaded (extracted from the receipt page), not from the main document. In process_recap_pdf, the document number is then overwritten with the number received in the request: rd.document_number = pq.document_number

Consequently, if the attachment page is uploaded again and we search for existing RECAP documents that match these attachments, the RECAPDocument for the attachment whose document_number was updated during the PDF upload can't be found, since its document_number now differs from the "main" document_number.

The solution seems to be to avoid updating the document_number for these documents (appellate attachments from courts that don't use numbers) when uploading the PDF. Since the main pacer_doc_id cannot be retrieved from the receipt page, Eduardo suggested normalizing the pacer_doc_id sent to CL from the extension as the document_number (by changing the fourth digit from 1 to 0). Then, in CL, we could detect whether the upload is an attachment and, if the document_number received in the request matches the pacer_doc_id, skip updating the document number.
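
A rough sketch of that check, under stated assumptions: `normalize_pacer_doc_id` and `should_update_document_number` are hypothetical names, pacer_doc_id is assumed to be a zero-padded string, and the comparison is done on integers because the request's document_number may arrive without leading zeros. This is not the code in process_recap_pdf.

```python
def normalize_pacer_doc_id(pacer_doc_id: str) -> str:
    """Set the fourth digit to 0, e.g. 00814759950 -> 00804759950."""
    return pacer_doc_id[:3] + "0" + pacer_doc_id[4:]


def should_update_document_number(is_attachment, pacer_doc_id, request_document_number):
    """Return False when the request's number is only the upload's own pacer_doc_id."""
    if not is_attachment:
        return True
    # Integer comparison ignores leading zeros on either side.
    same_as_doc_id = (
        int(normalize_pacer_doc_id(pacer_doc_id)) == int(request_document_number)
    )
    return not same_as_doc_id


# Attachment 2 from the example docket, with the extension sending the
# normalized pacer_doc_id as document_number (assumed behavior after the fix).
print(should_update_document_number(True, "00804759950", 804759950))  # False
```

With this guard, the attachment would keep the document_number it inherited from the main document when the attachment page was merged.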

mlissner commented 1 year ago

I'm sorry, I'm pretty lost...do you think you could do an example, Alberto, with realistic numbers?

albertisfu commented 1 year ago

Sure, here is an example using a real docket entry and attachment page where this issue happened.

https://www.courtlistener.com/docket/66980776/united-states-v-state-of-missouri/#entry-804759939

804759939 (pacer_doc_id: 00804759939)

May 19, 2023

BRIEF FILED - AMICUS BRIEF filed by America's Future, Conservative Legal Defense and Education Fund, Downsize DC Foundation, DownsizeDC.org, Gun Owners Foundation, Gun Owners of America, Inc., Gun Owners of California, Heller Foundation and Virginia Delegate David LaRock, w/service 05/19/2023. Length: 6,396 words. 10 COPIES OF PAPER BRIEFS (WITHOUT THE APPELLATE PDF FOOTER) FROM America's Future, Conservative Legal Defense and Education Fund, Downsize DC Foundation, DownsizeDC.org, Gun Owners Foundation, Gun Owners of America, Inc., Gun Owners of California, Heller Foundation and Virginia Delegate David LaRock due 05/24/2023 WITH certificate of service for paper briefs. [5278888] [23-1457] (HAG) [Entered: 05/19/2023 10:19 AM]

When the entry is created, the related RECAPDocument is created as the main document, with 00804759939 as both its pacer_doc_id and its document_number.

The related attachment page is as follows:

Screenshot 2023-07-11 at 16 09 00

attachment_number  pacer_doc_id  description
1                  00804759939   Amicus Brief of Gun Owners of America, Inc., et al.
2                  00804759950   CovLtrAmBrFiled

When this attachment page is uploaded, its parsed data looks like:

{
   "pacer_doc_id":"00804759939",
   "pacer_case_id":"105396",
   "pacer_seq_no":"5278888",
   "attachments":[
      {
         "attachment_number":1,
         "description":"Amicus Brief of Gun Owners of America, Inc., et al.",
         "page_count":36,
         "pacer_doc_id":"00804759939"
      },
      {
         "attachment_number":2,
         "description":"CovLtrAmBrFiled",
         "page_count":2,
         "pacer_doc_id":"00804759950"
      }
   ]
}

So the main pacer_doc_id, the pacer_case_id and the court in the request are used to look for the main RECAPDocument into which these attachments should be merged.

In appellate courts that don't use numbers, the document_number is populated with the pacer_doc_id.

So when merging the attachment page, the main document is converted to an attachment and the document_number for each new RECAPDocument is populated with the main "document_number" (the main pacer_doc_id), in this case 00804759939.

So the attachments in the DB look as follows after being merged:

document_number  pacer_doc_id  attachment_number  description
00804759939      00804759939   1                  Amicus Brief of Gun Owners of America, Inc., et al.
00804759939      00804759950   2                  CovLtrAmBrFiled

The problem arises when uploading a PDF.

The request to upload a PDF for an appellate attachment document looks like:

upload_type: 3
filepath_local: "the file"
court: ca8
pacer_case_id: 105396
pacer_doc_id: 00804759950
document_number: 814759950
attachment_number: 2

In this case the upload is processed in process_recap_pdf, where the RECAPDocument is looked up by pacer_case_id 105396 and pacer_doc_id 00804759950, so the PDF is properly uploaded. But the same method also updates the document number: rd.document_number = pq.document_number

Here it is set to 814759950, the document_number received in the request (extracted from the receipt page).

So after the PDF is uploaded, the attachments look like this in the DB:

document_number  pacer_doc_id  attachment_number  description
00804759939      00804759939   1                  Amicus Brief of Gun Owners of America, Inc., et al.
814759950        00804759950   2                  CovLtrAmBrFiled

Now if the attachment page is uploaded again, when merging the attachments in merge_attachment_page_data, it uses this code:

rd, created = RECAPDocument.objects.update_or_create(
    docket_entry=de,
    document_number=document_number,
    attachment_number=attachment["attachment_number"],
    document_type=RECAPDocument.ATTACHMENT,
)

Here document_number is the parent document's pacer_doc_id, 00804759939. Since the document_number for attachment 2 was changed to 814759950 by the PDF upload, this RD is not found, so a new attachment RD is created, leading to duplicated attachments.
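
The sequence can be reproduced without Django. The sketch below is a simplified simulation, not CourtListener code: `update_or_create` and `merge_attachment_page` are hypothetical stand-ins that operate on a list of dicts, and the step that converts the main document into an attachment is omitted.

```python
def update_or_create(rows, lookup, defaults):
    """Tiny stand-in for Django's update_or_create over a list of dicts."""
    for row in rows:
        if all(row.get(key) == value for key, value in lookup.items()):
            row.update(defaults)
            return row, False
    row = {**lookup, **defaults}
    rows.append(row)
    return row, True


def merge_attachment_page(rows, parsed):
    """Look up each attachment by (document_number, attachment_number)."""
    for att in parsed["attachments"]:
        update_or_create(
            rows,
            lookup={
                "document_number": parsed["pacer_doc_id"],
                "attachment_number": att["attachment_number"],
            },
            defaults={"pacer_doc_id": att["pacer_doc_id"]},
        )


parsed = {
    "pacer_doc_id": "00804759939",
    "attachments": [
        {"attachment_number": 1, "pacer_doc_id": "00804759939"},
        {"attachment_number": 2, "pacer_doc_id": "00804759950"},
    ],
}

rows = []
merge_attachment_page(rows, parsed)        # first attachment page upload
rows[1]["document_number"] = "814759950"   # PDF upload overwrites the number
merge_attachment_page(rows, parsed)        # attachment page uploaded again

print(len(rows))  # 3: attachment 2 now exists twice
```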

We could also change the previous code to use the pacer_doc_id instead of the document_number to look for the attachment document:

rd, created = RECAPDocument.objects.update_or_create(
    docket_entry=de,
    pacer_doc_id=attachment["pacer_doc_id"],
    attachment_number=attachment["attachment_number"],
    document_type=RECAPDocument.ATTACHMENT,
)

This could also solve the issue, but I think we still don't want a PDF upload to update the document_number of an appellate attachment that uses the pacer_doc_id as its number, since attachments in RECAP use the same document_number as the "main" document.
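
As a rough illustration of both points, the sketch below starts from the DB state after the PDF upload and matches attachments by pacer_doc_id instead of document_number. It is a plain-Python simulation with hypothetical names, not the actual merge_attachment_page_data code.

```python
# State after the PDF upload, as in the second table above
rows = [
    {"document_number": "00804759939", "pacer_doc_id": "00804759939",
     "attachment_number": 1},
    {"document_number": "814759950", "pacer_doc_id": "00804759950",
     "attachment_number": 2},
]

attachments = [
    {"attachment_number": 1, "pacer_doc_id": "00804759939"},
    {"attachment_number": 2, "pacer_doc_id": "00804759950"},
]

created = 0
for att in attachments:
    # Match on pacer_doc_id and attachment_number, ignoring document_number
    match = next(
        (r for r in rows
         if r["pacer_doc_id"] == att["pacer_doc_id"]
         and r["attachment_number"] == att["attachment_number"]),
        None,
    )
    if match is None:
        rows.append({"document_number": "00804759939", **att})
        created += 1

print(created, len(rows))  # 0 2: no duplicate is created

# The two attachments still disagree on document_number, which is the
# remaining concern described above.
print(rows[0]["document_number"] == rows[1]["document_number"])  # False
```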

Let me know if you need more details about this issue.

albertisfu commented 1 year ago

@mlissner I've updated the previous comment with a detailed example.

mlissner commented 1 year ago

Thanks for the details; that made a world of difference.

> We could also change the previous code to use the pacer_doc_id instead of the document_number to look for the attachment document

That seems fine and helpful to me, sure.

But:

> I think we still don't want the document_number for an appellate attachment being updated for a PDF upload,

I think this is the main point. Both the client and the server should be normalizing any numbers that need normalization.

@ERosendo said that:

> ProcessingQueue model uses a BigIntegerField to store the document number

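That means no leading zeros unless we update that model. We can certainly do that if that's the best solution, but in the meantime, we should decide to:

  • Normalize the fourth digit to 1 or 0, whichever is correct, I forget.
  • Decide on leading zeros or not, and then make sure we have them (or don't)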

Does that all sound right to you guys?

ttys0dev commented 11 months ago

> That means no leading zeros unless we update that model. We can certainly do that if that's the best solution, but in the meantime, we should decide to:
>
>   • Normalize the fourth digit to 1 or 0, whichever is correct, I forget.
>   • Decide on leading zeros or not, and then make sure we have them (or don't)
>
> Does that all sound right to you guys?

I think we need the leading zeros: they matter if we need to map or validate doc ID prefixes against court IDs. Also, without leading zeros, normalizing the fourth digit is problematic, since the leading zeros determine which position counts as the fourth digit.
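
A small sketch of the problem, using the pacer_doc_id form of the document_number from the PDF upload example earlier in this thread. `zero_fourth_digit` is a hypothetical helper, and the round trip through `int` stands in for storing the value in an integer field.

```python
def zero_fourth_digit(doc_id: str) -> str:
    """Replace the character at the fourth position with 0."""
    return doc_id[:3] + "0" + doc_id[4:]


pacer_doc_id = "00814759950"

# Round-tripping through an integer drops the leading zeros.
stripped = str(int(pacer_doc_id))
print(stripped)                          # 814759950

# With the zeros intact, the fourth digit is the 1 that must become 0.
print(zero_fourth_digit(pacer_doc_id))   # 00804759950

# Without them, the "fourth digit" is a different digit entirely.
print(zero_fourth_digit(stripped))       # 814059950

# Padding back only works if the original width is known.
restored = stripped.zfill(len(pacer_doc_id))
print(zero_fourth_digit(restored))       # 00804759950
```

Re-padding depends on knowing the original width, and this thread shows both 11-digit (00804759950) and 12-digit (006014794684) IDs, so keeping the value as a string avoids having to guess.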