Open ERosendo opened 1 year ago
@albertisfu suggested we could compare the stored HTML files for this entry. I followed his suggestion and found that the only significant difference between the first HTML file uploaded (uploaded on May 19) and the latest one (uploaded today) was the naming style of the HTML data attributes created by the extension, which makes sense because we introduced that change in version 2.0.2 and it was released on Jun 8. Here is the output of the diff command:
--- first_attachment_page_uploaded.html
+++ latest_attachment_page_uploaded.html
@@ -28,9 +28,9 @@
<td align="center">1</td>
<td align="center"><a
href="https://ecf.ca8.uscourts.gov/docs1/00814759939?caseId=105396&recapAttNum=1"
- onauxclick="return doDocPostURL('00814759939')" target="_self" data-pacer_doc_id="00804759939"
- data-pacer_dls_id="00814759939" data-pacer_case_id="105396" data-pacer_tab_id="68648495"
- data-document_number="00804759939" data-attachment_number="1"><img title="Open Document" width="13"
+ onauxclick="return doDocPostURL('00814759939')" target="_self" data-pacer-doc-id="00804759939"
+ data-pacer-dls-id="00814759939" data-pacer-case-id="105396" data-pacer-tab-id="715188880"
+ data-document-number="00804759939" data-attachment-number="1"><img title="Open Document" width="13"
height="15" border="2" src="TransportRoom?servlet=document.gif" alt="Open document"></a>
</td>
<td>Amicus Brief of Gun Owners of America, Inc., et al.</td>
\ No newline at end of file
@@ -42,9 +42,9 @@
<td align="center">2</td>
<td align="center"><a
href="https://ecf.ca8.uscourts.gov/docs1/00814759950?caseId=105396&recapAttNum=2"
- onauxclick="return doDocPostURL('00814759950')" target="_self" data-pacer_doc_id="00804759950"
- data-pacer_dls_id="00814759950" data-pacer_case_id="105396" data-pacer_tab_id="68648495"
- data-document_number="00804759950" data-attachment_number="2"><img title="Open Document" width="13"
+ onauxclick="return doDocPostURL('00814759950')" target="_self" data-pacer-doc-id="00804759950"
+ data-pacer-dls-id="00814759950" data-pacer-case-id="105396" data-pacer-tab-id="715188880"
+ data-document-number="00804759950" data-attachment-number="2"><img title="Open Document" width="13"
height="15" border="2" src="TransportRoom?servlet=document.gif" alt="Open form"></a> </td>
<td>CovLtrAmBrFiled</td>
<td align="center">2</td>
\ No newline at end of file
@@ -87,7 +87,7 @@
// link was done so params aren't in copied doc hyperlinks.
// This allows user to right click & get the URL for copying doc
// links, but still gets params back to the server as needed
- var aWin = window.open('TransportRoom?servlet=ShowDoc&caseId=105396&dls_id='+dls+'&caseId=105396',winTarget,winOptions,false);
+ var aWin = window.open('TransportRoom?servlet=ShowDoc&pacer=i&caseId=105396&dls_id='+dls+'&caseId=105396',winTarget,winOptions,false);
return false;
}
\ No newline at end of file
@@ -175,7 +175,7 @@
if (document.dktEntry.incPdfFooter.checked) {
incFoot = 'y';
}
- window.location='TransportRoom?servlet=ShowDocMulti&caseId=105396&outputType=doc&d=5278888&outputForm=view&incPdfFooter='+incFoot+'&dls='+dlsIdArr.join();
+ window.location='TransportRoom?servlet=ShowDocMulti&pacer=i&caseId=105396&outputType=doc&d=5278888&outputForm=view&incPdfFooter='+incFoot+'&dls='+dlsIdArr.join();
}
return false;
}
\ No newline at end of file
Here's a zip with the HTML pages: attachment pages.zip
@albertisfu suggested we could compare the stored HTML files for this entry. I followed his suggestion and found that the only significant difference between the first HTML file uploaded (uploaded on May 19) and the latest one (uploaded today) was the name styling of the HTML data attributes created by the extension, which makes sense because we introduced that change in version 2.0.2 and it was released on Jun 8.
At first I thought you were saying that the change in the naming of the data attributes caused a change in whether the leading zeros were preserved in the now data-pacer-doc-id (formerly data-pacer_doc_id) field, but that is not what you are saying (not sure why wdiff did not produce the clear output I had hoped for...). But here's the massaged wdiff showing that no attribute values changed other than data-pacer-tab-id:
jhawk@lrr /tmp % wdiff da[12]
[-data-attachment_number-]{+data-attachment-number+} "1"
[-data-document_number-]
{+data-document-number+} "00804759939"
[-data-pacer_case_id-]
{+data-pacer-case-id+} "105396"
[-data-pacer_dls_id-]
{+data-pacer-dls-id+} "00814759939"
[-data-pacer_doc_id-]
{+data-pacer-doc-id+} "00804759939"
[-data-pacer_tab_id "68648495"-]
{+data-pacer-tab-id "715188880"+}
onauxclick "return doDocPostURL('00814759939')"
target "_self"
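The same check can be sketched in Python with the standard library's html.parser. This is only an illustration: the AnchorAttrs class and changed_attrs helper are hypothetical names, and the two sample tags are trimmed-down versions of the anchors in the diff above. It treats the underscore and hyphen spellings of an attribute name as the same key and reports only the attributes whose values differ.

```python
from html.parser import HTMLParser


class AnchorAttrs(HTMLParser):
    """Collect the attributes of the first <a> tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.attrs is None:
            # Treat data-pacer_doc_id and data-pacer-doc-id as the same key.
            self.attrs = {name.replace("_", "-"): value for name, value in attrs}


def changed_attrs(old_html, new_html):
    """Return {name: (old, new)} for attributes whose values differ."""
    parsed = []
    for html in (old_html, new_html):
        parser = AnchorAttrs()
        parser.feed(html)
        parsed.append(parser.attrs or {})
    old, new = parsed
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


old = '<a data-pacer_doc_id="00804759939" data-pacer_tab_id="68648495">x</a>'
new = '<a data-pacer-doc-id="00804759939" data-pacer-tab-id="715188880">x</a>'
print(changed_attrs(old, new))
```

On these two sample tags only data-pacer-tab-id is reported, which matches the wdiff output.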
I think there are multiple contexts in which these dls/doc ids are used and they are not consistently normalized.
"Obviously" the server should be normalizing them, but probably the client should too?
I think we usually expect the client to normalize these (though, yeah, the server should do so too, though that was less important when we were the only ones uploading things).
Are both of these uploads from the extension, or could one be from a different source? I think we can check whether they were both uploaded by the recap user?
@ERosendo I'm guessing this is an extension issue, so I'm putting it on your backlog at first. Let's make this a priority since it could be impacting data until we have a fix.
Eduardo told me about this issue, and I reviewed it briefly.
The problem appears to be related to the document_number for appellate attachment documents.
In RECAP Documents (both district and appellate), the document number in attachments is taken from the main document.
This is not a problem in district and appellate courts that use "normal numbers." The problem arises in appellate courts that do not yet use document numbers.
The first time we parse an attachment page for an appellate document, we convert the main RD to an attachment. If there are additional attachments, they are created with the document_number set to that of the "parent document."
In courts that do not use numbers, this document_number is the pacer_doc_id.
According to Juriscraper, when parsing an attachment page, the document_number for the parent document is the one with the lower value, as the parent document is not always preserved as attachment 1.
For example, in this case, the pacer_doc_id 006014794684 that is shown in the docket_entry on the docket sheet belongs to attachment 2 when parsing the attachment page.
{
"attachments": [
{
"attachment_number": 1,
"description": "Cover Letter",
"pacer_doc_id": "006014794690",
"page_count": 3
},
{
"attachment_number": 2,
"description": "opinion",
"pacer_doc_id": "006014794684",
"page_count": 48
}
],
"pacer_case_id": null,
"pacer_doc_id": "006014794684",
"pacer_seq_no": "6868069"
}
There are other cases where the document with the main pacer_doc_id remains as attachment 1.
But the problem is as follows:
When parsing the attachment page, the document_number in RECAP documents for attachments (in courts without numbers) is populated with the main pacer_doc_id.
However, when a PDF upload is received for an attachment whose pacer_doc_id differs from that of the "main" document, the document_number received in the request is actually derived from the pacer_doc_id of the document being uploaded (extracted from the receipt page), not from the main document. Therefore, in process_recap_pdf, the document number is updated based on the number received in the request:
rd.document_number = pq.document_number
Consequently, if the attachment page is uploaded again and we search for existing RECAP documents that match these attachments, the RECAPDocument for the attachment whose document_number was updated when the PDF was uploaded can't be found, since its document_number now differs from the "main" document_number.
The solution seems to be to avoid updating the document_number for these documents (appellate attachments from courts that don't use numbers) when uploading the PDF. Since the main pacer_doc_id cannot be retrieved from the receipt page, Eduardo suggested normalizing the pacer_doc_id sent to CL from the extension as the document_number (by replacing the fourth digit from 1 to 0). Then, in CL, we could detect whether the upload is an attachment and, if the document_number received in the request matches the pacer_doc_id, avoid updating the document number.
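A minimal sketch of that proposal, assuming the IDs are handled as strings. Both function names are hypothetical and this is not the actual extension or CL code. The zero-padding step is an extra assumption, added because the document_number in the request may arrive without its leading zeros.

```python
def normalize_pacer_doc_id(doc_id: str) -> str:
    """Force the fourth digit of a zero-padded pacer_doc_id to "0"."""
    if len(doc_id) < 4:
        return doc_id
    return doc_id[:3] + "0" + doc_id[4:]


def should_update_document_number(
    is_attachment: bool, request_document_number: str, pacer_doc_id: str
) -> bool:
    """Hypothetical guard: return False when the number in the request is
    just the attachment's own pacer_doc_id rather than a real number."""
    # Assumption: the request value may have lost its leading zeros, so pad
    # it back to the pacer_doc_id's length before comparing.
    padded = str(request_document_number).zfill(len(pacer_doc_id))
    same_id = normalize_pacer_doc_id(padded) == normalize_pacer_doc_id(pacer_doc_id)
    return not (is_attachment and same_id)


print(normalize_pacer_doc_id("00814759950"))
print(should_update_document_number(True, "814759950", "00804759950"))
```

With a guard like this, a PDF upload whose document_number is only the attachment's own ID would leave the existing document_number untouched.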
I'm sorry, I'm pretty lost...do you think you could do an example, Alberto, with realistic numbers?
Sure, here is an example using a real docket entry and attachment page where this issue happened.
https://www.courtlistener.com/docket/66980776/united-states-v-state-of-missouri/#entry-804759939
804759939 (pacer_doc_id: 00804759939)
May 19, 2023
BRIEF FILED - AMICUS BRIEF filed by America's Future, Conservative Legal Defense and Education Fund, Downsize DC Foundation, DownsizeDC.org, Gun Owners Foundation, Gun Owners of America, Inc., Gun Owners of California, Heller Foundation and Virginia Delegate David LaRock, w/service 05/19/2023. Length: 6,396 words. 10 COPIES OF PAPER BRIEFS (WITHOUT THE APPELLATE PDF FOOTER) FROM America's Future, Conservative Legal Defense and Education Fund, Downsize DC Foundation, DownsizeDC.org, Gun Owners Foundation, Gun Owners of America, Inc., Gun Owners of California, Heller Foundation and Virginia Delegate David LaRock due 05/24/2023 WITH certificate of service for paper briefs. [5278888] [23-1457] (HAG) [Entered: 05/19/2023 10:19 AM]
When the entry is created, the related RECAPDocument is created as the main document, with 00804759939 as both pacer_doc_id and document_number.
The related attachment page is as follows:
attachment_number | pacer_doc_id | description |
---|---|---|
1 | 00804759939 | Amicus Brief of Gun Owners of America, Inc., et al. |
2 | 00804759950 | CovLtrAmBrFiled |
When this attachment page is uploaded, its parsed data looks like:
{
"pacer_doc_id":"00804759939",
"pacer_case_id":"105396",
"pacer_seq_no":"5278888",
"attachments":[
{
"attachment_number":1,
"description":"Amicus Brief of Gun Owners of America, Inc., et al.",
"page_count":36,
"pacer_doc_id":"00804759939"
},
{
"attachment_number":2,
"description":"CovLtrAmBrFiled",
"page_count":2,
"pacer_doc_id":"00804759950"
}
]
}
So the main pacer_doc_id, the pacer_case_id and the court in the request are used to look for the main RECAPDocument that these attachments should be merged into.
In appellate courts that don't use numbers, the document_number is populated with the pacer_doc_id.
So when merging the attachment page, the main document is converted to an attachment and the document_number for each new RECAPDocument is populated with the main "document_number" (the main pacer_doc_id), in this case: 00804759939
So the attachments in the DB look as follows after being merged:
document_number | pacer_doc_id | attachment_number | description |
---|---|---|---|
00804759939 | 00804759939 | 1 | Amicus Brief of Gun Owners of America, Inc., et al. |
00804759939 | 00804759950 | 2 | CovLtrAmBrFiled |
The problem arises when uploading a PDF.
The request to upload a PDF for an appellate attachment document looks like:
upload_type: 3
filepath_local: "the file"
court: ca8
pacer_case_id: 105396
pacer_doc_id: 00804759950
document_number: 814759950
attachment_number: 2
In this case the upload is processed in process_recap_pdf: the RECAPDocument is looked up by the pacer_case_id 105396 and the pacer_doc_id 00804759950, so the PDF is properly uploaded.
But in this same method the document number is updated:
rd.document_number = pq.document_number
In this case it is set to 814759950, which is the document_number received in the request (extracted from the receipt page).
So after the PDF is uploaded, the attachments in the DB look like:
document_number | pacer_doc_id | attachment_number | description |
---|---|---|---|
00804759939 | 00804759939 | 1 | Amicus Brief of Gun Owners of America, Inc., et al. |
814759950 | 00804759950 | 2 | CovLtrAmBrFiled |
Now if the attachment page is uploaded again, merge_attachment_page_data uses this code when merging the attachments:
rd, created = RECAPDocument.objects.update_or_create(
    docket_entry=de,
    document_number=document_number,
    attachment_number=attachment["attachment_number"],
    document_type=RECAPDocument.ATTACHMENT,
)
Here the document_number is the parent document's pacer_doc_id, 00804759939. Since the document_number for attachment 2 was changed to 814759950 by the PDF upload, this RD is not found, so a new attachment RD is created, leading to duplicated attachments.
We could also change the previous code to use the pacer_doc_id instead of the document_number to look for the attachment document:
rd, created = RECAPDocument.objects.update_or_create(
    docket_entry=de,
    pacer_doc_id=attachment["pacer_doc_id"],
    attachment_number=attachment["attachment_number"],
    document_type=RECAPDocument.ATTACHMENT,
)
And this could also solve the issue, but I think we still don't want the document_number for an appellate attachment that uses the pacer_doc_id as its number to be updated on a PDF upload, since attachments in RECAP use the same document_number as the "main" document.
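To make the difference between the two lookups concrete, here is a toy simulation using plain dictionaries. The update_or_create function below is a hypothetical in-memory stand-in that only mimics the match-or-create behavior of the Django call; it is not the ORM, and the rows are the two attachments from the example above after the PDF upload.

```python
def update_or_create(rows, **lookup):
    """Toy stand-in: return (row, created), matching on every lookup field."""
    for row in rows:
        if all(row.get(key) == value for key, value in lookup.items()):
            return row, False
    row = dict(lookup)
    rows.append(row)
    return row, True


def rows_after_pdf_upload():
    """State of the two attachments once the PDF upload changed attachment 2."""
    return [
        {"document_number": "00804759939", "pacer_doc_id": "00804759939", "attachment_number": 1},
        {"document_number": "814759950", "pacer_doc_id": "00804759950", "attachment_number": 2},
    ]


# Current lookup: keyed on the parent's document_number, so attachment 2 is missed.
by_number = rows_after_pdf_upload()
_, created_by_number = update_or_create(
    by_number, document_number="00804759939", attachment_number=2
)

# Alternative lookup: keyed on the attachment's own pacer_doc_id.
by_doc_id = rows_after_pdf_upload()
_, created_by_doc_id = update_or_create(
    by_doc_id, pacer_doc_id="00804759950", attachment_number=2
)

print(created_by_number, len(by_number))  # True 3
print(created_by_doc_id, len(by_doc_id))  # False 2
```

The first lookup creates a third row (the duplicate), while the second finds the existing attachment even though its document_number was overwritten.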
Let me know if you need more details about this issue.
@mlissner I've updated the previous comment with a detailed example.
Thanks for the details; that made a world of difference.
We could also change the previous code to use the pacer_doc_id instead of the document_number to look for the attachment document
That seems fine and helpful to me, sure.
But:
I think we still don't want the document_number for an appellate attachment being updated for a PDF upload,
I think this is the main point. Both the client and the server should be normalizing any numbers that need normalization.
@ERosendo said that:
ProcessingQueue model uses a BigIntegerField to store the document number
That means no leading zeros unless we update that model. We can certainly do that if that's the best solution, but in the meantime, we should decide to:
- Normalize the fourth digit to 1 or 0, whichever is correct, I forget.
- Decide on leading zeros or not, and then make sure we have them (or don't)
Does that all sound right to you guys?
I think we need the leading zeros, as they are important if we need to map/validate doc ID prefixes against court IDs. Also, without leading zeros, normalizing the 4th digit is problematic, since the leading zeros determine which digit sits in the 4th position.
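A short illustration of that point, using the ca8 ID from the example above. The variable names are only for this sketch, and the integer round trip stands in for what any integer column would do to the value.

```python
doc_id = "00814759950"

# Storing the ID as an integer drops the leading zeros.
stored = int(doc_id)
stripped = str(stored)
print(stripped)  # 814759950

# The "fourth digit" is a different digit once the zeros are gone.
print(doc_id[3], stripped[3])  # 1 7

# Normalizing the stripped value therefore rewrites the wrong position.
wrong = stripped[:3] + "0" + stripped[4:]
right = doc_id[:3] + "0" + doc_id[4:]
print(wrong, right)  # 814059950 00804759950

# Padding back only works if the original length is known; 11 is just the
# length of this example ID (the 006014794684 example above has 12 characters).
restored = stripped.zfill(len(doc_id))
print(restored == doc_id)  # True
```

So either the zeros are kept end to end, or the normalization has to happen before they are stripped.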
While I was debugging an issue reported by one of the users of the extension I noticed the attachment documents from docket 23-1457.
Here's an example:
The duplicated document appeared after the extension uploaded the attachment page.
I used the API to compare those documents and found that one entry has "00804758070" as the document_number and the other shows "814758070". The first one seems to be using the normalized pacer_doc_id as its document number. Here are the responses: