Open johnlabonte opened 1 year ago
Thanks for creating this issue.
The extension adds the warning because clicking the view selected button appears to take you to a page for downloading multiple PDF documents (the page title includes the phrase "Multiple PDF Documents"), even when only one file is selected.
Here are screenshots of the download page for a single document and the download page for multiple documents:
Note: This single-PDF download page can be accessed by clicking the document icon next to the number 1.
This bug feels valid to me. If somebody clicks the "View Documents" button with only one item selected, we should let them upload, right? I think we just need to tweak our detector to make this work a bit better?
@mlissner You're right, we should check the number of documents on the page and let users upload files when they're trying to retrieve a single item.
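A rough sketch of that check, written in Python purely for illustration (the extension itself runs in the browser and would read the checkbox state from the live page). `DocumentRow` and `should_warn_combined_pdf` are hypothetical names, not taken from the extension's code, and the idea of one selectable row per document is an assumption about the page:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DocumentRow:
    """Hypothetical snapshot of one selectable row on the download page."""
    doc_number: str   # e.g. "30" or "30-1"
    selected: bool    # whether the row's checkbox is currently ticked


def should_warn_combined_pdf(rows: List[DocumentRow]) -> bool:
    """Warn only when more than one document is selected.

    Assumption: a single selected document yields an ordinary PDF that
    can be uploaded, regardless of what the page title says.
    """
    selected_count = sum(1 for row in rows if row.selected)
    return selected_count > 1
```

The point of the sketch is that the decision depends on the count of selected items rather than on the page title, which says "Multiple PDF Documents" in both cases.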
It's unclear to me why this error started showing up. Previously, when multiple documents were selected and downloaded, they arrived as separate .PDF files within a single .ZIP file, so there shouldn't be a problem dividing them under those circumstances. Viewing multiple documents, on the other hand, is not how I use PACER, so I don't know how PACER and/or the browser would handle that request. It might well be that selecting "view" for multiple documents results in a single concatenated .PDF file. Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.
It's unclear to me why this error started showing up. Previously, when multiple documents were selected and downloaded, they arrived as separate .PDF files within a single .ZIP file.
That still works, in fact. Zips work as they always have. Combined docs never have, so the only change is that now we're warning (too much) about it.
Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.
Thanks for the comment. Alas, this isn't as easy as it may seem:
It's one of those things we decided was too hard, but we do have a breakthrough over in #347 that should make it possible!
From what I've seen of the actual merged pages, if you have the index page (or even if you don't), and you have the pdf, the watermark at the top of the page will tell you which subdocument you have and which page of it you are on. So this error message not only shows up on single file downloads, but seems like it could be worked around.
Unfortunately, the watermarks on the PDFs aren't reliable. Some users disable them, and others upload documents that they re-purposed from other cases without removing the watermark. The result is that a watermark is usually fine, but can be missing, wrong, or duplicated.
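To make the failure modes concrete, here is a minimal sketch of watermark-based grouping. It assumes the text of each page has already been extracted and that headers look like `Document 30-1 Filed 12/04/23 Page 2 of 5`; that layout is an assumption and varies by court. `group_pages_by_watermark` is a hypothetical helper, not existing RECAP code:

```python
import re
from typing import Dict, List, Optional

# Assumed header layout; real PACER headers differ between courts.
HEADER_RE = re.compile(
    r"Document\s+(?P<doc>\d+(?:-\d+)?)\s+Filed\s+\d{2}/\d{2}/\d{2,4}"
    r"\s+Page\s+(?P<page>\d+)\s+of\s+(?P<total>\d+)"
)


def group_pages_by_watermark(page_texts: List[str]) -> Optional[Dict[str, List[int]]]:
    """Map each document number to the indexes of its pages.

    Returns None as soon as a page has no watermark or more than one,
    because the split cannot be trusted in either case.
    """
    groups: Dict[str, List[int]] = {}
    for index, text in enumerate(page_texts):
        matches = list(HEADER_RE.finditer(text))
        if len(matches) != 1:
            return None  # missing or duplicated watermark
        groups.setdefault(matches[0].group("doc"), []).append(index)
    return groups
```

Note what this cannot catch: a page carrying exactly one watermark that is left over from another case looks valid here, which is the "wrong" case described above.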
The solution is over here though: https://github.com/freelawproject/recap/issues/347
Here's an example of a watermark I downloaded a couple of days ago.
Is there some reason not to try to read the merged files, look for the watermarks, and split them? Seems like the benefit of getting it right exceeds the cost of trying and throwing it out.
Do we have an example of a mis-identified watermark?
I've seen filings without a clerk/docket stamp. I have also seen filings with two such stamps--but it seems to me that the later-in-time one should govern as a rule since it's seemingly impossible to have a future stamp on a current filing, but not unheard of to re-file old, previously-stamped documents.
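The "later stamp governs" rule could be sketched roughly as follows. The header layout (`Document 30-2 Filed 12/04/23 Page 1 of 3`) and the two-digit-year date format are assumptions, and `governing_document_number` is a hypothetical name:

```python
import re
from datetime import datetime
from typing import Optional

# Assumed stamp layout with a two-digit year; real stamps vary by court.
STAMP_RE = re.compile(
    r"Document\s+(?P<doc>\d+(?:-\d+)?)\s+Filed\s+(?P<filed>\d{2}/\d{2}/\d{2})"
    r"\s+Page\s+\d+\s+of\s+\d+"
)


def governing_document_number(page_text: str) -> Optional[str]:
    """Return the document number from the most recently dated stamp.

    Returns None when the page carries no stamp at all.
    """
    stamps = [
        (datetime.strptime(m.group("filed"), "%m/%d/%y"), m.group("doc"))
        for m in STAMP_RE.finditer(page_text)
    ]
    if not stamps:
        return None
    # Pick the stamp with the latest filing date.
    return max(stamps, key=lambda stamp: stamp[0])[1]
```

This only resolves the duplicate case. Pages with no stamp still need some other source of truth.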
I don't have an example, but I still think the better and easier solution is https://github.com/freelawproject/recap/issues/347. Good point about selecting the later date when encountering duplicates.
@ERosendo, can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?
We can likely close this issue once pull request #402 is merged. The original problem was that the extension incorrectly labeled single-document download pages as "Multiple PDF" and displayed an unnecessary warning. This happened because we were relying on page elements rather than counting the documents. PR #402 addresses this by refining the logic we're using to identify these pages and avoid the warning.
@mlissner There's a separate issue about uploading combined PDFs that was discussed earlier. Are you referring to that issue when you mention the 'remaining task here'?
Sounds great. I don't know what I was referring to, so I think we're OK here. I've put this on the current sprint so it can get wrapped up as part of this one and so it's not on your old board.
When navigating to https://ecf.ca5.uscourts.gov/docs1/00506701831?caseId=213057 and narrowing the selection from multiple documents to only the petition, I still receive the error that there are multiple pages that cannot be split and therefore cannot be uploaded. This is incorrect, as I was only accessing one attachment from the group.
"This document will not be uploaded to the RECAP Archive because the extension has detected that this page may return a combined PDF and consistently splitting these files in a proper manner is not possible for now."