Incorrectly identified split pages

johnlabonte commented 1 year ago

When navigating to https://ecf.ca5.uscourts.gov/docs1/00506701831?caseId=213057 and deselecting the other documents so that only the petition is selected, I still receive the error that there are multiple pages, that the file cannot be split, and that it therefore cannot be uploaded. This is incorrect, as I was only accessing one attachment from the group.

2023-07-31 22_33_58-Document

2023-07-31 22_31_19-Download Confirmation

"This document will not be uploaded to the RECAP Archive because the extension has detected that this page may return a combined PDF and consistently splitting these files in a proper manner is not possible for now."

ERosendo commented 1 year ago

Thanks for creating this issue.

The extension adds the warning because clicking the view selected button appears to take you to a page for downloading multiple PDF documents, even when only one file is selected. (The page title indicates this behavior: it includes the phrase "Multiple PDF Documents".)

Here are screenshots of the download page for a single document and the download page for multiple documents:

Single PDF:

Note: This single-pdf download page can be accessed by clicking the document icon next to the number 1.

Multiple PDF:

mlissner commented 1 year ago

This bug feels valid to me. If somebody clicks the "View Documents" button with only one item selected, we should let them upload, right? I think we just need to tweak our detector to make this work a bit better?

ERosendo commented 1 year ago

@mlissner You're right, we should check the number of documents on the page and let users upload files when they're trying to retrieve a single item.
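
A minimal sketch of that check, written in Python for brevity rather than in the extension's own language. The function name and both parameters are hypothetical and are not taken from the RECAP code base; it assumes the caller has already read the page title and counted the documents selected on the page.

```python
def should_warn_combined_pdf(page_title: str, selected_document_count: int) -> bool:
    """Decide whether to show the "combined PDF" warning.

    Hypothetical helper: warn only when the page looks like a
    multi-document download page AND more than one document is selected.
    """
    # The title phrase alone is not enough: it also appears when a
    # single document is selected, which is the bug reported here.
    looks_like_multi_page = "multiple pdf documents" in page_title.lower()
    return looks_like_multi_page and selected_document_count > 1
```

With this shape, a page titled "Multiple PDF Documents" with one selected item would not trigger the warning, while the same page with two or more selected items still would.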

gcklema commented 1 year ago

It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file. So there shouldn't be an error dividing them under those circumstances. Viewing multiple documents, on the other hand, I don't know how PACER and/or the browser would handle that request because that's not how I use PACER. It might well be that selecting "view" multiple results in a singular concatenated .PDF file. Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.

mlissner commented 1 year ago

> It's unclear to me why this error started showing up. Previously, multiple documents selected and at least downloaded were separate .PDF files within the one .ZIP file.

That still works, in fact. Zips work as they always have. Combined docs never have, so the only change is that now we're warning (too much) about it.

> Nevertheless, each docket entry should still have its respective docket number on it (e.g., 30-0 plus attachments to it, like 30-1 and 30-2) together perhaps also with page number for each. Parsing a single document (after scraping and obtaining the corresponding docket numbers from the web page) should be possible so that RECAP can separate one monolithically viewed document into each separate docket filing.

Thanks for the comment. Alas, this isn't as easy as it may seem:

Sometimes docs are filed in multiple cases and accrue multiple headings from the various cases.
Sometimes people have this feature turned off, so we don't get a heading.
These headings vary across courts (b/c of course they do).

It's one of those things we decided was too hard, but we do have a breakthrough over in #347 that should make it possible!

gbronner commented 12 months ago

From what I've seen of the actual merged pages, if you have the index page (or even if you don't), and you have the pdf, the watermark at the top of the page will tell you which subdocument you have and which page of it you are on. So this error message not only shows up on single file downloads, but seems like it could be worked around.
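
A rough sketch of that idea, assuming the header text of each page has already been extracted into a list of strings. The stamp layout matched by `STAMP_RE` is an assumption for illustration only (stamps differ between courts), and `page_ranges` is a hypothetical helper, not part of RECAP.

```python
import re
from itertools import groupby

# Assumed stamp layout, for illustration only; real stamps vary by court.
STAMP_RE = re.compile(r"Document:\s*(?P<doc>\S+)\s+Page:\s*(?P<page>\d+)")


def page_ranges(page_headers):
    """Group consecutive pages of a merged PDF by the document in their stamp.

    Returns (document, first_page_index, last_page_index) tuples with
    zero-based, inclusive indexes. Pages without a recognizable stamp are
    grouped under None so the caller can decide to give up on splitting.
    """
    labels = []
    for text in page_headers:
        match = STAMP_RE.search(text)
        labels.append(match.group("doc") if match else None)

    ranges = []
    index = 0
    for doc, run in groupby(labels):
        length = len(list(run))
        ranges.append((doc, index, index + length - 1))
        index += length
    return ranges
```

Any range labeled `None` marks pages where no stamp was found, which is the case where this approach cannot be trusted.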

mlissner commented 12 months ago

Unfortunately, the watermarks on the PDFs aren't reliable. Some users disable them, and others upload documents that they re-purposed from other cases without removing the watermark. The result is that a watermark is usually fine, but can be missing, wrong, or duplicated.

The solution is over here though: https://github.com/freelawproject/recap/issues/347

gbronner commented 12 months ago

Here's an example of a watermark I downloaded a couple of days ago.

Is there some reason not to try to read the merged files, look for the watermarks, and split them? Seems like the benefit of getting it right exceeds the cost of trying and throwing it out.

Do we have an example of a mis-identified watermark?

gcklema commented 12 months ago

I've seen filings without a clerk/docket stamp. I have also seen filings with two such stamps--but it seems to me that the later-in-time one should govern as a rule since it's seemingly impossible to have a future stamp on a current filing, but not unheard of to re-file old, previously-stamped documents.

mlissner commented 12 months ago

I don't have an example, but I still think the better and easier solution is https://github.com/freelawproject/recap/issues/347. Good point about selecting the later date when encountering duplicates.
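
The "later date wins" rule for duplicate stamps could be sketched as follows. This assumes each stamp has already been parsed into a `(document_number, filed_date)` pair with dates written as MM/DD/YYYY; that date format and the name `pick_governing_stamp` are assumptions for illustration.

```python
from datetime import datetime


def pick_governing_stamp(stamps):
    """Return the stamp with the latest filing date, or None if there are none.

    `stamps` is a list of (document_number, filed_date) pairs, where
    filed_date is an MM/DD/YYYY string (an assumed format).
    """
    if not stamps:
        return None
    # A re-filed document can carry an older stamp, but never a future
    # one, so the most recent date is treated as the governing stamp.
    return max(stamps, key=lambda stamp: datetime.strptime(stamp[1], "%m/%d/%Y"))
```

A page with no stamp at all still returns `None`, so this rule only helps with the duplicated-stamp case, not the missing-stamp one.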

mlissner commented 2 weeks ago

@ERosendo, can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?

ERosendo commented 2 weeks ago

> can you please give me a size estimate for analyzing the remaining task here (I haven't read through it in a year or so)?

We can likely close this issue once pull request #402 is merged. The original problem was that the extension incorrectly labeled single-document download pages as "Multiple PDF" and displayed an unnecessary warning. This happened because we were relying on page elements rather than counting the documents. PR #402 addresses this by refining the logic we're using to identify these pages and avoid the warning.

@mlissner There's a separate issue about uploading combined PDFs that was discussed earlier. Are you referring to that issue when you mention the 'remaining task here'?

mlissner commented 2 weeks ago

Sounds great. I don't know what I was referring to, so I think we're OK here. I've put this on the current sprint so it can get wrapped up as part of this one and so it's not on your old board.

freelawproject / recap

Incorrectly identified split pages #349