Improve alt text by providing document text

mlissner commented 1 year ago

It'd be nice if our alt text were better. Currently it just says, "Thumbnail of page X of the PDF".

First, we could make that better just by saying, "Thumbnail of page X of the PDF linked above."

That makes it more clear that clicking the link will get you the text.

But second, we have the text in our database. Could we do better by including it in the post? I think the answer is yes, and I think the way to do it is to ask the recap-document API for the text and use it if it's available, skipping it if not (implying that OCR is still going, which we shouldn't wait for).
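A minimal sketch of the "use it if it's available, skip it if not" decision, assuming the API response has already been fetched and parsed into a dict. The `plain_text` key is the field name that comes up later in this thread; treating an empty or missing value as "OCR not finished" is an assumption, and both function names are hypothetical.

```python
from typing import Optional


def usable_plain_text(recap_document: dict) -> Optional[str]:
    """Return the document text if the response carries any, else None."""
    # Assumption: a missing, null, or whitespace-only "plain_text" value
    # means the text is not available yet (for example, OCR is still running).
    text = (recap_document.get("plain_text") or "").strip()
    return text or None


def alt_text_for_page(page_number: int, recap_document: dict) -> str:
    """Use the document text when present; otherwise keep the generic alt text."""
    text = usable_plain_text(recap_document)
    if text is None:
        return f"Thumbnail of page {page_number} of the PDF linked above."
    return text
```

The fallback never waits or retries, which matches the idea of not blocking the post on OCR.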

When we get the text from the API, we won't know what text came from which page, so we'll just have to dump as much of it as we can along with each thumbnail, possibly with some explanatory text:

The first XXX characters of the PDF: "xxxx…"

Then on the second thumbnail:

The second XXX characters of the PDF: "…yyy…"

Then on the third thumbnail:

The third XXX characters of the PDF: "…zzz…"

(Of course, this should break at word boundaries, not in the middle of words.)
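The chunking described above could be sketched roughly like this. It is only an illustration: the function names are hypothetical, the chunk size is a parameter rather than a known platform limit, and it collapses all whitespace as a side effect of splitting into words.

```python
from typing import List


def chunk_at_word_boundaries(text: str, size: int, count: int) -> List[str]:
    """Split text into at most `count` chunks of at most `size` characters,
    breaking only between words."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= size:
            current = candidate
            continue
        # The current chunk is full: store it and start a new one.
        if current:
            chunks.append(current)
        if len(chunks) == count:
            return chunks
        # A single word longer than `size` is cut short (the rest is dropped).
        current = word[:size]
    if current and len(chunks) < count:
        chunks.append(current)
    return chunks


def label_chunks(chunks: List[str], size: int) -> List[str]:
    """Wrap each chunk in the explanatory text proposed above."""
    ordinals = ["first", "second", "third", "fourth"]
    return [
        f'The {ordinal} {size} characters of the PDF: "{chunk}"'
        for ordinal, chunk in zip(ordinals, chunks)
    ]
```

For example, `chunk_at_word_boundaries("the quick brown fox jumps", 10, 2)` gives `["the quick", "brown fox"]`: one chunk per thumbnail, with the remainder left out.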

The alt-text bot goes further and even will do an image that just says "ALT TEXT" with additional text on it, but I think we can stop short of that (we're not providing all the pages, after all):

https://twitter.com/AltTextUtil/status/1653058214362238976

One open question is whether we'll want to pre-process the text to remove whitespace. I think we probably will, and I wonder if we'll want to go even further to remove dumb punctuation at the beginning too. Like, maybe all we want are the words? This will take some experimentation, I think.

mlissner commented 1 year ago

So a couple quick ways we could do better than just using whatever text we get back from pdftotext:

Yes, we should remove whitespace.
We should remove the PACER page headers. I think we have code for this somewhere.
We could try to identify paragraphs and do clever things. Maybe if it starts with more than 5-10 whitespace characters, we ignore the line? This is probably fraught...

Anyway, just some thoughts for trying to do this well.
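A rough sketch of the first two clean-up ideas, plus dropping leading punctuation. The header pattern is an assumption based on one common PACER header shape ("Case ... Document ... Filed ... Page X of Y"); real headers vary by court, so the regex and the `clean_text` name are illustrative only.

```python
import re

# Assumed header shape, e.g.:
# "Case 1:23-cv-00001-ABC Document 52 Filed 05/01/23 Page 1 of 10"
PACER_HEADER = re.compile(
    r"Case\s+\S+\s+Document\s+\S+\s+Filed\s+\d{1,2}/\d{1,2}/\d{2,4}"
    r"\s+Page\s+\d+\s+of\s+\d+"
)


def clean_text(raw: str) -> str:
    """Remove assumed PACER headers, collapse whitespace, and drop
    punctuation at the very start of the text."""
    without_headers = PACER_HEADER.sub(" ", raw)
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, tabs, and form feeds.
    collapsed = " ".join(without_headers.split())
    # Strip leading non-word characters such as a stray ")" or "-".
    return re.sub(r"^\W+", "", collapsed)
```

Headers that do not match the assumed pattern pass through untouched, so this would need checking against real documents.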

mlissner commented 1 year ago

This would be useful for our blind users, I think, but I'm going to move it to our volunteer backlog. Having to click through to the PDF itself or to our website is a pretty normal thing to have to do whether you're blind or not, and this is a good fit for a volunteer in terms of complexity.

weedySeaDragon commented 1 year ago

I'd like to work on this.

I'm a long time software engineer (since the 80s!) but my python tooling is rusty.

A little background: My main languages have been OO, starting with Smalltalk-80, most recently Ruby; also Python, Scala, and many more. Am a big proponent of SOLID & Patterns. (Following Kent Beck since the 90s.) Done a huge amount of analysis & design, and project management. Have done lots of documentation and presentations (C-Level, developers, end-users).

I have this up and running locally in PyCharm and have done the simple, minimal change ("Thumbnail of page X of the PDF linked above."). I did that in order to get this running locally and to initially explore the code.

I think I've uncovered some inconsistencies with ARCHITECTURE.md but need to discuss my setup to be sure.
- What is the best way to talk about the problems I have with my setup? Email? Some group-chat or chat tool? Can definitely do it here if that's best.
Once my setup is correct, I can write the simple tests and submit a PR.
Then I have some thoughts about what the alt-text should and shouldn't contain when using the full(-er) text from the PDF.

Thanks!

mlissner commented 1 year ago

Hi, thanks for taking this on! I'm really glad to have help with it.

A few responses:

I have this up and running locally in PyCharm and have done the simple, minimal change

That's great. I really like small initial PR's. If you want to submit that alone, @ERosendo (the lead dev for this project) can give it a review.

I think I've uncovered some inconsistencies with ARCHITECTURE.md but need to discuss my setup to be sure.

What is the best way to talk about the problems I have with my setup? Email? Some group-chat or chat tool?

We have a Slack group I can invite you to, but why don't we keep with async:

For the architecture problems, you can probably just open an issue. I'm not surprised if it's out of date.
For the setup issues, how about using the Discussions tab here in GitHub?

Once my setup is correct, I can write the simple tests and submit a PR.

Great!

Then I have some thoughts about what the alt-text should and shouldn't contain when using the full(-er) text from the PDF.

If you want to get those thoughts going here now, feel free, but your sequencing sounds great too!

mlissner commented 1 year ago

Just FYI, I just found your account in CL and gave you access to some more of the API. @ERosendo pointed out that you'll probably need the access to complete this issue. :)

weedySeaDragon commented 1 year ago

I'd like to change the alt text for an image. When describing an image, screen readers will already announce that it's an image because of the <img> tag, so there's no need to also say that something is an image in the alt text. So I'd like to change the alt text from "An image of the entry's full text: {text_image_description}" to "The entry's text: {text_image_description}". (I should have caught this initially.)

That also removes "full" so that the alt text is a wee bit shorter.

mlissner commented 1 year ago

Sure, why not! :)

weedySeaDragon commented 1 year ago

So now I want to understand more about the next steps for alt text for PDFs. (Obviously I'm still learning a bit about the domain (legal stuff) & terms, and the systems & codebase.)

From above:

We should remove the PACER page headers. I think we have code for this somewhere.

By this do you mean that we should ignore the header info line that has the docket number, etc.? Ex: the blue text in the image below:

Removing the PACER page header

I've poked around the doctor code and the function that gets the document number from the header: get_document_number_from_pdf(). Looks like we could copy parts of the code (regex, etc) to deal with the header.

@mlissner: Is that the area of code you were thinking of that works the PACER page headers?

add plain_text to the fields fetched in lookup_document_by_doc_id

We'll need to get the plain_text for the document info returned in lookup_document_by_doc_id, yes?

Once we have the plain text:
1. get the text for the first p pages (whatever the number of thumbnails is; e.g. 4),
2. For each page: clean it up by removing the PACER headers, extra whitespace then return the first X chars, breaking at word boundaries
Is that right?

johnhawkinson commented 1 year ago

(just taking some of the easy ones)

2. We should remove the PACER page headers. I think we have code for this somewhere.

By this do you mean that we should ignore the header info line that has the docket number, etc.? Ex: the blue text in the image below:

Yes, that text, which is not always blue and not always at the top of the page. It is substantially duplicative of the metadata that the bot puts in the text of the tweet (document number; description or type of description (depending); and the date is nominally implied, except when it's not).

… …

Once we have the plain text: i. get the text for the first p pages (whatever the number of thumbnails is; e.g. 4), ii. For each page: clean it up by removing the PACER headers, extra whitespace then return the first X chars, breaking at word boundaries

Is that right?

It sounds like it, but handling the first page of a document is going to be tricky, because there are multiple frames of text that may not come out quite right sequentially (although looking at, e.g., https://www.courtlistener.com/docket/67271062/52/walt-disney-parks-and-resorts-us-inc-v-desantis/ as my test case, it seems mostly ok), and there's a lot of case metadata that should probably not appear in the alt text. The case caption, for instance. If it's a West Coast style court (e.g. California), the names and addresses of the filing attorneys appear above the caption block on the front page. What really should be in the alt text is the first full paragraph of the filing, and perhaps subsequent paragraphs.

I guess there are a lot of options as to how ambitious you want to be! It might be good to take a few examples of cases/documents/tweets and walk through them to discuss what the goals are.

weedySeaDragon commented 1 year ago

I 100% agree that we (perhaps that's me) need to get some examples together.

I agree that the text for the first page may be tricky, but I don't think that any text -- beyond that PACER header -- should be omitted. Sighted users will the filing attorneys, etc, and so should vision-impaired users.

It's tricky, I know, because we have to limit the size of the alt-text so it's reasonable (alt-text should be succinct), but hopefully provide enough information so that the content and context are clear. Don't people -- visually impaired or not -- assume that a lot of the first page is taken up by 'case metadata'? (I honestly don't know.) If so, having some of that as the alt-text for the first page makes sense.

Ultimately we can make some guesses, but having some visually impaired legal folks to give us feedback would be great. (For all I know, one or all of you fall into that group. :-) )

mlissner commented 1 year ago

@mlissner: Is that the area of code you were thinking of that works the PACER page headers?

Yeah, it's not as helpful as I'd hoped, but I guess it's a start.

We'll need to get the plain_text for the document info returned in lookup_document_by_doc_id, yes?

Yes, that should do it.

get the text for the first p pages (whatever the number of thumbnails is; e.g. 4),

I'm not sure it's worth trying to figure out page breaks. That's a pretty difficult or at least annoying task. I'd say just load up the alt text with as much as it can handle. So, if page 1 actually has 500 chars, but Twitter allows 1,000, just put the first 1,000 chars with thumbnail 1, and chars 1,001-2,000 with thumbnail 2, etc., ignoring which page goes with which text.

I actually think sighted people would even find this useful and it loads the tweet to the max with the small downside that the pagination isn't spot on (who cares?).

John says:

What really should be in the alt text is the first full paragraph of the filing, and perhaps subsequent paragraphs.

@weedySeaDragon replies:

I don't think that any text -- beyond that PACER header -- should be omitted. Sighted users will the filing attorneys, etc, and so should vision-impaired users.

I'm pretty sure I'm with John on this. Most legal docs start out pretty much the same way, listing a bunch of junk nobody really reads. It's easy to skip when you're sighted, but it'd get pretty old to have to read it in the alt text, if we can help it. For example, some docs use a stack of ) as a vertical line. Ugh, that'd be annoying in alt text.

Don't people -- visually impaired or not -- assume that a lot of the first page is taken up by 'case metadata'?

I certainly do, but if we load up the alt text as I described above, then the alt text on thumbnail one would be the place to start reading.

I agree about finding some visually impaired folks. I'll post this thread and see if anybody replies.

johnhawkinson commented 1 year ago

Oops, I typed a long reply an hour ago and failed to hit the Comment button. This is slightly redundant with Mike's comments, but not entirely.

I agree that the text for the first page may be tricky, but I don't think that any text -- beyond that PACER header -- should be omitted. Sighted users will the filing attorneys, etc, and so should vision-impaired users.

Well, the rationale for omitting the case caption is the same as the header — we already summarize that in the tweet text. And captions on documents can be really verbose, like when they list 20 defendants. But approximately nobody cares about that, they care what the text says. (And some courts have rules or practices that prohibit using "et al." on some kinds of filings.)

Also, sometimes, there will be a clerk's filestamp/timestamp/datestamp which is both hard to read (because it is stamped in ink and not laser printed) and mostly irrelevant. If you really care about the exact date the piece of paper was turned over across the counter, you're probably not going to be looking at the alt text.

Sighted users will the filing attorneys, etc, and so should vision-impaired users.

I think you dropped a word here and I'm not sure what it was. Generally sighted users don't care about the filing attorneys, and certainly not their office addresses. To the extent they do, it's a lot less relevant than the text of the motion, and it is available in other places (e.g. the Parties tab on Courtlistener). Also it doesn't tend to change from filing to filing, so it's not what Big Cases is about, which is breaking news / market-moving information. (I exaggerate, but only a little).

For instance, take https://www.courtlistener.com/docket/6639860/458/in-re-macbook-keyboard-litigation/ where the counsel name/addresses push the first paragraph onto page 2. Basically nobody cares about that info (and the courts probably regret their local practice on this, which predated electronic filing…). And that's not even a worst case. How about https://www.courtlistener.com/docket/7067512/1139/in-re-facebook-inc-consumer-privacy-user-profile-litigation/? There paragraph 1 doesn't even appear in the first four pages because the table of contents eats it. Perhaps that suggests if we are parsing the text we might change which pages we show? And the first page has … little of value, although it does have the Title. Or maybe https://www.courtlistener.com/docket/17084894/1/zepeda-rivas-v-jennings/ is another perverse example?

Ultimately we can make some guesses, but having some visually impaired legal folks to give us feedback would be great. (For all I know, one or all of you fall into that group. :-) )

Well, the point of the bot is to let people know about breaking developments in cases. The metadata on the first page is generally always the same from document to document within a given case, with the exception of the date and the title of the filing.

weedySeaDragon commented 1 year ago

Excellent educational info for me. :-) Thanks. I now get that no one -- no matter their visual capabilities -- wants to read or hear the case metadata on the first page. (I do enjoy learning all of this. Really.) So I'll need to figure out how to skip the various forms of that. And I'll go with your "put the first 1000 chars into the 1st thumbnail, the next 1000 into the next thumbnail, and so on" approach and not worry about page breaks.

And thanks for the examples, @johnhawkinson . Those are great places for me to start. If either of you thinks of any other examples that are either typical or edge cases, that'd be helpful.

weedySeaDragon commented 1 year ago

Just FYI -- chime in if you have thoughts, but I'm not expecting replies.

I've been thinking ("percolating" is actually how I describe this specific activity) about how to handle ignoring the "meta case info" that can span the first x.y pages (1.5 pages, 0.8 pages, etc. is what I mean).

I was thinking about how I recognize where to start reading; how do I know what to skip over? I clue in on where I see the first text paragraph. (I think someone mentioned this already.) So now I'm playing around with that -- how can the system recognize a text paragraph? I'm starting with some really simple assumptions: (1) the first line is indented more than 5 spaces, and words are separated by either one or two spaces; and (2) the next line is either the start of another paragraph (= another indented line), or an un-indented line of text. These assumptions ultimately may not work, but it's a starting place.

If we can effectively recognize the first text paragraph, we can just skip over the "meta case info" no matter how long it is.
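Those two assumptions could be sketched as below. This is only an illustration of the heuristic as stated, not tested against real filings. All names are hypothetical, and "words separated by one or two spaces" is implemented as "no gap of three or more spaces inside the line".

```python
import re
from typing import List, Optional

INDENTED = re.compile(r"^ {6,}\S")   # more than 5 leading spaces
WIDE_GAP = re.compile(r"\S {3,}\S")  # 3+ spaces between words: probably columns


def is_prose(line: str) -> bool:
    """True for a non-empty line whose words are at most two spaces apart."""
    stripped = line.strip()
    return bool(stripped) and not WIDE_GAP.search(stripped)


def looks_like_paragraph_start(line: str, next_line: str) -> bool:
    """Assumption (1): indented prose line. Assumption (2): the next line is
    either another indented line or un-indented text."""
    if not INDENTED.match(line) or not is_prose(line):
        return False
    starts_at_margin = bool(next_line) and not next_line[0].isspace()
    return is_prose(next_line) and (starts_at_margin or bool(INDENTED.match(next_line)))


def first_paragraph_index(lines: List[str]) -> Optional[int]:
    """Index of the first line that looks like a paragraph start, if any."""
    for i in range(len(lines) - 1):
        if looks_like_paragraph_start(lines[i], lines[i + 1]):
            return i
    return None
```

Centered headings (for example an indented court name followed by another indented line) would also pass this check, so it is likely to need more rules once it meets real documents.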

Also, the PDF to text conversion gives us \f -- the formfeed character -- at the end of each page. So if we have those in the plain text, it's easy to keep the alt-text specific to the actual page that the thumbnail shows.
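If the form feeds do survive in the stored plain text (an assumption worth checking against real data), splitting per page could be as small as this sketch. The function names are hypothetical.

```python
from typing import List


def split_pages(plain_text: str) -> List[str]:
    """Split extracted text into pages on the form feed character."""
    pages = plain_text.split("\f")
    # A form feed at the end of the last page leaves an empty trailing
    # element after splitting, so drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def text_for_thumbnails(plain_text: str, thumbnail_count: int) -> List[str]:
    """Return the text of the first `thumbnail_count` pages, one per thumbnail."""
    return split_pages(plain_text)[:thumbnail_count]
```

Text with no form feeds at all comes back as a single "page", which would be a signal to fall back to plain character-count chunks.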

mlissner commented 1 year ago

Yeah, this will be the challenge for sure. If you're game, I'd suggest downloading the last 250 docs from the bot and building them up into a sample set that you can test against until your heuristics are working.

mlissner commented 1 year ago

I reached out to a vision-impaired friend with this question. He replies (I bolded a few things):

the general rule of thumb for me is striving toward maximum parity to the extent practicable. In other words, as a blind person engaging with this content, I want the precise same level of substantive information that a sighted person would get from the content in question. If the point is to share the full, unabridged text so that a sighted person would be reading that, then the image caption should strive for that same level of information. Obviously, there are constraints related to space available in an image caption so I mean it when I say “to the extent practicable.”

I'm not so sure this really helps, but my takeaway is that doing it as best you can is what you hope for, and that you want the substantive content.

weedySeaDragon commented 1 year ago

Excellent feedback. Info from actual users -- or people in the same group -- is always great.

And yup -- I'll create that data set. (Just the kind of geeky challenge I like.)

mlissner commented 1 year ago

I'm realizing we lost momentum here. Anything we can do on our end that'd help you pick it up again, @weedySeaDragon?

weedySeaDragon commented 1 year ago

It's me. I'm the problem.
(Sorry. Taylor Swift was just in town and it seems everything and everyone has been using that.) I've been noodling with it. I need to write up my status and questions and post them here. Probably later this week. (Want to add the rest of the data to the bootstrap-dev #273 and get that finished.)

freelawproject / bigcases2

Improve alt text by providing document text #230