freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Broken scraper: opinions.united_states.state.texag (NO LONGER PUBLISHING) #262

Closed (arderyp closed 1 year ago)

arderyp commented 5 years ago

A message was posted some time ago to the opinions page indicating that AG opinions must be requested and will no longer be published publicly. Consequently, we should either call the court to see if we can get some sort of data stream, or kill this scraper.

https://www.texasattorneygeneral.gov/attorney-general-opinions

mlissner commented 5 years ago

Cool. I'll give them a call. Going to be a tough one, but may as well try.

arderyp commented 5 years ago

I believe I did email them at one point and they basically said "no"

ix4 commented 5 years ago

I'm a novice but I think this may be what you're looking for: https://www2.texasattorneygeneral.gov/opinion/index-to-opinions

arderyp commented 5 years ago

thanks for the link @ix4

Oof, that's a pretty gross interface. Not sure how we'd even approach scraping that, or if all the metadata is present. Want to have a look @mlissner?

ix4 commented 5 years ago

So ugly. It's almost as if it's on purpose. I was thinking that, after an initial scrape, using their email notification subscription (maybe with a kill-the-newsletter.com type feed) could work nicely, since it'll come in structured (subject, date, attachment...).

mlissner commented 5 years ago

kill-the-newsletter.com looks neat, but I'd rather not rely on it. I think this isn't so hard to scrape really.

Some thoughts:

johnhawkinson commented 5 years ago

"Not sure how we'd even approach scraping that,"

I feel a little confused at what the question is. Although this interface seems completely unacceptable for use by humans, it doesn't seem hard to scrape? You just pull links out of these horrid option dropdowns:

[Image: Screen Shot 2019-06-22 at 07 39 26]
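
A minimal sketch of pulling links out of option dropdowns, using only the Python standard library. The sample markup, the example.org URLs, and the names `OptionLinkParser` and `extract_option_links` are all hypothetical; the real page's HTML has not been checked here, so treat this as an illustration of the technique rather than a working scraper for the Texas AG site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OptionLinkParser(HTMLParser):
    """Collect (label, absolute_url) pairs from <option value="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # value of the <option> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._flush()  # </option> is optional in HTML, so close any open one
            value = dict(attrs).get("value")
            self._href = value if value else None  # skip empty "Select..." options
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("option", "select"):
            self._flush()

    def _flush(self):
        if self._href is not None:
            label = "".join(self._text).strip()
            self.links.append((label, urljoin(self.base_url, self._href)))
        self._href = None
        self._text = []


def extract_option_links(html, base_url):
    parser = OptionLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup, only loosely modeled on a dropdown of opinion links.
SAMPLE_HTML = """
<select name="opinions-2019">
  <option value="">Select an opinion</option>
  <option value="/opinions/example-0001.pdf">EX-0001</option>
  <option value="/opinions/example-0002.pdf">EX-0002
</select>
"""

links = extract_option_links(SAMPLE_HTML, "https://example.org/")
for label, url in links:
    print(label, url)
```

Note that this yields only a label and a URL per option, with no title or date.
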
arderyp commented 5 years ago

@johnhawkinson, I’m on the road and reviewed the site on my phone. After about 20 seconds of looking at those forms it seemed pretty funky, but I admittedly didn’t examine it very thoroughly. I see the listings now in the dropdowns. Is the pertinent metadata accompanying each dropdown option, or would we have to scrape the actual opinion PDFs? On mobile I don’t immediately see titles and dates, for example. I guess we can use the “estimated” date logic, but titles?

Sounds like a good idea, @mlissner, to contact the court before starting the hacking. If we have to go the latter route, it sounds like @johnhawkinson can whip something together quickly to get what we need.

johnhawkinson commented 5 years ago

Err…I just don't understand what you meant by "not sure how we'd even approach scraping."

Did you mean about metadata? I was just referring to getting the PDFs. Yeah, there's no metadata on the dropdown page. Though the subject index Mike points out does have something, so perhaps that's the better way to go, although it's quite a few pages to scrape.

[Image: Screen Shot 2019-06-23 at 17 41 23]

mlissner commented 5 years ago

Yeah, @johnhawkinson, a few things we always need for opinions are the date, the title (this is the case name, usually), and a few other bits of metadata. A link to just a PDF doesn't really work for us very well, though we've hacked our way through it in the past by using something like "Unknown title" for the case name.

I think we're all agreed dropdowns are horrid but workable.
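
For illustration, a minimal sketch of the placeholder approach combined with a year-only date. `OpinionRecord`, `build_record`, and the choice of July 1 as the stand-in day are assumptions made up for this example; they are not Juriscraper's API or its actual estimated-date logic.

```python
from dataclasses import dataclass
from datetime import date

PLACEHOLDER_TITLE = "Unknown title"


@dataclass
class OpinionRecord:
    url: str
    docket: str
    title: str
    date_filed: date
    date_is_approximate: bool


def build_record(url, docket, year, title=None, exact_date=None):
    """Build a record, falling back to a placeholder title and an estimated date."""
    if exact_date is not None:
        filed, approximate = exact_date, False
    else:
        # Assumption: with only a year known, use mid-year and flag it as approximate.
        filed, approximate = date(year, 7, 1), True
    return OpinionRecord(
        url=url,
        docket=docket,
        title=title or PLACEHOLDER_TITLE,
        date_filed=filed,
        date_is_approximate=approximate,
    )


# Hypothetical values: a dropdown entry that gives only a year, docket, and URL.
record = build_record("https://example.org/opinions/example-0001.pdf", "EX-0001", 2019)
print(record)
```

Keeping the approximate flag next to the date lets downstream code tell a real filing date from a stand-in.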

mlissner commented 5 years ago

@arderyp, do you think you could write them an email? Should we get you a free.law email address so you're more official?

arderyp commented 5 years ago

I’m slightly confused @mlissner. You think the dropdowns are worth scraping without titles and with estimated (year only) dates?

I’m happy to send an email with or without a FLP address. I’ve had decent success without, so far. Am I just asking if they have an alternate, more easily scrapeable data source (that includes title and other metadata)?

I am out of town for the next two weeks, but happy to look into this when I get back.

arderyp commented 5 years ago

The subject pages have titles (are those titles?) but I don’t see dates.

Seems like we could scrape year+docket+url from the dropdowns, or title+docket+url from subject pages, but nowhere can we get all 4 (date+title+docket+url)

johnhawkinson commented 5 years ago

Scraping both just doesn't seem horrible. Suboptimal, yes, but easily doable.

mlissner commented 5 years ago

@arderyp, let's start with an email and see if we can get them to make a better page. I'd frame it more as a "Can we work with you on this process" type of email where we want to get a dialog going.

If it comes to it, maybe it's a good idea to gather titles from that other page, yes. We could do it using a deferring list, I think. Maybe. I wouldn't be opposed to going off the usual Juriscraper template if that's what it took to scrape/merge these two pages.
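
A rough sketch of what merging the two pages could look like, assuming each page has already been scraped into a list of dicts keyed by docket number. The function name `merge_by_docket`, the field names, and the sample rows are hypothetical, and this does not use the deferring list or any other Juriscraper class.

```python
def merge_by_docket(dropdown_rows, subject_rows, placeholder="Unknown title"):
    """Join year+docket+url rows with title+docket+url rows on the docket number."""
    titles = {row["docket"]: row["title"] for row in subject_rows}
    merged = []
    for row in dropdown_rows:
        merged.append(
            {
                "docket": row["docket"],
                "url": row["url"],
                "year": row["year"],
                # Fall back to a placeholder when the subject pages lack this docket.
                "title": titles.get(row["docket"], placeholder),
            }
        )
    return merged


# Hypothetical rows standing in for the two scraped pages.
dropdown_rows = [
    {"docket": "EX-0001", "url": "https://example.org/ex-0001.pdf", "year": 2019},
    {"docket": "EX-0002", "url": "https://example.org/ex-0002.pdf", "year": 2019},
]
subject_rows = [
    {"docket": "EX-0001", "url": "https://example.org/ex-0001.pdf", "title": "Example subject"},
]

merged = merge_by_docket(dropdown_rows, subject_rows)
for row in merged:
    print(row)
```

The dropdown rows drive the loop, so every PDF link is kept even when no title is found for it. Dates would still be year-only.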

arderyp commented 5 years ago

@mlissner, okay.

arderyp commented 5 years ago

@mlissner I contacted the court through this web form: https://www.texasattorneygeneral.gov/contact-us-online-form

If you know of some other better way, please let me know.

mlissner commented 5 years ago

The chief....clerk of the court in Texas Supreme is super active on Twitter. If I had a hope of getting something out of Texas, he's where I'd start. OTOH, I suspect the AG is its own thing and they've got a phone number? Maybe?

arderyp commented 5 years ago

@mlissner no response yet, and I don't have Twitter.

mlissner commented 5 years ago

Left a (regrettable) message with the "Open Gov't Hotline" at the TX AG. It's mostly for FOIA, but it might have somebody with the right kind of mentality to help us, so I'm giving it a shot. They promise to call back in a few days.

mlissner commented 5 years ago

OK, no surprise except that they called back after hours: I've been sent onwards to the AG constituent services. Let's find out if they consider us a constituent.

arderyp commented 5 years ago

best of luck

mlissner commented 5 years ago

Well, I spoke to Albert in the constituent affairs division. He forwarded it to the website division, who may get back to me. I'll try to stay on this. The number is: 512-475-4413.

mlissner commented 5 years ago

TIL that they take lunch extremely seriously. If you call during lunch hour (10-11 PST), somebody answers, but you have no choice but to call back later. Sigh.

mlissner commented 5 years ago

Just tried again. Left a message. So during lunch they have somebody to answer phones. After lunch they do not. Sigh.

mlissner commented 5 years ago

Alright, so I talked to their public affairs office again and they just say to forward a message via the contact form on the site, so that's probably a waste of time. They actually have an "Opinions Committee" as well, so I called them, but they pointed me to "constituent affairs" as well. Looks like the only options here are to do a letter writing campaign (ugh), or to deal with the site as it is. There's also a way to subscribe to get emails about these, so I'll sign us up for that via our usual email address for that purpose.

So I think if we want to get these opinions we need to either throw some labor at it every day (not possible at the moment), figure out what West/Lexis uses for the title of these (anybody able?), or just put some sort of placeholder for the label that we dream up.

arderyp commented 5 years ago

oof, sounds like all bad options. I would definitely be interested to hear from people (West/Lexis) who are successfully scraping these opinions, but I have no contacts there myself.

mlissner commented 5 years ago

I was thinking they probably use humans to solve this, but that it might be good to see what their solution looks like if anybody can pull up an example in West/Lexis.

arderyp commented 5 years ago

I also don't have access to those systems to compare myself, but surely someone else does.

flooie commented 1 year ago

I updated the Texag scraper to use @ix4's suggestion. They don't produce that many opinions, and the sidebar actually posts the most recent ones. Currently there are only three in the last two months. I suggest in my PR that we just scrape that for now; we can write a back scraper for missed opinions at some point.