freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Create parsers in Juriscraper for the case filed reports #185

Open eads opened 7 years ago

eads commented 7 years ago

E.g. https://ecf.ilnd.uscourts.gov/cgi-bin/CrCaseFiled-Rpt.pl

I came up with a solution for this independently using this library and beautifulsoup:

    from random import randint

    from bs4 import BeautifulSoup
    from juriscraper.pacer.http import PacerSession

    session = PacerSession(username='<MYUSER>', password='<PASSWORD>')

    # Submit the report form as multipart/form-data. The (None, value) tuples
    # make requests send plain form fields rather than file uploads.
    intermediate_resp = session.post('https://ecf.ilnd.uscourts.gov/cgi-bin/CrCaseFiled-Rpt.pl?{0}-L_1_0-1'.format(randint(9000, 20000000)), files={
        "office": (None, ""),
        "case_type": (None, ""),
        "case_flags": (None, ""),
        "citation": (None, "18:1113.F"),
        "pending_citations": (None, "1"),
        "terminated_citations": (None, "1"),
        "cvbcases": (None, "No"),
        "filed_from": (None, "1/1/2010"),
        "filed_to": (None, "7/14/2017"),
        "terminal_digit": (None, ""),
        "pending_defendants": (None, "on"),
        "fugitive_defendants": (None, ""),
        "nonfugitive_defendants": (None, "1"),
        "reportable_cases": (None, "1"),
        "non_reportable_cases": (None, "1"),
        "sort1": (None, "case number"),
        "sort2": (None, ""),
        "sort3": (None, ""),
        "format": (None, "data")
    })

    # The response is an intermediate page; read its form's action attribute
    # to build the URL for the next request.
    intermediate_doc = BeautifulSoup(intermediate_resp.content, 'lxml')
    form = intermediate_doc.find('form')
    action = form.attrs.get('action')
    action_path = action.split('/')[-1]
    url = 'https://ecf.ilnd.uscourts.gov/cgi-bin/' + action_path

mlissner commented 7 years ago

This would be great to add. It looks like it essentially mirrors the FreeOpinionReport or the DocketReport.

Would you be interested in adding a new report for this? I think the goal would be to make it as similar to the other reports as possible. Common methods would probably be:

Since this form can return data, it looks like it's probably a pretty easy one to parse (assuming the pipe-delimited data they return is reasonably sane). I don't know what kind of normalization it would require, but I think that's probably going to be the hard part of this ticket.
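As a rough illustration of the parsing step, here is a minimal sketch that reads pipe-delimited text with Python's csv module. The sample rows are invented for the example, and the helper name `parse_pipe_delimited` is hypothetical; the real report's columns are not described in this thread, so treat the field layout as an assumption.

```python
import csv
import io


def parse_pipe_delimited(text):
    """Split pipe-delimited report text into rows of stripped fields.

    Blank lines are skipped. No header row is assumed, because the real
    column layout is not known here.
    """
    reader = csv.reader(io.StringIO(text), delimiter="|")
    return [[field.strip() for field in row] for row in reader if row]


# Invented sample data, NOT real PACER output.
sample = (
    "1:10-cr-00001 | USA v. Doe | 01/05/2010\n"
    "\n"
    "1:10-cr-00002|USA v. Roe|02/11/2010\n"
)
rows = parse_pipe_delimited(sample)
for row in rows:
    print(row)
```

Normalization (dates, case numbers, party names) would sit on top of this and would depend on what the courts actually return.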

Btw, the fun stuff you're doing in your code sample above to parse the form isn't necessary. I have no idea what the thing is that looks like a nonce on the form, but you can ignore it if you do something like this instead:

https://github.com/freelawproject/juriscraper/blob/57d62655ccdd28f19278e80ab2155dcaddecd7d7/juriscraper/pacer/free_documents.py#L74

eads commented 7 years ago

Ahhhh I was so curious how your code worked around the transaction cost form! That's lovely. And my code was written on deadline, so I was just banging together a thing that worked, not something meant to be good.

I'd be happy to add -- we'll probably want an inheritance tree where civil cases and criminal ones inherit from the same class, since they're basically the same thing with different root URLs.

btw, I'm moving out to the Bay Area (tomorrow) for the next year, would love to hook up with y'all at some point and chat more about how ProPublica is using this. Could you be so kind as to drop me a line? I'm david.eads -AT- propublica.org

mlissner commented 7 years ago

> I'd be happy to add -- we'll probably want an inheritance tree where civil cases and criminal ones inherit from the same class, since they're basically the same thing with different root URLs.

Makes sense.

I've also wondered about refactoring all of this to have a common base object, Report. Some of the patterns are similar across reports, others aren't, but it could help to keep things tidy. The only reason that hasn't happened yet is that I was waiting to see how things shook out as more reports were coded up. Seems like things are gelling around query, parse, and url at least.

I guess the inheritance (if we wanted to do it right), would then be:
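A minimal sketch of one hierarchy that would fit this discussion, not code from juriscraper itself. It assumes a hypothetical `Report` base exposing `query`, `parse`, and `url`, hypothetical subclass names, and a per-court URL pattern generalized from the ilnd URL at the top of the thread. The civil path is a guess; only the criminal path appears in the thread.

```python
class Report:
    """Hypothetical common base holding what every report shares."""

    # Assumed pattern, generalized from the ilnd URL quoted in this thread.
    BASE = "https://ecf.{court_id}.uscourts.gov/cgi-bin/"
    PATH = ""  # set by concrete subclasses

    def __init__(self, court_id):
        self.court_id = court_id
        self.response_text = None

    @property
    def url(self):
        return self.BASE.format(court_id=self.court_id) + self.PATH

    def query(self, *args, **kwargs):
        raise NotImplementedError

    def parse(self):
        raise NotImplementedError


class CaseFiledReport(Report):
    """Hypothetical shared parent for the civil and criminal variants."""

    def parse(self):
        # Placeholder parsing: split pipe-delimited lines into fields.
        text = self.response_text or ""
        return [line.split("|") for line in text.splitlines() if line.strip()]


class CriminalCaseFiledReport(CaseFiledReport):
    PATH = "CrCaseFiled-Rpt.pl"  # path from the URL in the first comment


class CivilCaseFiledReport(CaseFiledReport):
    PATH = "CaseFiled-Rpt.pl"  # assumed, not confirmed in this thread


report = CriminalCaseFiledReport("ilnd")
print(report.url)
```

With this shape the two concrete reports differ only in `PATH`, which matches the point that they are basically the same thing with different root URLs.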

I'll shoot you an email too. It'd be great to hook up. I didn't realize you guys were using this!

mlissner commented 7 years ago

FYI, I just did a big reorganization in c97961818678ef765f3048e10e592cc078649071. You can see more about it in the commit message, but it should make adding these new reports easier — simply extend BaseReport, and you should be on your way.

eads commented 7 years ago

Good deal -- I probably won't get to this for another week or two, but it's still on the roadmap.


-- David Eads | http://recoveredfactory.net

"Medical statistics will be our standard of measurement: we will weigh life for life and see where the dead lie thicker, among the workers or among the privileged." -- Rudolf Virchow

eads commented 6 years ago

Heh, thanks for the bump. Life comes at you fast.

On Tue, Feb 6, 2018 at 11:06 AM, Mike Lissner notifications@github.com wrote:

Assigned #185 https://github.com/freelawproject/juriscraper/issues/185 to @eads https://github.com/eads.


mlissner commented 6 years ago

Wasn't sure if it'd bump you. I was just trying to keep things organized somewhat.

mlissner commented 6 years ago

@eads, I assume no movement here means you haven't gotten to this, but before I work on it (which may happen...soon?), I thought I'd check whether you have any intermediate work you could share. Seems simple enough to do.