bdheath / Big-Cases

The basic code behind the @big_cases Twitter bot

Upload to the RECAP Archive, also a few other things. #3

Closed johnhawkinson closed 6 years ago

johnhawkinson commented 6 years ago

I spent a while this week addressing some low-hanging fruit in Big Cases. The bulk of this patch is support for uploading to the RECAP Archive. It's from me with a little bit of help from @mlissner.

Note that this requires authorization. @bdheath, your personal CL API token should be enabled (add it to bigcases_settings.py), although you should probably register a separate one for the bot and use that.

This also brings along for the ride the metadata that Aaron asked for recently on Twitter in this thread https://twitter.com/johnhawkinson/status/946201604461678595.

I don't have the whole Big Cases infrastructure running, so the changes there are untested. But the recapupload.py code has been decently well-tested (with an out-of-tree driver), and the changes to add it are minimal.

And then some typo fixes. I made no attempt to conform the existing code to PEP-8, although recapupload.py complies.

There are some design choices in Big Cases that made this trickier. It would be nice to just upload the raw RSS XML to RECAP, except that's not saved in the database. (Also RECAP doesn't have an endpoint to receive it, but that could be arranged. Indeed, Mike would prefer to do it that way, but I convinced him that this way was the speediest. But a future rev could use that. Such an endpoint would reduce the roundtrips.) It can be partially reconstituted from the fields in the Big Cases database, though.

Also, it would have been nice if Big Cases could save the parsed datestamps instead of the raw RSS/XML datestamps, so this code doesn't need to know about RSS time formats (import feedparser). Oh well.
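For what it's worth, the raw RSS datestamps are RFC 822 strings, so a future rev could normalize them with the standard library alone, no feedparser needed. A minimal Python 3 sketch (the sample string is made up, not from a live feed):

```python
from email.utils import parsedate_to_datetime

# PACER RSS entries carry RFC 822 datestamps; this one is illustrative.
raw = "Fri, 29 Dec 2017 18:00:00 GMT"
parsed = parsedate_to_datetime(raw)           # timezone-aware datetime
stamp = parsed.strftime("%Y-%m-%d %H:%M:%S")  # normalized form for the DB
```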

This code doesn't make any attempt to deal with the case of merged attachments. That's probably worthy of some attention.

I think there's something else I'm forgetting but I'll add it in later if so.

mlissner commented 6 years ago

I just read through this and it looks great to me. I haven't tested it of course, but it passes my muster. Some small things I noticed:

  1. RecapUpload can only ever return None, but you're capturing it when you do recap = RecapUpload(...). Any reason to capture the returned value?

  2. We're adding a bunch more code here that looks like it could block the next step of the process. Is it worth wrapping the call to RecapUpload in a try/except so that if/when it fails it doesn't block other things? Or is it better to let it fail and have Brad figure out the failures as they happen and catch them properly?

  3. Your get and post requests — indeed, all such requests in the program — need to have timeout parameters or else they can make the program hang indefinitely. I think that usually happens when the server is old or weird, but I've seen lots of indefinite hangs caused by this, and they're always irritating. This is a really big wart in the python requests library.

Anyway, none of this really affects functionality much. It's mostly just optimizations and tweaks. On the whole this looks really great to me, and I'm super excited that you found time to work on this.

johnhawkinson commented 6 years ago

RecapUpload can only ever return None, but you're capturing it when you do recap = RecapUpload(...). Any reason to capture the returned value?

That's not how classes work in Python. The class initializer __init__() must always return None. When a new object is created, Python calls __new__() to create the instance and then calls __init__() on it. That said, it's true we're not using the returned instance, because everything happens in the initializer. I think the answer is, "this is more Pythonic."
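To illustrate the point with a toy class (the names here are made up, not from the patch):

```python
class UploadDemo:
    # __init__ only initializes an instance that __new__ has already
    # created; Python requires __init__ to return None.
    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.uploaded = True  # all the work happens in the initializer

# The assignment binds the new instance, not __init__'s return value.
demo = UploadDemo("12345")
```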

  2. We're adding a bunch more code here that looks like it could block the next step of the process. Is it worth wrapping the call to RecapUpload in a try/except so that if/when it fails it doesn't block other things?

I don't throw any exceptions explicitly, and I think the implicit ones are few and far between. I don't think this is necessary, but Brad should feel free.

  3. Your get and post requests — indeed, all such requests in the program — need to have timeout parameters or else they can make the program hang indefinitely.

Sure, I'll do that.
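Something like this, presumably (the function name and timeout value are placeholders, not from the patch):

```python
import requests

TIMEOUT = 60  # seconds; a placeholder value, tune to taste

def fetch(url):
    # timeout= bounds both the connect and the read phases; without it,
    # a stalled or weird server can hang the whole bot indefinitely.
    try:
        resp = requests.get(url, timeout=TIMEOUT)
        resp.raise_for_status()
        return resp
    except requests.exceptions.Timeout:
        return None
```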

mlissner commented 6 years ago

The exceptions I'm thinking of are the ones requests throws if a server is down, for example. We could start by catching those?
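A sketch of what that could look like (safe_upload is hypothetical, not in the PR); it catches only requests' network-level exceptions, so programming errors still propagate:

```python
import requests

def safe_upload(upload_fn, *args, **kwargs):
    # RequestException is the base class for the connection errors,
    # timeouts, DNS failures, etc. that requests raises.
    try:
        return upload_fn(*args, **kwargs)
    except requests.exceptions.RequestException as exc:
        print("RECAP upload failed: %s" % exc)
        return None
```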

johnhawkinson commented 6 years ago

I think there's something else I'm forgetting but I'll add it in later if so.

Ah yes, encoding! So, as far as I can tell, Big Cases doesn't do character entity decoding, so one gets bizarro-looking links like this:

>>> import feedparser
>>> f=feedparser.parse('https://ecf.dcd.uscourts.gov/cgi-bin/rss_outside.pl')
>>> f.entries[0].description
u'[Order on Motion for Extension of Time to File Response/Reply]'
>>> f.entries[1].description
u'[Compel] (<a href="https://ecf.dcd.uscourts.gov/doc1/04506367896?caseid=189074&amp;de_seq_num=179">34</a>)'

With &amp; instead of &, or worse. Instead of:

>>> from HTMLParser import HTMLParser
>>> HTMLParser().unescape(f.entries[1].description)
u'[Compel] (<a href="https://ecf.dcd.uscourts.gov/doc1/04506367896?caseid=189074&de_seq_num=179">34</a>)'

As far as I can tell Big Cases doesn't do this unescaping. Yet it seems to work fine. So maybe there's unescaping going on that I could not find, in which case this should be removed:

https://github.com/johnhawkinson/Big-Cases/blob/32becef81d2ee60658c29c2c78111a3f15af27ad/recapupload.py?utf8=%E2%9C%93#L84-L91

I guess the more likely result is that PACER doesn't actually care, since the de_seq_num parameter is not required, and so a case where it is instead passed as amp;de_seq_num doesn't cause a problem.
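(Aside: the REPL above is Python 2; under Python 3 the equivalent unescaping lives in the stdlib html module. A sketch, with the description inlined as a literal rather than fetched from the feed:)

```python
from html import unescape

# Sample description string copied from the REPL session above.
desc = ('[Compel] (<a href="https://ecf.dcd.uscourts.gov/doc1/04506367896'
        '?caseid=189074&amp;de_seq_num=179">34</a>)')
clean = unescape(desc)  # &amp; becomes a bare &
```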

johnhawkinson commented 6 years ago

Also, it seems like this: https://github.com/bdheath/Big-Cases/blob/841f225790ec60419a403f18342a1ff653cd6051/bigcases_scrape_docs.py#L110

isn't great. Since users download PDF files from DocumentCloud and this controls their name, it seems like the name should be meaningful in some way. And pid is a totally arbitrary field dictated by the Big Cases database, not by anything from the court or PACER or CM/ECF. That's really beyond the scope of this PR, but also seems worth mentioning. The IA RECAP filenames (gov.uscourts.…) are an obvious choice that's easy to generate.
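If memory serves, the IA scheme concatenates the court, the PACER case ID, the document number, and the attachment number, so generating it is trivial. A sketch from memory (verify the exact scheme against RECAP before relying on it; the function and parameter names are mine):

```python
def recap_filename(court, pacer_case_id, doc_num, att_num=0):
    # e.g. "gov.uscourts.dcd.189074.34.0.pdf"; att_num 0 conventionally
    # means the main document rather than an attachment.
    return "gov.uscourts.%s.%s.%s.%s.pdf" % (
        court, pacer_case_id, doc_num, att_num)
```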

johnhawkinson commented 6 years ago

FYI, I forked/filter-branched this code into its own repo for maintainability reasons, and because I wanted to have some issue tracking that was independent of Big Cases. So that's at https://github.com/johnhawkinson/recapupload.

@mlissner that's probably the place to open issues if you want it to use an RSS/XML endpoint, or the docket attachment upload, etc.