Upload RECAP content to Internet Archive

mlissner commented 6 years ago

The end of quarter is fast approaching, and we need to upload everything that's changed this quarter in the RECAP Archive. For a sense of scale, as of now, that's:

About 100k dockets:

Docket.objects.filter(date_modified__gte='2017-09-01', source__in=Docket.RECAP_SOURCES).count()`)

About 2.9M docket entries:

DocketEntry.objects.filter(date_modified__gte='2017-09-01').count()

About 4.7M document metadata files:
```
RECAPDocument.objects.filter(date_modified__gte='2017-09-01').count()
```
(This number is insanely high this quarter because of issue #774, which scraped millions of IDs we formerly lacked.)
An unknown number of actual PDFs, but it's probably something in the hundreds of thousands. Somehow we need to figure out which PDFs were uploaded during this quarter. Oddly, I don't think we store this in the database. We may have to use the creation date of the files themselves.

This is...a lot to upload. We've got a few options:

Try to do it as similarly as possible to the old format: One XML or JSON file per docket, and then a bunch of documents.

Pros: A reasonable number of files. Familiar format. Cons: Need to render entire dockets even if only one item on it changed. Rendering big dockets can take a very long time. Doesn't create versioned snapshots of the data. Questions: How will people know what's new? RSS feeds certainly won't work.
Just generate and upload a single tar file per quarter or one per court.

Pros: Fairly easy to work with for consumers. Allows us to later make a tar of literally everything, if we have a need. Don't have to render complete dockets. Creates versioned snapshots of the data. Only one file to upload. Cons: Not the format people are familiar with. One massive file can be hard to work with (you have to download the whole thing to see what's inside, for example). Generating this kind of file takes space locally, and the process can fail. Questions: Do people care about getting per-court files? Is it worth making a sample file, with, say, 1000 items?
Upload one JSON file per changed object type, and put them all in a directory on IA. For example, upload the docket (which has metadata), the

Pros: Easier to consume. Closer to current format. Cons: A LOT of small files each quarter. Might not be feasible to even upload like this in a reasonable time frame. Probably not possible to know what's part of the latest dump. No versioning of files.
Database dump of changed data + tar of PDFs.

Pros: Fairly easy, I think, to generate, probably faster than generating JSON. Cons: Dump might include fields that shouldn't be shared. Not a super useful format.

Having walked through these, I think I'm leaning towards generating one file per quarter and uploading that. It'll be a big change for consumers, but I think it's a reasonable way forward. It provides the highest fidelity of the data, is a reasonable number of files to upload to IA, will be taxing but not horrid on our CPU (I hope), and will be somewhat easy to work with. It'll also provide a clear, "This is what changed" statement for consumers, which I think is something that's been lacking in the past.

I'd love more thoughts on this from the community.

mlissner commented 6 years ago

Copying @jjjake, an engineer at IA, in case some of the numbers here freak him out.

@jjjake, just to refresh your memory, I think we're one of the bigger uploaders to IA via the RECAP Project. The data itself is from federal court cases. Historically, we have had one Internet Archive Item per case, and then placed an XML file with metadata about the case, and a bunch of actual PDFs inside that Item.

About a month ago, we rewrote most of our architecture, and we haven't been uploading much since. The changes I'm proposing here are to try to create a new system for this that we'll be pushing once per quarter, instead of incrementally.

You commented before that I could pull you into IA-related convos, so I thought I'd do so here in case you see any issues with our plans. Any ideas or help that you can chime in with, if you feel like we're going off the rails or doing something that'd be bad for IA would be greatly appreciated. Thank you as always.

kmayer commented 6 years ago

Random, jet-lagged thoughts:

Breaking the data format will cause pain, so you should set a new “path” so that when traversing old data, ETLs won’t asplode on a different format. Not knowing much about the current format, choose to break it carefully. If speed is an issue, can you parallelize the rendering of the dockets? What’s the snapshot of? The dockets or the changes to the dockets? I’m unsure about versioning. We keep the original html of each docket report that we collect, so we can go back and review changes from canonical data. Big file downloads start to break down at about 5GB (for my residential cable-internet service). If the dump file is bigger than that, can it be resumed or split? I don’t know what the common use-cases are for the Internet Archive snapshots. Are users concerned about single cases, single courts, or larger aggregates? Optimize for the most common use case and provide translations for the others. SQL dumps are lowest common denominator formats. You can redact fields during the data dump (some special configuration of which columns from which tables).

johnhawkinson commented 6 years ago

Just generate and upload a single tar file per quarter or one per court.

Pros: Fairly easy to work with for consumers. Allows us to later make a tar of literally everything, if we have a need. Don't have to render complete dockets. Creates versioned snapshots of the data. Only one file to upload. Only one file to upload. Cons: Not the format people are familiar with. One massive file can be hard to work with (you have to download the whole thing to see what's inside, for example). Generating this kind of file takes space locally, and the process can fail. Questions: Do people care about getting per-court files? Is it worth making a sample file, with, say, 1000 items?

Err.

I'm not sure if you are distinguishing between the upload format used to transfer data to IA and the format the IA will present to users.

It sounds from the above like you're not distinguishing. In which case, I think it would be a huge problem to change the format, and especially to change the major characteristics. To have a gigantic tar file for each court rather than to be able to browse an HTML page for each docket?

That's a huge change and IMNSHO very unwelcome, and will cause a lot of criticism from people who are already (not unreasonably!) critical of the unexplained quarterly update frequency.

I don't think I would concur that it's "fairly easy to work with for consumers."

I think I'm leaning towards generating one file per quarter and uploading that. It'll be a big change for consumers, but I think it's a reasonable way forward.

Disagreement.

mlissner commented 6 years ago

I'm not sure if you are distinguishing between the upload format used to transfer data to IA and the format the IA will present to users.

That's always how it has been. IA doesn't generate any files for us or anything like that, if that's what you're thinking. They just present what we upload, whether it's HTML or XML or whatever.

I don't think I would concur that it's "fairly easy to work with for consumers."

I think this depends on the use case. For this quarter, we've got literally millions of pieces of data that are updated. As a data consumer (i.e. a researcher or an organization), I think I'd rather get that as a single file than to have to do millions of little downloads. In fact, I don't even know how a data consumer would be able to identify all the new stuff that was posted if we pushed it as little files.

As a human on a browser, a multi-GB zip file is pretty terrible, but I think serving an HTML file for each case on IA is a use case that's better met by CourtListener and other organizations. Using IA as a website for this kind of data is pretty inscrutable for most people. (How do you find cases other than by URL hacking? How do you browse anywhere? What's an XML file? Etc.) Posting HTML also creates confusion, since people might think that the HTML there was being updated, when it wasn't (though I suppose a notice to this effect could help.)

If you're advocating for HTML to be generated and posted to IA (as it was in the past), can you say more about what the use cases are?

mlissner commented 6 years ago

Oh, and as for changing the format, I think that's a given. Two reasons:

We have more granularity than we used to and many more fields. The data is just more complex.
The old format, which put every docket entry in a single file, had severe performance issues on big dockets. Ditching that format is kind of a must, just for performance reasons alone. Generating one file cannot take hours to do. That caused all kinds of problems on the old system.

anseljh commented 6 years ago

Do we know what kind of tools people have built to ingest IA data so far? That might help.

mlissner commented 6 years ago

Not particularly. The consumers I know of are:

RECAP users who may have links to old content (in any scheme discussed so far, these links would continue working, but not get updates).
Plainsite, though I don't know what their plan is these days, now that they no longer support us officially.
unitedstatescourts.org, whom I've invited by email to comment on this ticket.

I can't think of many others that are out there, TBH.

bgedelman commented 6 years ago

I tend to link to RECAP dockets stored at IA, both from web pages discussing cases and in internal notes (for myself and research assistants). I link to link to the IA copies, rather than a specific service ingesting RECAP dockets, because I view the IA copies as the copies of record. (Well, PACER would be if they didn't charge!) I agree that it's not always easy to find the IA link, but RECAP itself helped me do so.

My instinct is that it's nice to have a human-readable version of the case archives in IA. That's part of the overall RECAP project as I always understood it -- RECAP gathers the case documents and dockets, then reformats them in a way that can be preserved by IA indefinitely and read by users with standard tools. Downloading a huge ZIP would be much less useful to readers.

I'm sure it's a pain to make the HTML, yes, but it only has to be done periodically (I guess once per quarter if that's all RECAP is going to be providing to IA going forward, though I still think more frequent would be preferable and should be just as feasible). Seems worth it to me given the benefits to the many users who want to read the materials and also the benefit of having an authoritative archive.

bgedelman commented 6 years ago

As an example of a RECAP-generated IA-archived page I actually use: http://ia801908.us.archive.org/8/items/gov.uscourts.cand.300367/gov.uscourts.cand.300367.docket.html . That's a link I took out of a Dropbox I share with coauthors, for an article about Airbnb v. San Francisco and related matters. We are very grateful to have this material preserved by RECAP, available to all of us for free, and indefinitely preserved by IA. I chose to preserve that URL in our Dropbox, 15 months ago, because I knew that URL would let us see further case updates -- as, indeed, it has. I would prefer not to see RECAP retract this functionality.

johnhawkinson commented 6 years ago

Using IA as a website for this kind of data is pretty inscrutable for most people. (How do you find cases other than by URL hacking?

URL hacking is not in of itself bad. With properly designed URLs, URL hacking ought to work, and ought to work well. It may not be the best way, but it can be A Good Way. In fact, when URL hacking doesn't work on a website, it's a good sign things aren't so great...

The IA themselves recommends using wget to do mass downloads: https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/

In fact, I don't even know how a data consumer would be able to identify all the new stuff that was posted if we pushed it as little files.

From the file timestamps?

If you're advocating for HTML to be generated and posted to IA (as it was in the past), can you say more about what the use cases are?

Well, anyone who is depending on the current formats. HTML indexing of XML, PDF, and more HTML. It's a nice format, works great in browsers, it's not hard to generate, it's easy to understand, we should not break past promises if we can avoid it. That is, we should have a strong bias to preferring the status quo, stare decisis let's say, because changes come with costs and so there's value to sticking with what we have. Especially after recent disruptive changes.

Oh, and as for changing the format, I think that's a given. Two reasons:

Option 1 doesn't seem to be a format change? Are you saying it's not a real option? Further, neither of these two reasons seem to be more than declarative statements without support?

We have more granularity than we used to and many more fields. The data is just more complex.

So add fields to the existing format? It's not inextensible. Am I missing something?

The old format, which put every docket entry in a single file, had severe performance issues on big dockets. Ditching that format is kind of a must, just for performance reasons alone. Generating one file cannot take hours to do. That caused all kinds of problems on the old system.

I don't think this problem had anything to do with the format. "Somebody accidentaly wrote some n^5 code, but we're not using that now."

I'm sure it's a pain to make the HTML, yes, but it only has to be done periodically (I guess once per quarter if that's all RECAP is going to be providing to IA going forward, though I still think more frequent would be preferable and should be just as feasible).

Err, I'm not sure what kind of pain we're talking about. My presumption is the bulk of the pain is programming pain. Once done once, it will be done forever. So the "only has to be done periodically" doesn't really apply.

(also, spending 1 time unit per day versus 7 time units per week is still the same amount of time.)

anseljh commented 6 years ago

As a compromise, can we add a big red warning box to the current IA HTML dockets, directing users to the corresponding new CL docket? I am not sure if that is any easier than regenerating the whole HTML -- @mlissner?

Or one better, also automatically redirect IA users to the CL docket, for example with a <meta http-equiv="refresh" content="0; url=https://www.courtlistener.com/docket/4182920/airbnb-inc-v-city-and-county-of-san-francisco/" /> in the <head>?

mlissner commented 6 years ago

I suspect automatic redirects would be unhappily greeted, but I think a warning with a link on the top of every IA HTML docket is probably a good idea. It'd be a fair amount of work, once, but much easier than regenerating every changed docket every quarter.

bgedelman commented 6 years ago

I wrote up a quick piece for my web site with my assessment of the situation and some further recommendations: Keeping Free Law Free.

johnhawkinson commented 6 years ago

@bgedelman: While there's a lot of legitimate criticism, the claim that undergirds your post is incorrect, and IMNSHO grossly misleading:

Under the new plan, rather than making all RECAP-collected documents available to the public as soon as possible, FLP would hold the documents until the end of each quarter for a batch update to the Internet Archive (IA).

The documents are available to the public nearly instantly (seconds — much faster than the earlier RECAP server, which was minutes-to-hours). They're on courtlistener.com rather than archive.org but that doesn't mean they're unavailable.

I don't fully understand what the fee issue is, but I gather it relates to bulk [data?] access from large "customers" (whatever that means) but I don't believe there's any intent to charge individuals for document retrievals.

bgedelman commented 6 years ago

@johnhawkinson : Yes, RECAP now promptly provides all documents on Courtlistener.com. But how should we think about that site? Currently no TOS disallow scrapers and the like. Yet the stated reason for data being sent to CL (and not IA) is to keep RECAP/FLP "sustainable" which I take to mean "making some money" which I take to mean "withholding features from those who don't want to pay." Bulk downloads, information about what's new -- these are natural features to withhold, and indeed we've already heard that bulk RECAP users (those who want much or all RECAP data for whatever reason) are already being asked to pay. But if those features (and bulk customers' willingness to pay for those feature) prove insufficient to raise the revenue FLP wants, will FLP withhold some other features too? You might imagine FLP charging for access to the newest documents, or the oldest ones, or the longest ones, or the ones in the most popular or most obscure dockets... or just about anything, really. And if the data is most or best in CL, and substantially not at IA or elsewhere, FLP might well be able to do it.

That sure isn't what I signed up for when I started contributing data to RECAP and encouraging others to do so. I thought my contributions would be freely available, on a best-effort basis as quickly and completely as possible, to the entire world without restriction. That's what I was interested in.

To me the principle is at least as important: RECAP data is a public resource, not owned by FLP, hence best handled via prompt uploading to an independent repository consistent with the public purpose and public origin.

johnhawkinson commented 6 years ago

@bgedelman my point is narrow. Your blog post says:

"rather than making all RECAP-collected documents available to the public as soon as possible"

which is equivalent to saying "RECAP-collected documents are not publicly available as soon as possible," and that is a falsehood. I think you should correct your blog post to make clear you are concerned with future/speculative (or theoretical) reductions in access, but not with the level of timeliness of access to RECAP documents right here and now.

A reasonable person would read your post to say that RECAP-collected crowdsourced documents are only available on a batched quarterly basis. That misimpression concerns me.

As I said yesterday, I think there's a lot of legitimate criticism of the current approach. But by presenting the situation as if it were worse than it is, it undermines your argument and misleads others about the state of access.

mlissner commented 6 years ago

As a first step here, I'm going to start uploading PDFs on a nightly basis using our old system of URLs. We're already doing this for free opinions, so I just need to tweak that code to do it for the rest of the content too. With any luck, that'll start uploading tonight. It's about 80k PDFs that we've got in the backlog.

jjjake commented 6 years ago

@mlissner My apologies, I just noticed this issue now.

Generally speaking, we're more concerned with the total item count than the number of total files uploaded.

I'm still reading through the issue, but it sounds like 1 docket would mean 1 item on archive.org (so, 100k for this backlog?)? If so, that sounds reasonable to me. As far as files per item, we generally recommend limiting items to 10k files and keeping them less than ~100GB. If that's not possible, let me know and I'll see what we can do (these are somewhat arbitrary guidelines, but there does reach a point where the item breaks/becomes inaccessible).

I'll read through the issue more carefully, but saw you were about to start uploading and wanted to get in touch incase you have any urgent questions about that.

It's great to hear that this is being worked on. I'll make sure to keep an eye out for any issues that come up. Thanks!

jjjake commented 6 years ago

As for your options, I think IA would prefer:

Try to do it as similarly as possible to the old format: One XML or JSON file per docket, and then a bunch of documents.

I think this would make the data more accessible, and be less confusing for people browsing the collection on archive.org. Uploading this way would allow us to derive the PDFs into other formats as well, making them searchable:

https://archive.org/search.php?query=test&and%5B%5D=collection%3A%22usfederalcourts%22&sin=TXT

Some tips for uploading this many files to archive.org efficiently:

Upload to items concurrently rather than uploading files concurrently to a single item. Item's can only run one task at a time, so they will pile up quick and you will get rate-limited once they stack up.
Use ia upload ... --retries=100 (or in python: item.upload(..., retries=100)) to automatically retry on failed requests.

From a technical perspective, creating a single tar file per quarter or court would be fine for us. Also note that you can view the contents of a tar (or zip, iso, etc.) file by appending a slash to the download URL:

https://archive.org/download/IRS990raw-2013_10_T/2013_10_T.tar/

You can also link to a specific file in the tar:

https://archive.org/download/IRS990raw-2013_10_T/2013_10_T.tar/2013_10_T%2F04-2810022_990T_201209.pdf

The single tar file solution would not be ideal from our perspective (mostly in terms of accessibility), but I understand if this is the only reasonable solution for you. At least the data will be getting backed up! : )

(Also note that those URLs will redirect to something like https://ia800404.us.archive.org/tarview.php?tar=/8/items/IRS990raw-2013_10_T/2013_10_T.tar&file=2013_10_T%2F04-2810022_990T_201209.pdf. Any URL like this, with the server number (e.g. ia800404), is NOT a permalink. We shuffle around our data from server to server, so this URL will change. For permanent URLs, the /download/ links should be used.)

In summary, the big concern would be creating too many items in the short term (millions). Let's talk more if that's where you're headed. Otherwise, everything should be fine from our perspective, technically speaking.

jjjake commented 6 years ago

One more thing... If you end up uploading the PDFs directly to the item, it would be helpful if you uploaded with no-derive:

$ ia upload ... --no-derive

Or, in Python:

>>> item.upload(..., queue_derive=False)

I will then setup a throttled job on our end to derive the items in a way that won't consume our whole derive queue (just ping me if you go this route, so I can set this up). Thanks again.

mlissner commented 6 years ago

Thanks for all this info, @jjjake. Yesterday I ended up kicking off the upload before running out the door, so I didn't read these tips until after the fact unfortunately. One thing you suggested was to do use no-derive. I didn't do that yesterday, but we'll be uploading PDFs every night going forward, and I could do that for those. Sounds like that wouldn't be ideal though, since you'd have to set up a throttled process for them each day? (Also, it's at night, so maybe it doesn't matter?)

Re linking to items in a tar or zip: I assume this is horribly slow to load because it has to open the zip when somebody hits the URL. Is that wrong, i.e., do you guys pre-open the zips to support the links to stuff inside of them? That's clever if so.

The rest of the limitations sound good. I don't think they'll be an issue. Thank you again. Super helpful to get these tips. Is it worth putting these into the python library's readme as a performance tips section or something?

jjjake commented 6 years ago

I won't have to kick it off every time. We have an internal tool that can loop over your uploads perpetually, deriving any new items it finds. We do this quite a bit and it's pretty easy to setup.

Do you have an idea of how many PDFs you'll be uploading nightly? It might not matter, but I can check in with our books team to see what they think. It's more of an issue that will come in waves. For example, if another group starts ingesting a lot of PDFs as well. The cause for concern is that a single uploader ingesting many documents can hog the derive queue. This is an issue that we should solve internally, but for now we have this workaround (uploading with no-derive, and having our perpetually looping task do a throttled derive). If you don't have a good idea of how many PDFs you'll be uploading nightly, feel free to keep uploading as is and I can let you know if it becomes an issue.

Yes, the larger the tarballs or zips get, the slower it is to load files from within. We do not pre-open the the zips, but that's an interesting idea!

Yes, these tips should be documented more clearly. I'll work on that, thanks for the feedback!

mlissner commented 6 years ago

Do you have an idea of how many PDFs you'll be uploading nightly?

This year so far, we've averaged about 2100/night. Happy to do the no-derive approach if it helps on your end.

jjjake commented 6 years ago

@mlissner I just checked in with the books team, and they said don't worry about it for now. We'll let you know if it becomes an issue, but you should be fine.

mlissner commented 6 years ago

I'm looking at this issue again and trying to grapple with it. Since my original post, we've scaled up the number of documents we're ingesting considerably. Here are the updated notes:

ago = now() - timedelta(days=90)

# Number of dockets changed in the last 90 days:
Docket.objects.filter(date_modified__gte=ago, source__in=Docket.RECAP_SOURCES).count()
1332736  # 1.3M

# Number of changed docket entries
DocketEntry.objects.filter(date_modified__gte=ago).count()
10676090  # 10.6M!

# Number of changed documents:
RECAPDocument.objects.filter(date_modified__gte=ago).count()
16938025  # 16.9M!

This is a lot.

mlissner commented 6 years ago

Dividing these numbers by 90, this is:

~15k dockets/day
~119k docket entries/day
~188k recap document objects/day

Holy smokes.

mlissner commented 6 years ago

Considering our current JSON format, which mirrors our database format, we can nest the recap documents inside the docket entries. So that brings us down to about 119k + 15k = 134k/day or about 1.3M + 10.6M = 11.9M uploads/90 days (quarter).

That's...a lot.

mlissner commented 6 years ago

OK, so here's one idea: https://hashrocket.com/blog/posts/create-quick-json-data-dumps-from-postgresql

mlissner commented 6 years ago

Alright, here's an example of the JSON I plan to start uploading to IA soon: https://drive.google.com/file/d/11yz4U6VNOfx9sQ2kTIhqRQUTizgwCarc/view?usp=sharing

To anybody that cares, please review this and let me know what you think. This includes nearly all of the metadata fields that we have in our servers (though note this is just dummy data in this example).

Of note:

I've got this down to a stable 19 queries per docket. This is pretty good considering the complexity of the data.
This does not include the text of the documents. I could include it, but since it's a big field, it's much faster to leave it out. Theoretically, people can get it from the documents themselves, and I've never heard anybody looking for this before so I think this should be OK.
This does not include metadata for some fields that are just kind of messy and which we generate ourselves by parsing various fields. Users can do this themselves and use our code if they want.
The performance here still isn't great, but I can do the Lehman Brothers bankruptcy docket on my laptop in about a minute. Server should be faster.

mlissner commented 6 years ago

OK, I'm putting a bow on this and saying it's done. Here's the architecture:

We have an upload script called upload_to_ia.py, which can take an argument to do oral arguments, opinions, or recap data.
That script is run around the clock by supervisor on our old server, and finds dockets that need to be uploaded.
When items are found that need uploading, one Celery task is triggered per item to generate the JSON for it using Django REST Framework. The JSON is then uploaded to the Internet Archive.
Dockets are marked as needing upload whenever they are changed in a material way (for example, they're not marked this way when their view_count field is incremented). They are also marked with information so that you can figure out when they should be uploaded, whether it's this quarter or next.

Marking the dockets with when they need updating is a critical feature because it allows us to only upload a docket once/quarter. If we didn't do that, I don't know how we'd possibly be able to handle the performance at 100,000 docket entries per day.
When no more items are detected as needing upload, a capped exponential backoff is triggered, which slows down how often we look for more items. This way, once the quarter's uploads are completed, we don't continue looking for new stuff as frequently.
When a new quarter begins, the process begins anew, automatically because the script will continue looking for stuff that needs uploading and will suddenly find stuff that's supposed to be uploaded in the newly hatched quarter.

Currently, I have this throttled to a queue length of about five items. This means that in general, we will have about five simultaneous uploads happening whenever we have things that need uploading. I don't currently see a big performance issue with this, and it's currently logging the following to /var/log/ia_recap_log.txt:

INFO:cl.lib.command_utils:Uploaded 41400 dockets to IA so far (92.4/m)

That speed has stabilized over the last several thousand dockets, so I think we can trust it. If we need to boost this speed, we can lengthen the queue, but for now I think it's probably an OK speed to let run for a while. It'll just keep going until it's done.

With this in place, I'll begin prepping a blog post with this announcement and the accomplishment in #861. Assuming nothing comes up, I'll close this issue, finally, once that blog post goes live.

Thank you all for your patience here, and for your input. This wasn't an easy one and I'm happy we're uploading this content again.

mlissner commented 6 years ago

Blog post is out: https://free.law/2018/09/11/uploading-pacer-dockets-and-oral-argument-recordings-to-the-internet-archive/

And the uploader has been running smoothly. Closing this one until I hear otherwise.

freelawproject / courtlistener

Upload RECAP content to Internet Archive #783