freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Merge the FJC's Integrated DB into the RECAP Archive #880

Closed mlissner closed 5 years ago

mlissner commented 6 years ago

The FJC hosts a pretty killer database that has all of the PACER cases in it and is updated quarterly:

https://www.fjc.gov/research/idb

Frankly, it's pretty amazing. Whenever it's convenient, I import their latest data into a table that I maintain in our DB. This table isn't visible anywhere, but I use it from time to time for things like bulk downloads, since it lets me look up all the cases with a given NOS (nature of suit) code and download them.
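
For a sense of what that lookup looks like, here's a minimal sketch, assuming a hypothetical `FjcIntegratedDatabase` Django model mirroring the imported IDB table (the real table and field names may differ):

```python
from django.db import models


class FjcIntegratedDatabase(models.Model):
    """Hypothetical mirror of one row of the FJC's Integrated DB."""
    district = models.CharField(max_length=10)       # court identifier
    docket_number = models.CharField(max_length=32)  # e.g. "1:16-cv-00745"
    nature_of_suit = models.IntegerField()           # NOS code, e.g. 442


# All cases with NOS 442 ("civil rights: jobs"), ready for a bulk download:
nos_442 = FjcIntegratedDatabase.objects.filter(nature_of_suit=442)
```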

Rather than have this in a separate DB, we should merge this data into RECAP. Several reasons:

  1. If we did this, RECAP would have a complete set of cases. Not having every case confuses people regularly: "I looked for case xyz and nothing turned up? What gives?" It's a fair question!

  2. There's a ton of metadata in the IDB that we could merge into RECAP. This would be a huge differentiator for RECAP since some of this isn't even stuff you can get from PACER itself, at least not as far as I know.

  3. Our count of cases and our conception of completeness would change drastically. It'd be great to get this data.

Problems:

  1. When using PACER, you'd always see the option to get the docket from CourtListener, because we'd have them all. This isn't good since most of them would be IDB stubs with no parties or docket entries.

    I think this could be solved pretty easily by keeping a boolean flag that indicates whether it was solely an IDB-sourced docket. If so, then you know it shouldn't be shown by the RECAP Extension.

  2. ~When you place a query in RECAP, you could get back a lot of results without docket entries. Maybe this is fine since you'd only get back results based on fields that matched, like the case name. Maybe it'd be annoying. I'm not sure, but we should be ready for it and possibly have a field in Solr that allows us to avoid showing those kinds of results (kind of like our "only show PDF" results option).~

    Actually, on second thought, these items wouldn't come back at all! I forgot that RECAP search is first and foremost a docket entry query. If these items don't have a single docket entry, they won't wind up in Solr at all. I'm not sure what the solution is for this, but looking at the code it'll be pretty hard since, for example, the ID in Solr is the ID from the RECAP Document table.

  3. How do we make decisions about canonical data? Which has the best data, PACER or IDB? I suspect this will vary on a field by field basis. For example, we know that the case names in the IDB are terrible, but it has dozens of other fields that seem pretty decent.

mlissner commented 6 years ago

These are the required fields in Solr:

```xml
<!--IDs-->
<field name="id"
       type="int"
       indexed="true"
       stored="true"
       required="true"
       multiValued="false"/>
<field name="docket_entry_id"
       type="int"
       indexed="true"
       stored="true"
       required="true"
       multiValued="false"/>
<field name="docket_id"
       type="int"
       indexed="true"
       stored="true"
       required="true"
       multiValued="false"/>
<field name="court_id"
       type="string"
       indexed="true"
       stored="true"
       required="true"
       multiValued="false"/>
```

Only IDs, really. Interesting. If we add the FJC info, we'd lack the id and docket_entry_id fields and would have to fake them. I think we could make this whole thing work — maybe — by making docket_entry_id optional and by putting a namespaced id into the regular id field, since it can't be made optional.

For example, since we wouldn't have a document ID, we could just use negative numbers as our trigger. It's either that or make them obscenely large, but that spells trouble down the road. This needs more thought before such a kludgy solution is adopted.
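
To make the idea concrete, a sketch of the proposed keying (purely illustrative, not committed code):

```python
def solr_id_for(docket, recap_document=None):
    """Compute the Solr ``id`` under the proposed kludge.

    Document-backed rows keep their positive RECAPDocument ID; IDB-only
    dockets get a negative, docket-derived ID so the two namespaces can
    never collide, and ``docket_entry_id`` is simply omitted for them.
    """
    if recap_document is not None:
        return recap_document.pk  # the normal case today
    return -docket.pk             # IDB stub: negative numbers as the trigger
```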

johnhawkinson commented 6 years ago

I have several thoughts :) Sorry for the delay.

> Rather than have this in a separate DB, we should merge this data into RECAP.

This sounds reasonable, but of course it's speaking to the underpinnings of how the database is implemented rather than the outcomes that users would see. The latter is probably the more important starting point.

> R1. If we did this, RECAP would have a complete set of cases. Not having every case confuses people regularly: "I looked for case xyz and nothing turned up? What gives?" It's a fair question!

But obviously not. Many people are looking for cases filed this morning, or yesterday, or mentioned in this morning's newspaper (so filed earlier in the week), or filed last month and covered in a magazine article they just read. So this would not be a "complete set" of cases; it would still miss a lot of recent ones. But there are other good ways to get those recent cases (e.g. RSS feeds are more and more prevalent), and of course by the time an opinion comes out, there's a fair chance the case opening has made it into the FJC IDB, though of course not the case resolution.

> P1. When using PACER, you'd always see the option to get the docket from CourtListener, because we'd have them all. This isn't good since most of them would be IDB stubs with no parties or docket entries.

We're kind of here already, in concept but at a smaller scale, with the written opinion scraper. As long as a case is resolved with a free opinion (most cases, I would allege), it'll show up in RECAP, and if no one else with the extension ever looked at it (presumably the likely situation for most cases, which are not of particular news interest), we'd expect the coverage to be zero.

And yes, we're here, and it's already annoying. I used to be able to rely on the last-update date of the docket report as added to DktRpt.pl by the extension as the starting point for running a new docket report. But now it's basically useless in courts that have reasonable RSS feeds (an end-state we want). So yes, you're right, we need to do something about this.

> I think this could be solved pretty easily by keeping a boolean flag that indicates whether it was solely an IDB-sourced docket. If so, then you know it shouldn't be shown by the RECAP Extension.

So no. And again, you're looking at implementations rather than feature tests. Maybe a flag that indicates whether there is any party information and any meaningful docket information. It's a good open question as to what level of docket information qualifies as meaningful. A single opinion from the written opinion scraper? What about some RSS-derived entries but no actual result of either a Docket Report run (or DHR) or PDF uploads?

But I would tend to err on the side of putting more information in what the extension shows. Maybe only for beta testers at first; then we spend a few weeks using RECAP and seeing what information is actually useful and how to organize it. We could wireframe/mock a few choices to help us think about this, but I think actual testing is the way to go. There's a lot more information that could easily be there. What if there were a Tuftean "sparkline" of docket entries?
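
To make the feature-test framing concrete, here's a minimal sketch (every name and threshold is a placeholder; what counts as "meaningful" is exactly the open question above):

```python
def worth_showing_in_extension(docket):
    """Feature test: does this docket have content a PACER user would want?"""
    has_parties = docket.parties.exists()
    # A single scraped opinion or a handful of RSS-derived rows may or may
    # not qualify; the cutoff below is a placeholder for that judgment call.
    has_meaningful_entries = docket.docket_entries.count() >= 2
    return has_parties or has_meaningful_entries
```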

> P2. ~When you place a query in RECAP, you could get back a lot of results without docket entries. Maybe this is fine since you'd only get back results based on fields that matched, like the case name. Maybe it'd be annoying. I'm not sure, but we should be ready for it and possibly have a field in Solr that allows us to avoid showing those kinds of results (kind of like our "only show PDF" results option).~

> Actually, on second thought, these items wouldn't come back at all! I forgot that RECAP search is first and foremost a docket entry query. If these items don't have a single docket entry, they won't wind up in Solr at all. I'm not sure what the solution is for this, but looking at the code it'll be pretty hard since, for example, the ID in Solr is the ID from the RECAP Document table.

Well, there's a good argument that this behavior of the search is confusing and unexpected! (It seems to have surprised you, who should know it best). This is a good reminder that any point where the FLP schema diverges from the CMECF schema is a source of cognitive dissonance at best, annoying feature gaps at worst, and serious bugs at overworst [overwürst?].

> P3. How do we make decisions about canonical data? Which has the best data, PACER or IDB? I suspect this will vary on a field by field basis. For example, we know that the case names in the IDB are terrible, but it has dozens of other fields that seem pretty decent.

"Uhoh."

At some point we're going to need to keep a history table of modifications because we have so many conflicting sources. And this is only going to get worse. Probably best to think this through sooner rather than later, regardless of whether that thinking leads to speedy implementation.

At the moment our answer is "use the latest thing we got." But you're right we need to be smarter about it.

> These are the required fields in Solr:

Oof, that was hard to read. That is:

```
id               int
docket_entry_id  int
docket_id        int
court_id         string
```

I feel like Solr is indexing the wrong thing, or at least, not indexing the right thing, and that is part of the problem. I should probably sit down and read the Solr documentation and look at your code and see what I think. Also Moonrakr.

> I think we could make this whole thing work — maybe — by making docket_entry_id optional and by putting a namespaced id into the regular id field, since it can't be made optional. For example, since we wouldn't have a document ID, we could just use negative numbers as our trigger. It's either that or make them obscenely large, but that spells trouble down the road. This needs more thought before such a kludgy solution is adopted.

That's terrible. You could make a new field for this purpose, add it to the requisite tables, and join on it; that would be a much better option. Overloading a field with negative numbers is always a bad idea, and doing so with "obscenely large" numbers is like a double red card. Best to avoid getting ejected from the game.

mlissner commented 5 years ago

Finally working on this a bit, and a vision is coming together. There are a few pieces of this that matter:

  1. RECAP Extension results
  2. The front end
  3. The API and database
  4. The data itself
  5. The search results (where the focus has been so far)

Let's analyze these in turn.

RECAP Extension Results

I think this one is simple for now. There are two kinds of lookups that RECAP can do:

  1. Looking up a document. This won't be affected by these changes, since we're not adding any documents.
  2. Looking up a docket. This is what we discuss above, and it is a real issue.

The solution is pretty simple: until we can do a better job in the extension, just don't show dockets that have the IDB as their single source (see the sketch below).
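
A sketch of that filter on the lookup side (the source value is hypothetical; the real thing might be a boolean or a bitmask on the Docket):

```python
IDB_ONLY = 4  # hypothetical sentinel: "this docket's only source is the IDB"


def dockets_for_extension(docket_qs):
    """Answer RECAP Extension lookups, hiding IDB-only stub dockets."""
    return docket_qs.exclude(source=IDB_ONLY)
```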

Thus, a todo:

  - [ ] Flag dockets whose only source is the IDB so the extension lookup can skip them.

The Front End

The approach I plan to take here is to present this information in a new tab for each docket. There are something like 50 fields in the IDB, so displaying them above the tabs would be difficult. If we tuck them into a tab, we'll have the information available for folks, and we can include a blurb in there explaining what the data is and where it came from.

These tabs currently get their own URLs. Currently we have:

/docket/4214664/national-veterans/
/docket/4214664/parties/national-veterans/

We also have documents in a similar namespace. This is document 1, attachment 1 from the above docket:

/docket/4214664/1/1/national-veterans/

So following this approach, I think we wind up with a tab called "IDB Info" that's at:

/docket/4214664/idb/national-veterans/
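
In Django terms that's one more pattern alongside the existing docket URLs; a sketch (the view and route names here are illustrative, not the real ones):

```python
from django.http import HttpResponse
from django.urls import re_path


def view_docket_idb_data(request, docket_id, slug):
    """Hypothetical view rendering the "IDB Info" tab for one docket."""
    return HttpResponse(f"IDB data for docket {docket_id}")


urlpatterns = [
    re_path(
        r"^docket/(?P<docket_id>\d+)/idb/(?P<slug>[^/]*)/$",
        view_docket_idb_data,
        name="view_docket_idb_data",
    ),
]
```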

The API and Database

We currently have two tables, one for IDB entries and one for Dockets. Dockets are available via our API at a Docket endpoint. I think the easiest thing to do here is to simply do a 1-to-1 join between these tables.

That'd add the information to the API as a dict in a new key called idb_data, which would be a non-breaking change to the API.

On the DB layer, it's just a new foreign key between the tables and it'll be easy (and efficient) to join these together whenever needed.

By keeping these mostly separate, it stays easy to pull fresh data from the IDB and incorporate it into the IDB tables without dealing (too much) with the Docket tables.
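
A sketch of that relationship (model and field names hypothetical; the real migration would differ):

```python
from django.db import models


class Docket(models.Model):
    # Nullable 1-to-1 link to the matching IDB row: the tables stay
    # separate (so quarterly IDB refreshes don't touch Docket), but the
    # join is cheap and the API can nest the row under `idb_data`.
    idb_data = models.OneToOneField(
        "FjcIntegratedDatabase",
        null=True,
        blank=True,
        related_name="docket",
        on_delete=models.SET_NULL,
    )
```

On the API side, the serializer would nest the related row under that same key, so existing clients only ever see one new field.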

The Data Itself

As far as the data itself goes, I need to decide how much of the data from the IDB should get incorporated into the Docket table. All of the data will be available via the IDB table (which will even be in the front end), so the only concern is how the Docket table gets data.

There are three types of fields to think about in the FJC data:

  1. Ones without analogous fields in the docket (for these we do nothing)
  2. Ones with populated analogous fields in the docket (these we skip)
  3. Ones without populated analogous fields in the docket (these we populate from the IDB data)

Some examples:

  1. Using an item from the IDB, we find the matching docket in our Docket table. For each of the analogous fields, we check if it's populated. If it is, we ignore it. If it isn't, we populate it with IDB data.

  2. Using an item from the IDB, we discover that the item is not in the Docket table. For each of the analogous fields, we populate it in the Docket table.

The assumption here is that the data in the Docket table is higher quality and more recent than the data in the IDB. For now this is the safe assumption, since it merges less data. If we later find that we should prefer the IDB data and overwrite the Docket data with it, we can do so then.
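
In code, the merge rule might look like this sketch (the field mapping is illustrative, not the full ~50-column list):

```python
# IDB column -> Docket column, for fields that have an analogue.
IDB_TO_DOCKET_FIELDS = {
    "date_filed": "date_filed",
    "date_terminated": "date_terminated",
    "jurisdiction": "jurisdiction_type",
}


def merge_idb_into_docket(idb_row, docket):
    """Copy IDB values onto the docket without clobbering existing data."""
    for idb_field, docket_field in IDB_TO_DOCKET_FIELDS.items():
        if not getattr(docket, docket_field):  # only fill empty fields
            setattr(docket, docket_field, getattr(idb_row, idb_field))
    docket.save()
```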

Search Results

This remains a big question mark. I think the easiest solution is to create a fake docket entry for each docket. That'll allow the item to show up in search without having to change our search architecture. The fake docket entry can just be stubbed out and carry a special flag saying it's a stub; when we get the actual docket, we make sure to delete that stub.

The alternative to this is to either:

  1. Not make these items searchable (a shame, but not that horrible, and anyway we'll be getting docket entries for a good percentage of these soon)

  2. Purchase at least one docket entry for each of these (hmm, that'd cost too much)

  3. Change our Solr schema, except this looks very hard since the ID field in the Solr schema is always the ID from the docket entry, not the docket, and I can't think of a good way to even get around this.

I think for now the solution is to keep these out of Solr. That'll be OK for now since people probably won't notice they're missing, and fairly soon we'll have a lot more of these dockets anyway. Ideally we'd get them in there, but I don't see an easy way for the moment and I don't want this issue to block progress here entirely since I think the rest is pretty straightforward.
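
Since Solr rows are docket entries, keeping these out of the index is mostly automatic; the one guard worth making explicit is something like this sketch (names hypothetical):

```python
def should_index(docket):
    """IDB-only dockets have no entries, so they never belong in Solr."""
    return docket.docket_entries.exists()
```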

mlissner commented 5 years ago

As far as how to pull all this together, I think the process is: