Label district court divisions based on case number prefixes

v-anne commented 3 months ago

This is a very small suggestion. The crux of it is to use district court case numbers to inform users which division of a district court a given suit was filed in.

For example, the Middle District of Florida has five divisions, and cases in each are assigned a different divisional number (2, 3, 5, 6, 8). We could take cases from a given district court, look at the leading number in the case number, and identify which division the case was brought in. This could be helpful for people looking at patterns of forum shopping in litigation.

https://www.flmd.uscourts.gov/understanding-case-designation

Many other courts across the country have similar systems.
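As a minimal sketch of the idea above, the leading number can be pulled out of a case number with a regular expression. The case number shape assumed here (`<office>:<yy>-<type>-<sequence>`) and the name `office_code` are illustrative assumptions, not CourtListener code.

```python
import re
from typing import Optional

# Assumed shape of a district court case number: "<office>:<yy>-<type>-<sequence>",
# e.g. "8:21-cv-01234". Case numbers without an office prefix will not match.
DOCKET_RE = re.compile(
    r"^\s*(?P<office>\d+):(?P<year>\d{2})-(?P<type>[a-z]{1,4})-(?P<seq>\d+)",
    re.IGNORECASE,
)


def office_code(docket_number: str) -> Optional[str]:
    """Return the leading office (division) code, or None if there is none."""
    match = DOCKET_RE.match(docket_number)
    return match.group("office") if match else None


print(office_code("8:21-cv-01234"))  # has an office prefix
print(office_code("21-cv-01234"))    # has no office prefix
```

The code is returned as a string rather than an integer so that it can be used directly as a lookup key against whatever list of divisions is eventually collected.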

mlissner commented 3 months ago

Where do you imagine these labels showing up?

v-anne commented 3 months ago

Quick mockup:

I am not too attached to this location, it's just the first one that came to mind.

mlissner commented 3 months ago

Yeah, that's interesting and useful. Not easy though, I'm afraid. We have a table of "Courts", which are more like jurisdictions, and a table of "Courthouses", which can have addresses and things like that, but the model doesn't have a place where division names would fit in, so I guess we'd need that.

Beyond that, we'd need to populate this info for every division, which might not be so hard, but could take some effort to get right. I like the idea though.

v-anne commented 3 months ago

Beyond that, we'd need to populate this info for every division, which might not be so hard, but could take some effort to get right. I like the idea though.

I am working on collecting the divisions for each district court. Might take some time, unfortunately, as I don't know people in every jurisdiction.

mlissner commented 3 months ago

I'm not sure, but does this help: https://free.law/xlsx/fjc/integrated-database/office-codes.xlsx

v-anne commented 3 months ago

I think it helps somewhat, based on the Middle District of Florida, but I think it breaks it into too many categories.

johnhawkinson commented 3 months ago

I'm not sure, but does this help: https://free.law/xlsx/fjc/integrated-database/office-codes.xlsx

Yikes, I continue to think this ancient weird aggregation of stuff should basically never be used for anything…

Beyond that, we'd need to populate this info for every division, which might not be so hard, but could take some effort to get right. I like the idea though.

If we're not already scraping and aggregating all the courtInfo.pl stuff into a table, we should probably start doing so yesterday.

And this info is front and center there, e.g. https://ecf.flsd.uscourts.gov/cgi-bin/CourtInfo.pl:

(by which I mean third).

It's also easy enough to find in the XML (although a notional scraper should grab the HTML because not everything is in the XML, even if the XML is easier to parse, or so goes my vague memory):

  <office>
    <0>Ft Lauderdale</0>
    <1>Miami</1>
    <2>Ft Pierce</2>
    <3>Zantac MDL-2924 ONLY</3>
    <4>Key West</4>
    <9>West Palm Beach</9>
  </office>
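A rough sketch of turning the block above into a lookup table follows. Element names such as `<0>` begin with a digit, which XML does not allow in a name, so a strict XML parser would likely reject the snippet as quoted; this sketch uses a regular expression instead. The function name `parse_offices` is hypothetical, and the real CourtInfo.pl output may differ from the excerpt.

```python
import re
from typing import Dict

# The <office> block exactly as quoted in this thread.
OFFICE_XML = """
<office>
  <0>Ft Lauderdale</0>
  <1>Miami</1>
  <2>Ft Pierce</2>
  <3>Zantac MDL-2924 ONLY</3>
  <4>Key West</4>
  <9>West Palm Beach</9>
</office>
"""

# Matches <N>name</N>, where N is the office code.
ENTRY_RE = re.compile(r"<(\d+)>(.*?)</\1>")


def parse_offices(xml_text: str) -> Dict[str, str]:
    """Map office code -> office name from an <office> block."""
    block = re.search(r"<office>(.*?)</office>", xml_text, re.DOTALL)
    if block is None:
        return {}
    return {code: name.strip() for code, name in ENTRY_RE.findall(block.group(1))}


offices = parse_offices(OFFICE_XML)
print(offices)
```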

I think it helps somewhat, based on the Middle District of Florida, but I think it breaks it into too many categories.

What even does this mean?

v-anne commented 2 months ago

What even does this mean?

The spreadsheet is breaking up the courts into subsections smaller than their actual divisions. Take the Middle District of Florida, for example. Per the court's website, it has 5 divisions:

The spreadsheet says it has 9 subdivisions:

johnhawkinson commented 2 months ago

The spreadsheet is breaking up the courts into subsections smaller than their actual divisions.

Ah. I can see why you would read it that way, but I think it is rather the case that it is listing subdivisions that used to exist. I don't know if that's because it is a snapshot from an older time, or an aggregation over all time. And, I suppose, it is possible that we have cases with office codes that are no longer current offices, although I would tend to doubt it. Another interesting question, for those of us who specialize in corner cases, is office codes that have changed. Of course I'm sure the courts would want to minimize that.

So, although I dislike that spreadsheet as a source of truth, I don't think it will give you incorrect information for current office codes.

v-anne commented 2 months ago

You might be right, particularly if courtlistener has older cases that were filed in some of those divisions. It might not hurt to use the spreadsheet if every current division is a logical subset of those divisions listed in it (meaning no new divisions were created with different case prefixes).
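The subset condition described above could be checked mechanically, roughly as follows. The Middle District of Florida codes come from this thread; the reference set is a placeholder and is not the actual content of the spreadsheet, and `missing_codes` is a hypothetical helper name.

```python
from typing import Iterable, Set


def missing_codes(current: Iterable[str], reference: Iterable[str]) -> Set[str]:
    """Return the current office codes that the reference list does not contain."""
    return set(current) - set(reference)


# Current Middle District of Florida codes, as listed earlier in this thread.
flmd_current = {"2", "3", "5", "6", "8"}

# Placeholder only: NOT the real spreadsheet contents for this court.
reference_example = {"1", "2", "3", "4", "5", "6", "7", "8", "9"}

# An empty result means every current code also appears in the reference list.
print(missing_codes(flmd_current, reference_example))
```

This only checks that the codes exist in both places. It would not catch a code that was reassigned to a different division over time.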

mlissner commented 2 months ago

FWIW, the spreadsheet came from the FJC, and is supposed to be used with the IDB, but it sounds like it'll work pretty well, all in all. The bigger question I have, though, is how we'll add the division information to our model.

johnhawkinson commented 2 months ago

The bigger question I have, though, is how we'll add the division information to our model.

Eh? It's part of the docket number.

v-anne commented 2 months ago

Does the existing schema allow us to apply this change solely to district courts (and not bankruptcy courts)?

mlissner commented 2 months ago

Eh? It's part of the docket number.

Yes, but we need to figure out where in our model to store the division names.

Does the existing schema allow us to apply this change solely to district courts (and not bankruptcy courts)?

Yeah, I think there are getters in cl.search.models.py that can identify just the bankruptcy or district courts. Not sure how easy they'd be to use in the HTML though. Any reason we can't apply this to bankruptcy courts too though?

v-anne commented 2 months ago

To my knowledge, the bankruptcy courts generally don't use the same divisions as they have fewer judges and cases.

johnhawkinson commented 2 months ago

To my knowledge, the bankruptcy courts generally don't use the same divisions as they have fewer judges and cases.

I don't do BK, but in response to your earlier comments I looked at MDFL's RSS feed https://ecf.flmd.uscourts.gov/cgi-bin/rss_outside.pl and its cases are replete with office codes. They do not seem to be the only ones.

mlissner commented 2 months ago

Sounds like an easy way to do this is to use the office code only if we've got it, and to stick with what we're doing now if we don't. I think that should cover bankr and district nicely. My one assumption is that bankruptcy courts within the same district do have the same office codes as their district brethren. I think that's safe, but if it's not I expect we'll learn as much very quickly.
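The fallback described above might look something like this sketch: show the division when the office code is known, and otherwise keep the court name as it is today. The function name `court_label`, the display format, and the two-entry office table are assumptions for illustration only.

```python
from typing import Mapping, Optional


def court_label(
    court_name: str, office: Optional[str], offices: Mapping[str, str]
) -> str:
    """Append the division name when the office code is known; else fall back."""
    division = offices.get(office) if office is not None else None
    if division is None:
        return court_name  # unchanged behavior when nothing is known
    return f"{court_name} ({division})"


# Two of the office codes quoted earlier in this thread.
flsd_offices = {"0": "Ft Lauderdale", "1": "Miami"}

print(court_label("Southern District of Florida", "1", flsd_offices))
print(court_label("Southern District of Florida", "7", flsd_offices))
print(court_label("Southern District of Florida", None, flsd_offices))
```

Because unknown or missing codes leave the label untouched, the same function could be applied to bankruptcy and district dockets alike.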

v-anne commented 4 weeks ago

@mlissner @johnhawkinson, I just want to follow up on this, and I want to endorse scraping courtInfo.pl. It seems both accurate and up to date for a dozen courts I randomly checked. I think the information about divisions could be stored either in a new model or as one or more new columns in the Court model.

What do you think? I'm happy to tackle this. It seems attainable.

EDIT: I might've spoken too soon. There seem to be thousands of cases in CourtListener that have 0 as their prefix, but none of the ECF sites seem to explain what the 0 is for. Still, I think the vast majority of cases would be served by scraping and accounting for the other prefixes.

One potentially more significant issue is how to account for prefixes that have been reassigned between divisions. My sense is that it will not matter significantly as almost all cases on CourtListener have activity within the last 25 years, and I doubt many divisions have changed in that time. Still, I wanted to get your thoughts.

mlissner commented 3 weeks ago

+1 for scraping the courtInfo.pl to get this info.

As for where to put it, I'm still not sure. I think divisions are pretty analogous to our Courthouses table though, so maybe we should look at that? I don't think it'll work to just add it to the Court table though, since we'd need to add so many more courts — feels like the wrong layer of abstraction.

So let's say we put the divisions into the Courthouse table. That's great, but we still don't have a real linkage in the DB between the docket and those divisions. We'd still want that somewhere, or else when we populate the docket page, we'll have to parse the docket number then do a lookup based on it. That's pretty kludgy.

v-anne commented 3 weeks ago

As for where to put it, I'm still not sure. I think divisions are pretty analogous to our Courthouses table though, so maybe we should look at that? I don't think it'll work to just add it to the Court table though, since we'd need to add so many more courts — feels like the wrong layer of abstraction.

I looked at the Courthouses table, and I worry that it would require significant manual data annotation to put the courtInfo.pl data in it. Someone would have to look at each court and then determine which office code pairs with which courthouse. Would it be worse to make a new table focused on this data alone?

As an aside, I don't know what existing data in the Courthouses table looks like and would appreciate if you could provide a sample. Also, what parts of courtInfo.pl are we scraping? I assume the Courthouses table already includes the Court Locations information and aside from Court Offices, everything else doesn't seem that important to scrape. Do you agree with that assessment?

So let's say we put the divisions into the Courthouse table. That's great, but we still don't have a real linkage in the DB between the docket and those divisions. We'd still want that somewhere, or else when we populate the docket page, we'll have to parse the docket number then do a lookup based on it. That's pretty kludgy.

I agree that wouldn't be the best solution, and it likely wouldn't allow users to search for a particular division either (e.g., find cases filed in the Riverside Division of the Central District of California without knowing the office code). I think an ideal solution would include allowing users to search for divisions via the jurisdiction picker. For example, the Central District of California's three divisions could be nested under the Central District of California, with all divisions checked by default. However, I recognize that might make the page an eyesore that isn't performant due to how many divisions there are across the country.

johnhawkinson commented 3 weeks ago

Stepping back a moment, I thought I had asked this in August, but it looks like I didn't, or maybe I did in a different ticket, but anyhow:

How is this information useful? How and why do we envision users using the division information usefully?

I have some difficulty figuring it out. It's pretty rare that I want to limit a search for a case I'm unfamiliar with to a division; generally a district is the right abstraction. And if I am already looking at a case, I don't need to decode the division: the docket text of, say, a hearing scheduling event will tell me the particular courthouse in the same breath as it tells me the courtroom number.

So, ultimately, this feels slightly more useful than the fact that we look up judges' middle initials and replace them with full middle names (honestly a thing I wish we didn't do because I find it jarring and distracting when I am used to seeing them as First M. Last (FML)). Purely cosmetic is stretching it, but is there a large enough use case to justify the work?

p.s.: The jurisdiction search picker is difficult and complex enough already. I suppose searching by division perhaps should be a consideration in redesign efforts, but I think that problem may be too hard already without increasing its difficulty level.

mlissner commented 3 weeks ago

Someone would have to look at each court and then determine which office code pairs with which courthouse. Would it be worse to make a new table focused on this data alone?

Actually, I think I was wrong about this. I think on the case law side, what we do is add each division of each court to the court table, and the courthouse table is just for actual courthouses. Then, we use the parent_id field to establish parent-child relationships within the court table between courts and their divisions.

So, imagine that we have a legal decision from California Northern District. That'd go in cand. Simple enough. Now imagine that we know it's in the San Francisco division. That'd go in (notionally), cand_sf, or something like that, and cand_sf would indicate that cand was its parent.

When people do searches in cand, we also search cand's child courts (in this case cand_sf), and return those results. Same for cand_sf, but it wouldn't have any child courts, so users would just get results in its division, specifically.
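The parent-child behavior described above can be sketched without any database. This is an illustration only, not CourtListener's actual model: the `Court` class is a stand-in, `cand_sf` is the notional ID from the comment, `cand_oak` is a hypothetical sibling, and the sketch assumes the parent links contain no cycles.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Court:
    id: str
    parent_id: Optional[str] = None  # None for a top-level court


COURTS = [
    Court("cand"),
    Court("cand_sf", parent_id="cand"),   # notional division ID
    Court("cand_oak", parent_id="cand"),  # hypothetical division ID
]


def search_scope(court_id: str, courts: List[Court]) -> List[str]:
    """Return court_id followed by all of its descendant courts."""
    children: Dict[str, List[str]] = {}
    for court in courts:
        if court.parent_id is not None:
            children.setdefault(court.parent_id, []).append(court.id)

    scope: List[str] = []
    stack = [court_id]
    while stack:
        current = stack.pop()
        scope.append(current)
        # Reverse so that children come out in their original order.
        stack.extend(reversed(children.get(current, [])))
    return scope


print(search_scope("cand", COURTS))     # the district plus its divisions
print(search_scope("cand_sf", COURTS))  # only the division itself
```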

This meets a few goals:

We have the ability to put things in the correct divisions, if we know that information.
We didn't have to add a divisions table, which would add complexity to a lot of things.
Users can search the court or the division.

It is a bit weird though, and if we learn the division info about a case after the fact, the court_id field for it will change. That's weird.

I don't know what existing data in the Courthouses table looks like and would appreciate if you could provide a sample

https://storage.courtlistener.com/bulk-data/courthouses-2024-10-31.csv.bz2

what parts of courtInfo.pl are we scraping? I assume the Courthouses table already includes the Court Locations information and aside from Court Offices, everything else doesn't seem that important to scrape. Do you agree with that assessment?

Some investigation is probably needed here. I think the courthouse locations probably aren't in our system since so far we've only populated the courthouse table on an ad-hoc, as needed basis. But even doing that isn't really needed if we use the approach above (which doesn't require anything of the courthouse table).

I think an ideal solution would include allowing users to search for divisions via the jurisdiction picker. For example, the Central District of California's three divisions could be nested under the Central District of California, with all divisions checked by default. However, I recognize that might make the page an eyesore that isn't performant due to how many divisions there are across the country.

Yeah, the way it's automatically handled now is sort of addressing the eyesore problem while allowing the granularity if people are clever enough to do a search like court:cand_sf or whatever.

How is this information useful?

Off the top of my head:

1. I think there's a human piece to it. It's one thing to say it's in cand, it's another to say California Northern District, and it's yet another to say San Francisco or Oakland.
2. Years ago, I wanted to go sit in on a hearing or something and I didn't know how to figure out where it was. If the page said "Oakland" at the top, that'd have helped.
3. Trend analysis and searching? Everybody is paying attention to particularly conservative judges right now. It'd be nice to be able to query them in a better way?

I think I'm most convinced by number 2, but happy to hear other thoughts.

freelawproject / courtlistener

Label district court divisions based on case number prefixes #4372