freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com
Other
530 stars 144 forks

Judge first and last names are alphabetized together #3844

Open mlissner opened 6 months ago

mlissner commented 6 months ago

A client has identified two opinions where the judge data is very wrong.

In the first:

https://www.courtlistener.com/opinion/802753/united-states-v-kevin-snulligan/

The judges are:

But our judge field shows:

David, Diamond, Easterbrook, Frank, Hamilton, Ilana, Rovner

The second is basically the same:

https://www.courtlistener.com/opinion/179288/united-states-v-pescatore-chavis/

The judges are:

But our judge field shows:

Hall, Joseph, Katzmann, McLAUGHLIN, Peter, Robert

Yikes.

We need to urgently figure out what caused this and how widespread the damage is. Remember that we can look in and restore from our history tables.

mlissner commented 6 months ago

Looks like the second one was sourced from Harvard and scraping. Here's the Harvard source (it's correct): https://cite.case.law/f-appx/400/596/

mlissner commented 6 months ago

I found the main problem in find_all_judges, a function called in the Harvard import, which suggests that this may affect a very large number of cases. find_all_judges does some cleanup, then splits the words using a regex. At its end, on line 298, it sorts the words of the input, which mixes first and last names together:

https://github.com/freelawproject/courtlistener/blob/0fe663de6cdcb13e98abbfa751332e1742addd58/cl/people_db/lookup_utils.py#L265-L299

For example:

In [30]: find_all_judges('Present: JOSEPH M. McLAUGHLIN, ROBERT A. KATZMANN and PETER W. HALL, Circuit Judges.')
Out[30]: ['HALL', 'JOSEPH', 'KATZMANN', 'McLAUGHLIN', 'PETER', 'ROBERT']
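To make the failure mode concrete, here is a minimal sketch. It is not the code in lookup_utils.py: `sorted_tokens` and `last_names_in_order` are hypothetical names, and the noise-word list and regexes are assumptions chosen only to handle this one input. The first function tokenizes and then sorts, which reproduces the output above. The second splits on name boundaries first and keeps the final token of each name, which preserves the panel order.

```python
import re

SAMPLE = (
    "Present: JOSEPH M. McLAUGHLIN, ROBERT A. KATZMANN "
    "and PETER W. HALL, Circuit Judges."
)


def sorted_tokens(judge_str):
    """Hypothetical reduction of the bug: tokenize, drop noise, then sort.

    Sorting the individual words discards which first name belonged to
    which last name.
    """
    noise = {"present", "and", "circuit", "judges", "judge"}  # assumed list
    words = re.findall(r"[A-Za-z']+", judge_str)
    # Drop single-letter initials and the assumed noise words.
    kept = [w for w in words if len(w) > 1 and w.lower() not in noise]
    return sorted(set(kept), key=str.lower)


def last_names_in_order(judge_str):
    """Hypothetical alternative: split into names first, keep each last token."""
    # Strip a leading label such as "Present:" or "Before:".
    body = re.sub(r"^\s*(Present|Before)\s*:\s*", "", judge_str, flags=re.I)
    # Strip a trailing title such as ", Circuit Judges."
    body = re.sub(
        r",?\s*(Circuit|District)?\s*Judges?\.?\s*$", "", body, flags=re.I
    )
    # Names are separated by commas or by the word "and".
    names = re.split(r"\s*,\s*|\s+and\s+", body)
    return [n.split()[-1] for n in names if n.strip()]


print(sorted_tokens(SAMPLE))
print(last_names_in_order(SAMPLE))
```

The second function is deliberately naive. Suffixes such as "Jr." or multi-word surnames would need extra handling, so treat it as an illustration of the ordering problem rather than a proposed fix.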

Oof. We do have tests for it, but I guess they didn't catch this.

And, unfortunately, that's not all I found tonight. find_all_judges calls wrap_text. I'm not sure what wrap_text is supposed to do. It says it should "Wrap text to specified length without cutting words", but...it cuts words and I don't see how it wraps text?

In [3]: wrap_text(20, " Hall, Joseph, Katzmann, McLAUGHLIN, Peter, Robert")
Out[3]: ' Hall, Joseph, Katzmann,'

It's used in a couple places we'll need to check and maybe fix, but I still don't know what it's for?

And finally, because I'm on a bit of a sad-roll, I also note that wrap_text lacks tests, even though they'd have probably caught this and would have been easy to write. :(
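As a rough idea of what such tests could look like, here is a sketch that checks two properties implied by "Wrap text to specified length without cutting words": no line exceeds the length, and no word is lost or split. Everything here is an assumption: `wrap_lines` is a hypothetical stand-in built on the standard library's `textwrap.wrap` (not the project's wrap_text), and `check_wrap` and `reported_behavior` are illustrative helpers. Only the `(length, text)` argument order is taken from the example above.

```python
import textwrap

TEXT = "Hall, Joseph, Katzmann, McLAUGHLIN, Peter, Robert"


def wrap_lines(length, text):
    """Hypothetical stand-in: split text into lines without splitting words."""
    return textwrap.wrap(
        text, width=length, break_long_words=False, break_on_hyphens=False
    )


def check_wrap(wrap, length, text):
    """Property checks a wrapping helper should satisfy for ordinary input."""
    lines = wrap(length, text)
    # Every line fits within the requested length.
    assert all(len(line) <= length for line in lines), "line too long"
    # Rejoining the lines gives back exactly the original words, in order.
    assert " ".join(lines).split() == text.split(), "words lost or cut"


def reported_behavior(length, text):
    """Returns the output shown in this thread, to show the checks catch it."""
    return [" Hall, Joseph, Katzmann,"]


check_wrap(wrap_lines, 20, TEXT)  # passes for the stand-in

try:
    check_wrap(reported_behavior, 20, TEXT)
    caught = False
except AssertionError:
    caught = True
print("reported output rejected:", caught)
```

The reported output is 24 characters long for a requested length of 20 and drops the last three names, so either property check alone would have flagged it.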

I think this will affect a lot of Harvard content, so let's see if we can fix it and learn from our mistakes.

flooie commented 6 months ago

@quevon24 can you dive into this asap?

flooie commented 6 months ago

Okay, let's assess the damage.

  1. First, I don't think this occurred in the regular import (at least not yet). I tested the code in the Harvard opinion import and it correctly extracted the judges.