freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Lawyer data wrong in Solr #660

Closed mlissner closed 7 years ago

mlissner commented 7 years ago

A user pointed out today that searching for their name in the RECAP archive brings up lots of cases they weren't involved in:

https://www.courtlistener.com/?type=r&q=attorney%3A%22ansel+halliburton%22~2&type=r&order_by=score+desc

Sure enough, their name is listed in all the results. I have no idea how this could have happened. It's either an error importing the data (bad), an error in the XML from RECAP (bad), or an error in putting the data into Solr from the DB (also bad!).

It's very unlikely this is an easy or small bug to fix.

voutilad commented 7 years ago

Is there a way to do a raw Solr query with hit highlighting turned on or something?

mlissner commented 7 years ago

Not without an SSH tunnel, no, but I can help you set that up if you want. You can see though in the link above that Ansel is coming up as an attorney for all the cases. He tells me that he's not the attorney on those cases.

mlissner commented 7 years ago

Oh, and I don't think we can do highlighting in the attorney field anyway. The data structures for that made the index much bigger on disk.
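
For reference, a raw query of this kind could be built roughly as follows. This is only a sketch: the host and port assume an SSH tunnel forwarding Solr to localhost:8983, the core name `recap` is a hypothetical placeholder, and the `id` field in the field list is assumed to exist. Highlighting is left off, per the note above; the query string is the same one used in the search link earlier in this thread.

```python
from urllib.parse import urlencode

# Assumptions: Solr is reachable through an SSH tunnel on localhost:8983,
# and the core is named "recap" (hypothetical; substitute the real core name).
SOLR_SELECT_URL = "http://localhost:8983/solr/recap/select"


def build_raw_query_url(attorney_name, slop=2, rows=10):
    """Build a Solr select URL for a proximity search on the attorney field."""
    params = {
        # Same form as the query in the search link: attorney:"..."~2
        "q": f'attorney:"{attorney_name}"~{slop}',
        # Only return the fields of interest ("id" is an assumed field name).
        "fl": "id,attorney,attorney_id",
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(build_raw_query_url("ansel halliburton"))
```

The resulting URL can be fetched with any HTTP client once the tunnel is up.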

voutilad commented 7 years ago

Hmm. Well I may have some random time to poke around this week. Should I generate a keypair and send you the public key?

mlissner commented 7 years ago

Sure, that'd work.

voutilad commented 7 years ago

Well, the good news is Solr is working... the query is matching documents in the index. It seems something is wrong with the logic feeding the documents to the index.

mlissner commented 7 years ago

I'm afraid that's probably bad news. Reindexing this data ain't easy or fast.

voutilad commented 7 years ago

Looks like a lot of repeated attorney and attorney_id values. The IDs look to be:

[56384, 397508, 262664, 397038, 262845, 72593, 97202, 53013, 72604, 72605, 72606]

2851 of the results have a 'filepath_local' field set, and multiple results share a value (below is a map of path value to count). I'm not sure what this field is. It's not populated across all the results, and when it is present it doesn't seem to have much uniqueness.

{
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202294.docket.xml': 12,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202466.docket.xml': 7,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202293.docket.xml': 7,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.200468.docket.xml': 10,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.flmd.277288.docket.xml': 9,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.257409.docket.xml': 12,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.245070.docket.xml': 15,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.241626.docket.xml': 14,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.240498.docket.xml': 31,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mad.146587.docket.xml': 10,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.300502.docket.xml': 31,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txwd.700605.docket.xml': 27,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.215815.docket.xml': 28,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.199669.docket.xml': 40,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.198554.docket.xml': 45,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194023.docket.xml': 19,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194148.docket.xml': 10,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.243803.docket.xml': 32,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nysd.383532.docket.xml': 2,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.242118.docket.xml': 33,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194164.docket.xml': 15,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.320305.docket.xml': 38,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nyed.327290.docket.xml': 2,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.moed.122709.docket.xml': 23,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.195237.docket.xml': 54,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234002.docket.xml': 7,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.178550.docket.xml': 202,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110659.docket.xml': 30,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ohsd.168529.docket.xml': 76,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220269.docket.xml': 51,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.237927.docket.xml': 12,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.231380.docket.xml': 51,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220272.docket.xml': 33,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220392.docket.xml': 14,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nced.86342.docket.xml': 2,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.364736.docket.xml': 10,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.494538.docket.xml': 170,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.143958.docket.xml': 35,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.moed.122997.docket.xml': 32,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110912.docket.xml': 32,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110913.docket.xml': 29,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.237930.docket.xml': 9,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.283835.docket.xml': 34,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.117108.docket.xml': 35,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.117113.docket.xml': 26,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.362638.docket.xml': 11,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235638.docket.xml': 57,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wvnd.18586.docket.xml': 7,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.240430.docket.xml': 7,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nced.154021.1.2.pdf': 1,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234007.1.0.pdf': 1,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234007.16.0.pdf': 1,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220345.docket.xml': 25,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220346.docket.xml': 36,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.209993.docket.xml': 45,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220344.docket.xml': 23,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.143983.docket.xml': 8,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.142810.docket.xml': 53,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mnd.132795.docket.xml': 39,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.200994.docket.xml': 82,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.200997.docket.xml': 20,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.201002.docket.xml': 12,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235209.docket.xml': 32,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txsd.1387141.docket.xml': 1,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235213.docket.xml': 38,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.202457.docket.xml': 60,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.212874.docket.xml': 67,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.225089.docket.xml': 27,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202100.docket.xml': 16,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.481253.docket.xml': 65,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.228588.docket.xml': 75,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.221830.docket.xml': 115,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.431729.docket.xml': 15,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txsd.1370248.docket.xml': 57,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.581122.docket.xml': 41,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.572546.docket.xml': 25,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.428785.docket.xml': 22,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.624234.docket.xml': 10,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.iasd.49760.docket.xml': 43,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.iasd.50076.docket.xml': 9,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.549424.docket.xml': 16,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.vawd.69351.docket.xml': 16,
 '/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.184511.docket.xml': 257
 }
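
A tally like the one above can be produced with a few lines of Python. This is a minimal sketch, assuming the Solr results have already been loaded as a list of dicts; the sample documents and their `id` values are made up for illustration and do not reproduce the real counts.

```python
from collections import Counter

RECAP_DIR = "/var/www/courtlistener/cl/assets/media/recap"

# Hypothetical sample documents; only some carry a filepath_local value.
results = [
    {"id": 1, "filepath_local": f"{RECAP_DIR}/gov.uscourts.txnd.202294.docket.xml"},
    {"id": 2, "filepath_local": f"{RECAP_DIR}/gov.uscourts.txnd.202294.docket.xml"},
    {"id": 3, "filepath_local": f"{RECAP_DIR}/gov.uscourts.cand.184511.docket.xml"},
    {"id": 4},  # no filepath_local set
]


def count_filepaths(docs):
    """Map each filepath_local value to the number of documents sharing it."""
    return Counter(
        doc["filepath_local"] for doc in docs if doc.get("filepath_local")
    )


counts = count_filepaths(results)
for path, count in counts.most_common():
    print(count, path)
```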

mlissner commented 7 years ago

The filepath_local refers to an XML file that has information about the entire docket, so it will be the same across every document in the docket. That, I think, makes sense. You can translate these paths to paths on Internet Archive if you want to see examples. For example:

/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202294.docket.xml

Becomes:

https://archive.org/download/gov.uscourts.txnd.202294/gov.uscourts.txnd.202294.docket.xml

(That link will 302 redirect you to the correct server on Internet Archive, but don't trust the final redirected URL since it can change.)
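
The translation above can be sketched as a small helper. The function name is hypothetical, and the sketch only covers `.docket.xml` files, since that is the only mapping shown here; it rejects anything else rather than guessing.

```python
import posixpath

IA_DOWNLOAD_ROOT = "https://archive.org/download"
DOCKET_SUFFIX = ".docket.xml"


def local_path_to_ia_url(filepath_local):
    """Translate a local docket XML path to its Internet Archive download URL."""
    filename = posixpath.basename(filepath_local)
    if not filename.endswith(DOCKET_SUFFIX):
        # Other file types (e.g. PDFs) are not covered by this sketch.
        raise ValueError(f"Not a docket XML path: {filepath_local}")
    # The Internet Archive item name is the filename without the suffix.
    item = filename[: -len(DOCKET_SUFFIX)]
    return f"{IA_DOWNLOAD_ROOT}/{item}/{filename}"


print(
    local_path_to_ia_url(
        "/var/www/courtlistener/cl/assets/media/recap/"
        "gov.uscourts.txnd.202294.docket.xml"
    )
)
```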

mlissner commented 7 years ago

I dug a bit on this issue tonight using this case as my example:

https://www.courtlistener.com/docket/4390575/serious-bidness-llc-v-does-1/

This case is on Internet Archive at:

https://ia600209.us.archive.org/20/items/gov.uscourts.txnd.202294/gov.uscourts.txnd.202294.docket.xml

It's pretty simple, just a few parties and one attorney. The data for this case is wrong in the DB.

The likely source of the error is the code here:

https://github.com/freelawproject/courtlistener/blob/6cfe9c4f59d24197cd21d5bd9bf833901c152ae5/cl/lib/pacer.py#L338-L503

It's the ~150 lines that pull the data from the XML and save it to the DB. I skimmed it already, but didn't see any errors. More auditing needed... and probably some unit tests.

mlissner commented 7 years ago

So. Good news! The database side of this is fine, but the Solr side isn't. The issue is that when we added items to Solr, we'd take a party, in this case "Doe 1", and ask, "Who were the attorneys for Doe 1?" For generic parties like this one, that would bring back lots of attorneys from unrelated cases, and that was the problem. What we should have been doing (and what we do now) is ask, "Who were the attorneys for Doe 1 in this docket?" That brings back a smaller and more accurate set of results.

This problem also applied to firms, which went from "Which firms was this attorney at?" to "Which firms was this attorney at while working on this case?"
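
To illustrate the difference, here is a toy sketch in plain Python rather than the actual Django query. The `Role` tuple, the function names, and all of the sample data are hypothetical; the only point is to show how scoping the lookup to a docket narrows the results.

```python
from collections import namedtuple

# Hypothetical stand-in for a row linking a party, an attorney, and a docket.
Role = namedtuple("Role", ["docket_id", "party", "attorney"])

roles = [
    Role(1, "Doe 1", "Attorney A"),
    Role(2, "Doe 1", "Attorney B"),
    Role(3, "Doe 1", "Attorney C"),
    Role(1, "Plaintiff LLC", "Attorney D"),
]


def attorneys_for_party(roles, party):
    """Old behavior: every attorney for this party name, across all dockets."""
    return sorted({r.attorney for r in roles if r.party == party})


def attorneys_for_party_in_docket(roles, party, docket_id):
    """Fixed behavior: only attorneys for this party within one docket."""
    return sorted(
        {
            r.attorney
            for r in roles
            if r.party == party and r.docket_id == docket_id
        }
    )


print(attorneys_for_party(roles, "Doe 1"))
print(attorneys_for_party_in_docket(roles, "Doe 1", 1))
```

The same scoping idea applies to firms: filter by the case as well as by the attorney.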

This fix, aside from making our data correct, will also be nice because we should have significantly less duplication in Solr, and so our index should shrink noticeably. I also optimized a query pretty thoroughly, so I expect some improvements there too.

I'll be reindexing the RECAP content soon. When that happens, the data side of this will be fixed.

mlissner commented 7 years ago

Btw, this is now the most complicated Django query I've probably ever written:

[Screenshot from 2017-05-11 14-57-22: the Django query]

mlissner commented 7 years ago

This was only fixed when updating by RECAPDocument, not when updating by Docket. Doh. Just fixed that as well.

mlissner commented 7 years ago

And this fix brought our index size down by half! It's down to 150GB, which is a fantastic improvement.