Closed mlissner closed 7 years ago
Is there a way to do a raw Solr query with hit highlighting turned on or something?
Not without an SSH tunnel, no, but I can help you set that up if you want. You can see though in the link above that Ansel is coming up as an attorney for all the cases. He tells me that he's not the attorney on those cases.
Oh, and I don't think we can do highlighting in the attorney field anyway. The data structures for that made the index much bigger on disk.
Hmm. Well I may have some random time to poke around this week. Should I generate a keypair and send you the public key?
Sure, that'd work.
Well, the good news is Solr is working...the query is matching documents in the index. It seems something is wrong with the logic feeding the documents to the index.
I'm afraid that's probably bad news. Reindexing this data ain't easy or fast.
Looks like a lot of repeated attorney and attorney_id values. The IDs look to be:
[56384, 397508, 262664, 397038, 262845, 72593, 97202, 53013, 72604, 72605, 72606]
2851 of the results have a 'filepath_local' field set, and multiple results share a value (see the map of path value to count below). I'm not sure what this field is. It's not populated across all the results, and when it is present it doesn't seem to have much uniqueness.
{
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202294.docket.xml': 12,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202466.docket.xml': 7,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202293.docket.xml': 7,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.200468.docket.xml': 10,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.flmd.277288.docket.xml': 9,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.257409.docket.xml': 12,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.245070.docket.xml': 15,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.241626.docket.xml': 14,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.240498.docket.xml': 31,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mad.146587.docket.xml': 10,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.300502.docket.xml': 31,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txwd.700605.docket.xml': 27,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.215815.docket.xml': 28,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.199669.docket.xml': 40,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.198554.docket.xml': 45,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194023.docket.xml': 19,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194148.docket.xml': 10,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.243803.docket.xml': 32,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nysd.383532.docket.xml': 2,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.242118.docket.xml': 33,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.gand.194164.docket.xml': 15,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.320305.docket.xml': 38,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nyed.327290.docket.xml': 2,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.moed.122709.docket.xml': 23,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.195237.docket.xml': 54,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234002.docket.xml': 7,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.178550.docket.xml': 202,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110659.docket.xml': 30,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ohsd.168529.docket.xml': 76,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220269.docket.xml': 51,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.237927.docket.xml': 12,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.231380.docket.xml': 51,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220272.docket.xml': 33,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220392.docket.xml': 14,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nced.86342.docket.xml': 2,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.364736.docket.xml': 10,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.494538.docket.xml': 170,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.143958.docket.xml': 35,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.moed.122997.docket.xml': 32,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110912.docket.xml': 32,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.110913.docket.xml': 29,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.237930.docket.xml': 9,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ilnd.283835.docket.xml': 34,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.117108.docket.xml': 35,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.ord.117113.docket.xml': 26,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.362638.docket.xml': 11,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235638.docket.xml': 57,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wvnd.18586.docket.xml': 7,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.240430.docket.xml': 7,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.nced.154021.1.2.pdf': 1,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234007.1.0.pdf': 1,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.234007.16.0.pdf': 1,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220345.docket.xml': 25,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220346.docket.xml': 36,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.209993.docket.xml': 45,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.220344.docket.xml': 23,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.143983.docket.xml': 8,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cod.142810.docket.xml': 53,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mnd.132795.docket.xml': 39,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.200994.docket.xml': 82,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.200997.docket.xml': 20,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.mdd.201002.docket.xml': 12,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235209.docket.xml': 32,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txsd.1387141.docket.xml': 1,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.235213.docket.xml': 38,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.202457.docket.xml': 60,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.212874.docket.xml': 67,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.225089.docket.xml': 27,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202100.docket.xml': 16,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.481253.docket.xml': 65,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.228588.docket.xml': 75,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.wawd.221830.docket.xml': 115,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.431729.docket.xml': 15,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txsd.1370248.docket.xml': 57,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.581122.docket.xml': 41,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.572546.docket.xml': 25,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.428785.docket.xml': 22,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.624234.docket.xml': 10,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.iasd.49760.docket.xml': 43,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.iasd.50076.docket.xml': 9,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cacd.549424.docket.xml': 16,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.vawd.69351.docket.xml': 16,
'/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.cand.184511.docket.xml': 257
}
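For reference, a tally like the one above can be produced in a few lines. This is only a minimal sketch: it assumes the Solr results have already been fetched and parsed into a list of dicts, and the function name and sample data are made up for illustration.

```python
from collections import Counter


def count_field_values(results, field="filepath_local"):
    """Count how often each value of `field` occurs.

    Results that lack the field (or have it empty) are skipped.
    """
    return Counter(r[field] for r in results if r.get(field))


# Hypothetical sample, shaped like parsed Solr result documents.
sample = [
    {"id": 1, "filepath_local": "/recap/gov.uscourts.txnd.202294.docket.xml"},
    {"id": 2, "filepath_local": "/recap/gov.uscourts.txnd.202294.docket.xml"},
    {"id": 3},  # no filepath_local set
]

counts = count_field_values(sample)
for path, n in counts.most_common():
    print(path, n)
```

The sum of the counts gives the number of results that have the field set at all, which is how a figure like 2851 would fall out.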
The filepath_local refers to an XML file that has information about the entire docket, so it will be the same across every document in the docket. That, I think, makes sense. You can translate these paths to paths on Internet Archive if you want to see examples. For example:
/var/www/courtlistener/cl/assets/media/recap/gov.uscourts.txnd.202294.docket.xml
Becomes:
https://archive.org/download/gov.uscourts.txnd.202294/gov.uscourts.txnd.202294.docket.xml
(That link will 302 redirect you to the correct server on Internet Archive, but don't trust the final redirected URL since it can change.)
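If you want to translate a batch of these, the mapping can be sketched roughly like this. It assumes the Internet Archive item name is always the first four dot-separated parts of the filename (gov.uscourts.<court>.<case number>), which is inferred from the docket XML example above and not verified for the PDF paths. The function name is made up.

```python
import posixpath

IA_DOWNLOAD_ROOT = "https://archive.org/download"


def local_path_to_ia_url(local_path):
    """Translate a filepath_local value into an Internet Archive download URL.

    Assumption: the item name is the first four dot-separated parts of the
    filename, e.g. gov.uscourts.txnd.202294.
    """
    filename = posixpath.basename(local_path)
    item_name = ".".join(filename.split(".")[:4])
    return f"{IA_DOWNLOAD_ROOT}/{item_name}/{filename}"


url = local_path_to_ia_url(
    "/var/www/courtlistener/cl/assets/media/recap/"
    "gov.uscourts.txnd.202294.docket.xml"
)
print(url)
```

As noted, use the archive.org/download URL itself and not whatever it redirects to.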
I dug a bit on this issue tonight using this case as my example:
https://www.courtlistener.com/docket/4390575/serious-bidness-llc-v-does-1/
This case is on Internet Archive at:
It's pretty simple, just a few parties and one attorney. The data for this case is wrong in the DB.
The likely source of the error is the code here:
It's the ~150 lines that pull the data from the XML and save it to the DB. I skimmed it over a bit already, but didn't see any errors. More auditing needed...and probably some unit tests.
So. Good news! The database side of this is fine, but the Solr side of it isn't. The issue is that when we added stuff we'd take a party, in this case "Doe 1", and we'd ask, "Who were the attorneys for Doe 1?" For generic parties like this one, that would bring back lots of attorneys from unrelated cases, and that was the problem. What we should have been doing (and what we do now) is ask, "Who were the attorneys for Doe 1 in this docket?" That brings back a smaller and more accurate set of results.
This problem also applied to firms, which went from "Which firms was this attorney at?" to "Which firms was this attorney at while working on this case?"
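To make the difference concrete, here's a toy illustration in plain Python. It is not the actual CourtListener code or query; the Role tuple, the function names, and the data are all hypothetical, and it only assumes that each attorney-party relationship is recorded together with the docket it belongs to.

```python
from collections import namedtuple

# Hypothetical record: one attorney representing one party in one docket.
Role = namedtuple("Role", ["party", "attorney", "docket_id"])

roles = [
    Role("Doe 1", "Attorney A", 100),
    Role("Doe 1", "Attorney B", 200),
    Role("Doe 1", "Attorney C", 300),
]


def attorneys_for_party(roles, party):
    """Old behavior: every attorney who ever represented this party name."""
    return sorted({r.attorney for r in roles if r.party == party})


def attorneys_for_party_in_docket(roles, party, docket_id):
    """Fixed behavior: only attorneys for this party in this docket."""
    return sorted(
        {
            r.attorney
            for r in roles
            if r.party == party and r.docket_id == docket_id
        }
    )


print(attorneys_for_party(roles, "Doe 1"))
print(attorneys_for_party_in_docket(roles, "Doe 1", 100))
```

The unscoped lookup returns all three attorneys for a document in docket 100, while the scoped one returns only Attorney A. The same kind of extra filter applies to the firm lookup.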
This fix, aside from making our data correct, will also be nice because we should have significantly less duplication in Solr, and so our index should shrink noticeably. I also optimized a query pretty thoroughly, so I expect some improvements there too.
I'll be reindexing the RECAP content soon. When that happens, the data side of this will be fixed.
Btw, this is now the most complicated Django query I've probably ever written:
This was only fixed when updating by RECAPDocument, not when updating by Docket. Doh. Just fixed that as well.
And this fix brought our index size down by half! It's down to 150GB, which is a fantastic improvement.
A user pointed out today that searching for their name in the RECAP archive brings up lots of cases they weren't involved in:
https://www.courtlistener.com/?type=r&q=attorney%3A%22ansel+halliburton%22~2&type=r&order_by=score+desc
Sure enough, their name is listed in all the results. I have no idea how this could have happened. It's either an error importing the data (bad), an error in the XML from RECAP (bad), or an error in putting the data into Solr from the DB (also bad!).
It's very unlikely this is an easy or small bug to fix.