flooie opened this issue 10 months ago
Can you append the code you used to find the multiple cases on the same page? @quevon24
Can you check whether the docket number is available for each of your findings? I think it would be instructive. I would have assumed it would have been checked against both in the original matching code.
Yesterday I talked to @flooie about what could be done to solve these problems. He suggested we write a command to test the list of matches we have, without modifying the system, that just logs whether the matches in the file are the same as the ones we would get by running everything directly on the server, so we can separate what matches from what doesn't.
The command is ready, you can see it here for now: https://gist.github.com/quevon24/bdaa1b2d860878cb77ee9424b9c64d34
After this, I was cloning some cases to test how the command worked and validate that the matches were correct.
I also tried it with the cases we know were incorrectly matched, and indeed the current code matched them despite being incorrect.
As mentioned before, part of the problem is that you can have several cases on the same page, and in very particular cases those cases have similar names and almost identical opinions. Thinking about this, I figured that when this happens we could evaluate all the possible matches and pick the one that is most similar overall (perhaps based on the case name).
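To illustrate the idea, here is a minimal sketch (not the actual command) of ranking the candidate clusters that share a page/citation by case-name similarity and keeping the closest one; the candidate structure and the 0.7 threshold are assumptions for the example.

```python
# Minimal sketch: rank candidate clusters that share a citation/page by
# case-name similarity and keep the closest one (if it is close enough).
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0-1 similarity ratio between two case names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pick_best_candidate(xml_case_name, candidates, min_similarity=0.7):
    """candidates: [{"cluster_id": ..., "case_name": ...}, ...] (assumed shape).

    Returns the candidate whose case name is most similar to the XML case name,
    or None if even the best one falls below min_similarity.
    """
    best = max(
        candidates, key=lambda c: name_similarity(xml_case_name, c["case_name"])
    )
    best_score = name_similarity(xml_case_name, best["case_name"])
    return best if best_score >= min_similarity else None
```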
There is something important to mention: not all XML files have citations or a filed date, so the command as it currently stands will ignore those files. To address them we need to add another function that looks for matching cases in the system using the court, the filed date, the docket number, and the content of the opinion.
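A hedged sketch of what that fallback lookup could look like: narrow the candidates by court, filed date, and docket number, then confirm with the opinion text. The model and field names (OpinionCluster, date_filed, docket__court_id, docket__docket_number, sub_opinions, plain_text) reflect my reading of the CourtListener schema and should be double-checked before relying on them.

```python
# Hedged sketch of the fallback for XML files without citations.
from difflib import SequenceMatcher

from cl.search.models import OpinionCluster  # assumed import path


def find_cluster_without_citation(court_id, filed_date, docket_number, xml_text):
    # Narrow by the metadata the XML file does have.
    candidates = OpinionCluster.objects.filter(
        docket__court_id=court_id,
        date_filed=filed_date,
        docket__docket_number=docket_number,
    )
    for cluster in candidates:
        # The text accessor is an assumption; the real field may be one of the
        # source-specific html/plain_text columns on Opinion.
        cl_text = " ".join(op.plain_text or "" for op in cluster.sub_opinions.all())
        if SequenceMatcher(None, cl_text, xml_text).ratio() > 0.9:
            return cluster
    return None
```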
So far the two patterns I have identified to detect these incorrect matches are:
Yesterday I also updated the list with new incorrect matches
Thanks @quevon24
This is great. And one of your new examples is a great example of why this is so difficult and tricky.
new_york\court_of_appeals_opinions\documents\22b65cf3e46d5178.xml
was listed as incorrectly matched, but when I found the opinion in our system it was a Harvard opinion merged with a Lawbox opinion. Those two opinions match each other, but the Columbia data that also matches is missing the middle 95% of the opinion.
I'm not sure why, but it seems the Columbia opinion may have been a slip opinion, while the Harvard and Lawbox versions were the combined collection of the ruling plus the supporting memorandum.
For the visual learners: the highlighted part is the memorandum, which is included in both opinions in our system, while the Columbia version contains just the top and the last line.
I think we need to step back and put down more information about this import/merger
3,978,306 - opinions in Columbia never imported
730,514 - imported from Columbia
4,708,820 - opinion XML files
A few updates: I already tried to improve the process that compares case names by filtering out a set of false-positive words like "united states", "in re", "people", etc., but so far no luck. I also tried some algorithms we already have in the system, like cosine similarity, but the result is very similar to what we already get.
I tested a new approach to similarity, representing the case names as semantic vectors so the comparison considers not only the words but also the structure of the names. The problem with this approach is that it requires a trained corpus. I already tested a few, but some words in the case names are not present in them, so to make it work we would have to train a corpus on the words of the case names. I don't know how long that would take to train, or whether I could use techniques like transfer learning to avoid starting from scratch.
I also tried comparing case names after expanding the abbreviations in the names to increase the similarity and the overlap between them. It improves the result, but sometimes it is not enough to say we are 100% sure that both case names (from the file and the cluster) are the same.
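As a rough sketch of the abbreviation-expansion idea: normalize both names before comparing so that, for example, "Dept." and "Department" stop counting as a difference. The abbreviation table here is illustrative only, not the real list.

```python
# Illustrative sketch: expand common abbreviations before comparing case names.
import re
from difflib import SequenceMatcher

ABBREVIATIONS = {
    r"\bdept\.?\b": "department",
    r"\bco\.?\b": "company",
    r"\bins\.?\b": "insurance",
    r"\bassn\.?\b": "association",
}


def expand_abbreviations(case_name):
    name = case_name.lower()
    for pattern, replacement in ABBREVIATIONS.items():
        name = re.sub(pattern, replacement, name)
    return name


def expanded_similarity(name_a, name_b):
    return SequenceMatcher(
        None, expand_abbreviations(name_a), expand_abbreviations(name_b)
    ).ratio()
```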
This part is essential in some edge cases, because when we have a case with the same court, same citation, same filed date, and even very similar opinion content, the only thing that distinguishes one match from another is the case name. For example, compare cluster id 5834449 with this file: new_york\supreme_court_appellate_division_opinions\documents\27f1df354367d4c5.xml
Content from cl:
Appeal from judgment of the County Court, Westchester County, convicting appellant of the crime of burglary in the third degree, and from intermediate orders. Judgment unanimously affirmed. No opinion. No separate appeal lies from the intermediate orders, which have been reviewed on the appeal from the judgment of conviction. Present — Nolan, P. J., Wenzel, Beldock, Murphy and Kleinfeld, JJ.
Content from xml file:
Appeal from judgment of the County Court, Orange County, convicting appellant of the crime of assault in the second degree, and from intermediate orders. Judgment unanimously affirmed. No opinion. No separate appeal lies from the intermediate orders, which have been reviewed on the appeal from the judgment of conviction. Present — Nolan, P.J., Wenzel, Beldock, Murphy and Kleinfeld, JJ.
It seems as if they filled out a template and just filled in the blanks.
When comparing both texts, the algorithm that compares the documents gives us a 90% match. They are almost identical, and there has to be a threshold to decide whether they are similar or not.
So the only thing left is to check that the names are different:
"People v. Grubbs" from cluster and "PEOPLE v. FRYER" from file.
If we only use word overlap, it will give us "people" as the overlap, and according to the current algorithm that means both cases are the same. If we compare the case names using cosine similarity instead, it will give us more than a 50% match, because the only difference between the two names is the last name.
The solution here is to add "people" to the false-positive list. The problem is that this is probably not the only word we need to add to that list, and those words have to be found by hand.
I think we're probably quite close to finding all of these kinds of words. I know Bill will have lots of thoughts about this (I feel like a passerby, so don't take me too seriously), but rather than training a corpus, why not make a simple word list by finding common words, looking at them for generic ones like "people" and then going from there?
I agree with @mlissner - I don't want to create a new corpus to do this. I think we should look at how courts do this and just replicate what they do.
I think I already have a list of words and text strings that might help. In some cases we only need to remove one word, but in other cases, like "United States v. Foo", we need to remove the entire string ("united states"). We can't remove the individual words ("united" or "states") because that could affect other case names such as "DeSandolo v. United Airlines Inc.", which would become "DeSandolo v. Airlines Inc." A small sketch of this approach follows below the lists.
Words: L714
Strings: L733
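As a rough illustration of that cleanup: strip the false-positive strings as whole phrases first, then the single false-positive words, so "United Airlines" survives while "United States" is removed. The lists here are placeholders, not the real ones linked above.

```python
# Sketch: remove false-positive strings (as whole phrases) before false-positive
# words, then compare what is left of the two case names.
import re

FALSE_POSITIVE_STRINGS = ["united states", "in re", "ex parte"]  # placeholder list
FALSE_POSITIVE_WORDS = {"people", "state", "matter"}  # placeholder list


def clean_case_name(case_name):
    name = case_name.lower()
    # Remove multi-word strings as complete phrases first.
    for phrase in FALSE_POSITIVE_STRINGS:
        name = re.sub(rf"\b{re.escape(phrase)}\b", " ", name)
    words = re.findall(r"[a-z]+", name)
    # Then drop single false-positive words and the "v"/"vs" connector.
    return {w for w in words if w not in FALSE_POSITIVE_WORDS | {"v", "vs"}}


def names_overlap(name_a, name_b):
    """True only if a meaningful word survives the cleaning in both names."""
    return bool(clean_case_name(name_a) & clean_case_name(name_b))


# "People v. Grubbs" vs "PEOPLE v. FRYER" -> no overlap once "people" is dropped.
print(names_overlap("People v. Grubbs", "PEOPLE v. FRYER"))  # False
print(names_overlap("DeSandolo v. United Airlines Inc.",
                    "DeSandolo v. United Airlines Inc."))  # True
```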
I updated the list with new incorrect matches.
I have some updates. While finding the new incorrect matches, I found that in some cases the name is very different but the match is correct; I guess those were matched using the opinion content. For example:
In this cluster: https://www.courtlistener.com/opinion/2208256/go/ we have the case name: "Tl v. Wl"
but in the xml file delaware\court_opinions\documents\394ecae8e7788179.xml we have: "In Re Long v. Long"
We have the same court (delfamct), same citation (820 A.2d 506), same docket number (C502-04026), and same filed date (2003-01-08), but the case names are completely different. I suppose that in this case they were matched using the content of the opinion.
I am going to implement what I mentioned above to take into account the content of the opinions when verifying that the cluster matches the file.
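A self-contained sketch of that content check using TF-IDF cosine similarity (scikit-learn). The system already has its own similarity helpers, so this is only illustrative, and the 0.95 threshold is a guess.

```python
# Illustrative content-based verification: compare the cluster's opinion text
# against the XML file's opinion text with TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def opinions_match(cl_text, xml_text, threshold=0.95):
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([cl_text, xml_text])
    score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
    return score >= threshold
```

Note that content similarity alone cannot separate the templated memoranda shown above (they already score around 90%), so this would have to complement the metadata and case-name checks rather than replace them.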
I'm going to continue reverse engineering the list to see how else I can verify the matches.
There are only ~115k rows left to validate, among these are the incorrect matches.
Just as requested, I made some small graphs to better visualize what still needs to be validated before merging into CourtListener.
remaining columbia matches graphs.ods
Many of those that need to be validated are from lactapp and nyappdiv. Most are single opinions in the file. The highest concentration is between 1994 and 2011.
Interesting. I didn't see any trends that make life easier here, but it sounds like you're making progress anyway.
Worth revisiting this one?
@quevon24 can you take this and fill in the gaps for me? I want to use this issue both to understand the mismatches you found in the matched Columbia file and to diagnose where and how they happened.
Description
There are some files in the matched Columbia file that are incorrect: the cluster has no relation to the file.
Add the list of mismatches
Explain how you found them
I assumed that we could have the same problem that occurred with Harvard: there may be several cases on the same page (they share the same citation) with very similar content and names. Since it is not common for this to happen, I took some citations at random and manually verified the results, which allowed me to find the problems presented in the table above. I found the last row in the table by luck, while reviewing some data at random.
Possible Solution
A possible solution would be to process all the clusters where we have repeated citations and try to use other data, such as the docket number, to ensure that we have the correct match. If we do not have that data, try to compare the names in a more exhaustive way (we can't use the filed date or the court, because if there are many cases on the same page that data could be the same for all clusters).
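A hedged sketch of that pass, assuming the code already has the clusters that share a citation in hand (the attribute names like .docket_number and .case_name are assumptions about the objects available):

```python
# Sketch: among clusters sharing a citation, prefer a docket-number match and
# fall back to case-name comparison only when no docket number is available.
from difflib import SequenceMatcher


def name_score(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_shared_citation(xml_case, clusters):
    """clusters: every cluster that shares the XML file's citation."""
    if xml_case.docket_number:
        by_docket = [
            c for c in clusters if c.docket_number == xml_case.docket_number
        ]
        if len(by_docket) == 1:
            return by_docket[0]
    # No usable docket number: pick the closest case name, if it is close enough.
    best = max(clusters, key=lambda c: name_score(c.case_name, xml_case.case_name))
    if name_score(best.case_name, xml_case.case_name) >= 0.8:
        return best
    return None  # ambiguous; leave for manual review
```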
Additional Context
Add any other context about the problem here, such as screenshots, logs, or links to relevant data.
Thanks for the help in advance.