freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Discuss and Resolve Columbia Merger/Matching Issue #3649

Open flooie opened 8 months ago

flooie commented 8 months ago

@quevon24, can you take this and fill in the gaps for me? I want to use this issue both to understand the mismatches you found in the matched Columbia file and to diagnose where and how they happened.

Description

There are some files in the matched Columbia file that are incorrect: the cluster has no relation to the file.

List of mismatches

| cluster_id | file | diagnosis |
| --- | --- | --- |
| 2585129 | washington/supreme_court_opinions/documents/fc1502147ffa1fe4.xml | multiple cases on the same page |
| 5834449 | new_york/supreme_court_appellate_division_opinions/documents/27f1df354367d4c5.xml | multiple cases on the same page |
| 5862684 | new_york/supreme_court_appellate_division_opinions/documents/545a23fac438021d.xml | multiple cases on the same page |
| 5854148 | new_york/supreme_court_appellate_division_opinions/documents/e133760de478978f.xml | multiple cases on the same page |
| 1698585 | florida/court_opinions/documents/a9b71e72d5c6be69.xml | found manually |
| 9385264 | new_york\court_of_appeals_opinions\documents\6cdd98244940e57f.xml | |
| 2630950 | colorado\court_opinions\documents\493cd849db706e01.xml | |
| 2529425 | new_york\court_of_appeals_opinions\documents\7494889be9d7e700.xml | |
| 2527416 | new_york\court_of_appeals_opinions\documents\636a1414da5d7a11.xml | |
| 2555667 | new_york\court_of_appeals_opinions\documents\a5fe0a6948425a17.xml | |
| 111662 | michigan\court_of_appeals_opinions\documents\0e805dcf3f3430a1.xml | |
| 2556055 | new_york\court_of_appeals_opinions\documents\e28085f5be892738.xml | |
| 2555347 | new_york\court_of_appeals_opinions\documents\b5d5e95aa9225c55.xml | |
| 2555445 | new_york\court_of_appeals_opinions\documents\9ab103f81ec48f8e.xml | |
| 9401131 | new_york\court_of_appeals_opinions\documents\aa21ae1ce7e8b860.xml | |
| 9401132 | new_york\court_of_appeals_opinions\documents\ae9660705b458638.xml | |
| 2003918 | new_york\court_of_appeals_opinions\documents\31f2e03346d8a60b.xml | |
| 829205 | michigan\supreme_court_opinions\documents\8943f1f355c5cc4b.xml | |
| 2560947 | new_york\court_of_appeals_opinions\documents\22b65cf3e46d5178.xml | |
| 2161378 | new_york\court_of_appeals_opinions\documents\41ab33ec7083b109.xml | |
| 2159204 | michigan\supreme_court_opinions\documents\ecff7178cab9df1d.xml | |
| 2057096 | michigan\supreme_court_opinions\documents\f0d0620fda215a67.xml | |
| 2281939 | california\court_of_appeal_opinions\documents\f3cc8502dc5f5157.xml | |
| 5437803 | new_york\supreme_court_appellate_division_opinions\documents\7b37cbd6d1851080.xml | |
| 5638432 | new_york\court_of_appeals_opinions\documents\6f187bfbecd9ca68.xml | |
| 1074673 | tennessee\court_opinions\documents\727f3bbe358405c1.xml | |
| 2205148 | michigan\supreme_court_opinions\documents\3e1c2837038d8ab2.xml | |
| 2205361 | michigan\supreme_court_opinions\documents\8108c7aaf16839a0.xml | |
| 2442390 | delaware\court_opinions\documents\a9cce50448a27124.xml | |
| 2554584 | pennsylvania\supreme_court_opinions\documents\391e3dfec31ea8ee.xml | |
| 2396751 | north_carolina\court_opinions\documents\c635f37e9a1c4559.xml | |
| 2260803 | utah\court_opinions\documents\ed2800111fe9cd06.xml | |
| 9400043 | new_york\court_of_appeals_opinions\documents\4a31ff06732ec090.xml | |
| 2327325 | pennsylvania\superior_court_opinions\documents\65e950b90c46574f.xml | |
| 110959 | kentucky\court_opinions\documents\624c731519c06d87.xml | |
| 1994929 | pennsylvania\supreme_court_opinions\documents\a7e01d983bc7d0e7.xml | |
| 2219211 | michigan\supreme_court_opinions\documents\c6a2e46793e838af.xml | |
| 2221556 | michigan\court_of_appeals_opinions\documents\19a037ede4143de2.xml | |
| 2223990 | new_york\court_of_appeals_opinions\documents\5b8741da55a6fef8.xml | |
| 6396423 | pennsylvania\supreme_court_opinions\documents\495219041ab46510.xml | |

How the mismatches were found

I assumed we could have the same problem that occurred with Harvard: there may be several cases on the same page (they share the same citation) with very similar content and names. Since this is not common, I took some citations at random and manually verified the results, which allowed me to find the problems presented in the table above. I found the last row in the table by luck while reviewing some data at random.

Possible Solution

A possible solution would be to process all the clusters where we have repeated citations and use other data, such as the docket number, to make sure we have the correct match. If we do not have that data, we can compare the case names in a more exhaustive way (we can't use date filed or court, because if there are many cases on the same page that data could be the same for all clusters).
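As a rough illustration of that order of checks (docket number first, then a more exhaustive name comparison), a minimal sketch could look like the following. The record shapes and the 0.8 threshold are assumptions, not the real matching code:

```python
from difflib import SequenceMatcher


def pick_cluster(xml_case, candidate_clusters):
    """Pick one cluster among several that share the XML case's citation.

    `xml_case` and the cluster objects are hypothetical records with
    `.docket_number` and `.case_name` attributes, not real CL models.
    """
    # 1. The docket number is the strongest signal when both sides have one.
    if xml_case.docket_number:
        by_docket = [
            c for c in candidate_clusters
            if c.docket_number and c.docket_number == xml_case.docket_number
        ]
        if len(by_docket) == 1:
            return by_docket[0]

    # 2. Fall back to a more exhaustive case-name comparison. Court and
    #    date filed are useless here: cases printed on the same reporter
    #    page often share both.
    def name_score(cluster):
        return SequenceMatcher(
            None, xml_case.case_name.lower(), cluster.case_name.lower()
        ).ratio()

    best = max(candidate_clusters, key=name_score, default=None)
    if best is not None and name_score(best) >= 0.8:  # threshold is a guess
        return best
    return None
```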

Thanks for the help in advance.

flooie commented 8 months ago

Can you append the code you used to find the multiple cases on the same page? @quevon24

flooie commented 8 months ago

Can you check whether a docket number is available for each of your findings? I think it would be instructive. I would have assumed docket numbers were checked against both sources in the original matching code.

quevon24 commented 8 months ago

Yesterday I talked to @flooie about what could be done to solve these problems. He suggested that we use a command to test the list of matches we have, without modifying the system, just logging whether the matches in the file are the same as we would get if we ran everything directly on the server, to separate what matches from what doesn't.

The command is ready, you can see it here for now: https://gist.github.com/quevon24/bdaa1b2d860878cb77ee9424b9c64d34

After that, I cloned some cases to test how the command worked and to validate that the matches were correct.

I also tried the cases that we know were incorrectly matched, and indeed the current code matched them despite being incorrect.

As mentioned before, part of the problem is that you can have several cases on the same page, and in very particular cases those cases have similar names and almost identical opinions. Analyzing this, I thought that when this happens we could evaluate all the possible matches and see which one is the most similar (perhaps based on the case name).

There is something important to mention: not all XML files have citations or a filed date, so the command as it currently stands will ignore those files. To address them, we need to add another function that checks whether any case in the system matches the XML using the court, the filing date, the docket number, and the content of the opinion.
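A rough sketch of that fallback, for illustration only. The queryset assumes Django models shaped like CourtListener's `OpinionCluster`/`Docket`; the exact field names, the `xml_case` record, and the 0.9 threshold are approximations, not verified against the matching code:

```python
from difflib import SequenceMatcher

from cl.search.models import OpinionCluster  # assumed import path


def find_cluster_without_citation(xml_case):
    """Fallback lookup for XML files that carry no citation.

    Narrows candidates by court and filed date (plus docket number when
    present), then confirms with opinion-content similarity.
    """
    candidates = OpinionCluster.objects.filter(
        docket__court_id=xml_case.court_id,
        date_filed=xml_case.date_filed,
    )
    if xml_case.docket_number:
        candidates = candidates.filter(
            docket__docket_number=xml_case.docket_number
        )

    def content_score(cluster):
        cl_text = " ".join(op.plain_text for op in cluster.sub_opinions.all())
        return SequenceMatcher(None, xml_case.opinion_text, cl_text).ratio()

    best = max(candidates, key=content_score, default=None)
    # Require a high content match; the cutoff here is a guess.
    if best is not None and content_score(best) >= 0.9:
        return best
    return None
```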

So far the two patterns I have identified to detect these incorrect matches are:

  1. There are several cases on the same page
  2. The XML file does not have a citation

Yesterday I also updated the list above with new incorrect matches.

flooie commented 8 months ago

Thanks @quevon24

flooie commented 8 months ago

This is great. And one of your new examples shows exactly why this is so difficult and tricky.

new_york\court_of_appeals_opinions\documents\22b65cf3e46d5178.xml

was incorrectly matched, but when I found the opinion in our system it was a Harvard opinion merged with a Lawbox opinion. Those two opinions match, but the Columbia data that also matches is missing the middle 95% of the opinion.

I'm not sure why, but it seems like the Columbia opinion was maybe a slip opinion, and the Harvard and Lawbox versions were the combined collection of the ruling plus the supporting memorandum.

[Screenshot 2024-01-30 at 11:05:06 AM]

For the visual learners: the highlighted part is the memorandum, which is included in the two opinions in our system, while the Columbia file contains just the top and the last line.

flooie commented 8 months ago

I think we need to step back and put down more information about this import/merger:

- 3,978,306 opinions in Columbia never imported
- 730,514 opinions imported from Columbia
- 4,708,820 opinion XML files in total

quevon24 commented 8 months ago

A few updates: I already tried to improve the process that compares case names so it avoids relying on a set of false-positive words like "united states", "in re", "people", etc., but so far no luck. I also tried some algorithms we already have in the system, like cosine similarity, but the result is very similar to what we already have.

I tested a new approach representing the case names as semantic vectors, to take into account not only the words but also the structure of the names. The problem with this approach is that it requires a trained corpus. I already tested a few, but some words in the case names are not present in them, so making it work would mean training a corpus on the case-name vocabulary. I don't know how long that would take, or whether I could use techniques like transfer learning to avoid starting from scratch.

I also tried comparing case names after expanding the abbreviations in them, to increase the similarity and overlap between names. It improves the result, but sometimes it is still not enough to say we are 100% sure that both case names (from the file and from the cluster) are the same.
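For illustration, abbreviation expansion along those lines can be as simple as a regex lookup table applied before comparison; the entries here are hypothetical samples, not the list actually used:

```python
import re

# Hypothetical sample of expansions; a real list would be much longer.
ABBREVIATIONS = {
    r"\bco\.": "company",
    r"\bcorp\.": "corporation",
    r"\binc\.": "incorporated",
    r"\bdept\.": "department",
    r"\bcomm\.": "commission",
}


def expand_abbreviations(case_name: str) -> str:
    """Expand reporter-style abbreviations so overlap checks see full words."""
    name = case_name.lower()
    for pattern, replacement in ABBREVIATIONS.items():
        name = re.sub(pattern, replacement, name)
    return name


# "Smith v. Acme Co." and "Smith v. Acme Company" now normalize to the
# same string, raising their similarity score.
```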

This part is essential in some edge cases, because when we have a case with the same court, same citation, same filed date, and even very similar opinion content, the only thing that differentiates one match from another is the case name. For example, take cluster id 5834449 and this file: new_york\supreme_court_appellate_division_opinions\documents\27f1df354367d4c5.xml

Content from CL: Appeal from judgment of the County Court, Westchester County, convicting appellant of the crime of burglary in the third degree, and from intermediate orders. Judgment unanimously affirmed. No opinion. No separate appeal lies from the intermediate orders, which have been reviewed on the appeal from the judgment of conviction. Present — Nolan, P. J., Wenzel, Beldock, Murphy and Kleinfeld, JJ.

Content from the XML file: Appeal from judgment of the County Court, Orange County, convicting appellant of the crime of assault in the second degree, and from intermediate orders. Judgment unanimously affirmed. No opinion. No separate appeal lies from the intermediate orders, which have been reviewed on the appeal from the judgment of conviction. Present — Nolan, P.J., Wenzel, Beldock, Murphy and Kleinfeld, JJ.

It seems as if the court filled out a template and just filled in the blanks.

When comparing both texts, the document-comparison algorithm gives us a 90% match. They are almost identical, and there has to be a threshold to decide whether they are similar or not.

So the only signal left is that the names are different:

"People v. Grubbs" from the cluster and "PEOPLE v. FRYER" from the file.

If we only use word overlap, it gives us "people" as the overlap, and according to the current algorithm that means both cases are the same. The solution here is to add "people" to the false-positive list. The catch is that this is probably not the only word we need to add to that list, and those words have to be found by hand.

If we compare the case names using cosine similarity, it gives us more than a 50% match, because the only difference between the two case names is the last name.
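To make that concrete, here is a toy version of the overlap check with a false-positive list applied; the stopword set is illustrative, not the real one:

```python
# Toy reproduction of the overlap problem described above.
FALSE_POSITIVES = {"people", "state", "united", "states", "in", "re", "v"}


def name_tokens(case_name):
    return {token.strip(".,").lower() for token in case_name.split()}


def meaningful_overlap(name_a, name_b):
    """Tokens shared by both names after dropping generic party words."""
    return (name_tokens(name_a) & name_tokens(name_b)) - FALSE_POSITIVES


# Raw overlap is {"people", "v"}, which the current algorithm reads as a
# match; after filtering, nothing meaningful is shared:
print(meaningful_overlap("People v. Grubbs", "PEOPLE v. FRYER"))  # set()
```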

mlissner commented 8 months ago

> The solution here is to add "people" to the false-positive list. The catch is that this is probably not the only word we need to add to that list, and those words have to be found by hand.

I think we're probably quite close to finding all of these kinds of words. I know Bill will have lots of thoughts about this (I feel like a passerby, so don't take me too seriously), but rather than training a corpus, why not make a simple word list by finding common words, looking at them for generic ones like "people" and then going from there?
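For illustration, that kind of list could be bootstrapped by counting token frequencies across all case names and reviewing the top of the output by hand; `all_case_names` is a placeholder for however the names get loaded:

```python
from collections import Counter


def common_name_tokens(all_case_names, top_n=50):
    """Count tokens across case names so generic ones float to the top."""
    counts = Counter(
        token.strip(".,()").lower()
        for name in all_case_names
        for token in name.split()
    )
    return counts.most_common(top_n)


# Tokens like "people", "state", or "united" should dominate the output,
# and a human can then cherry-pick the generic ones for the list.
```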

flooie commented 8 months ago

I agree with @mlissner. I don't want to create a new corpus to do this. I think we should look at how courts handle this and just replicate what they do.

quevon24 commented 8 months ago

I think I already have a list of words and text strings that might help. In some cases we only need to remove one word, but in other cases, like "United States v. Foo", we need to remove the entire string ("united states"). We can't remove the individual words ("united" or "states") because that could affect other case names: "DeSandolo v. United Airlines Inc." would become "DeSandolo v. Airlines Inc." (see the sketch after the lists below).

Words: L714

Strings: L733
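A minimal sketch of that phrase-first order, with both lists as illustrative stand-ins for the linked ones:

```python
import re

# Illustrative stand-ins for the linked lists. Phrases are removed first,
# longest first, so their component words never get stripped on their own.
GENERIC_STRINGS = ["united states of america", "united states", "in re"]
GENERIC_WORDS = ["people", "state"]


def strip_generic_parts(case_name):
    name = case_name.lower()
    for phrase in sorted(GENERIC_STRINGS, key=len, reverse=True):
        name = name.replace(phrase, " ")
    for word in GENERIC_WORDS:
        name = re.sub(rf"\b{word}\b", " ", name)
    return re.sub(r"\s+", " ", name).strip()


print(strip_generic_parts("United States v. Foo"))
# -> "v. foo"
print(strip_generic_parts("DeSandolo v. United Airlines Inc."))
# -> "desandolo v. united airlines inc." (the airline survives intact)
```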

quevon24 commented 8 months ago

I updated the list with new incorrect matches.

I have some updates. While finding the new incorrect matches, I noticed that in some cases the name is very different but the match is correct; I guess those were matched using the opinion content. For example:

In this cluster: https://www.courtlistener.com/opinion/2208256/go/ we have the case name: "Tl v. Wl"

but in the xml file delaware\court_opinions\documents\394ecae8e7788179.xml we have: "In Re Long v. Long"

We have the same court (delfamct), same citation (820 A.2d 506), same docket number (C502-04026), and same filed date (2003-01-08), but the case names are completely different. I suppose that in this case they were matched using the content of the opinion.

I am going to implement what I mentioned above to take into account the content of the opinions when verifying that the cluster matches the file.

I'm going to continue reverse-engineering the list to see how else I can verify the matches.

quevon24 commented 6 months ago

There are only ~115k rows left to validate; among these are the incorrect matches.

As requested, I made some small graphs to better visualize what still needs to be validated before merging into CourtListener.

remaining columbia matches graphs.ods

Many of those that need to be validated are from lactapp and nyappdiv. Most are single opinions in the file. The highest concentration is between 1994 and 2011.

mlissner commented 6 months ago

Interesting. I didn't see any trends that make life easier here, but it sounds like you're making progress anyway.

mlissner commented 1 week ago

Worth revisiting this one?