Synonym file needs more entries

morninj commented 10 years ago

When I query Clapper v. Amnesty Intern. USA, it should return this, but instead I get no results.

mlissner commented 10 years ago

Can you post a link to your query please?

On Aug 1, 2014 8:13 PM, "Joseph Mornin" notifications@github.com wrote:

When I query Clapper v. Amnesty Intern. USA, it should return this https://www.courtlistener.com/scotus/5d9s/clapper-v-amnesty-international-usa/?q=&order_by=score+desc&case_name=Clapper+v.+Amnesty&stat_Precedential=on, but instead I get no results.

— Reply to this email directly or view it on GitHub https://github.com/freelawproject/courtlistener/issues/273.

morninj commented 10 years ago

https://www.courtlistener.com/?q=&case_name=Clapper%20v.%20Amnesty%20Intern.%20USA&stat_Precedential=on&order_by=score+desc

mlissner commented 10 years ago

We have a very old synonym file here:

https://github.com/freelawproject/courtlistener/blob/master/Solr/conf/lang/synonyms_en.txt

It's got the barest minimum of items, but we could go a long way by adding items to it.

If you look at it, it has a bunch of examples of how to set up synonyms. The big question I have is whether any lists like this are already out there.

@emasters, did you ever make something like this?

morninj commented 10 years ago

Would be great to support Bluebook T6: https://i.imgur.com/sTklEaD.jpg

nowherenearithaca commented 10 years ago

Newbie question: do lawyers stick to those fairly consistently (since whenever BlueBook originally came out)?

morninj commented 10 years ago

Yes—at least for case names in citations (as opposed to case names in sentences). This matters because someone might query CourtListener by pasting a citation.

mlissner commented 10 years ago

T6 is apparently now T11. Here's a digital form of it:

https://law.resource.org/pub/us/code/blue/IndigoBook.html#T11

That'd be a good starting point. Still, I feel like versions of this are already floating around somewhere...

mlissner commented 10 years ago

Another person I remember talking about synonym files was @waldoj. Did you ever have such a file, Waldo?

waldoj commented 10 years ago

Nope, but it's on my list of things I'd like to create as an @opendata project. I think there are enough State Decoded implementations to have a pretty good corpus of extracted terms and definitions to work with now, too.

mlissner commented 10 years ago

waldoj commented 10 years ago

:+1:

mlissner commented 10 years ago

Looks like the synonym work upstream is basically done. The remainder of this issue is probably therefore:

[x] Pull in the work from upstream
[x] Make sure the synonyms from T6 (linked above) are incorporated
[x] Comb over the synonym file and make sure it doesn't have any bad entries that'll throw our system totally out of whack (I've found a few just skimming).

mlissner commented 10 years ago

Oh yeah, and then:

[x] deploy the thing.

mlissner commented 8 years ago

I just did this today, so this will finally be fixed once I pull it and things get reindexed. The data comes mostly from the Indigo Book, which has many tables of abbreviations, such as:

U.S. States and Other Jurisdictions
Services & Publishers
Legislative Documents
Treaties
Arbitral Reporters
Intergovernmental Organizations
Court names
Titles of judges and other people
Case name abbreviations
Geographical terms
Document subdivisions (paragraph, section, etc.)
Explanatory phrases
Institutions
Publishing terms
Month names
Common words in periodical names

On top of that, I added a few things:

Numbers from 1-20
Units
Any extra things I saw missing (for example, trans didn't have a mapping to transgender)

From there I did the following:

Removed duplicates (of which there were many)
Removed phrases (Solr is bad at this)
Removed anything with a period or apostrophe in the middle, like U.S., because those get split by Solr anyway.
Cleaned up a bunch of items that had brackets, like trans[lator, lation], trans. Where the expanded words were semantically different, I made them into mappings; otherwise, I made them into synonyms. For example, here's what trans maps to:
```
trans => translation,translator,transgender
```
Whereas something else might just be:
```
assemb,assembly,assemblyman,assemblywoman,assemblymember
```
Because they're all essentially the same.
Eliminated one-letter abbreviations (they're not likely to be useful)
Removed real words that'd cause trouble. For example, we wouldn't want every search for cat to turn up results for category.
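
For anyone who wants to repeat this kind of cleanup on another abbreviation list, here is a minimal Python sketch of the filtering steps above. It is not the script that produced the actual file; the function names, the sample terms, and the `BLOCKLIST` of "real words" are all hypothetical, and the real list was presumably curated by hand.

```python
import re

# Matches entries such as "trans[lator, lation]": a stem followed by bracketed suffixes.
BRACKET_RE = re.compile(r"^(\w+)\[([^\]]+)\]$")


def expand_brackets(term):
    """Turn 'trans[lator, lation]' into ['translator', 'translation'].

    Terms without brackets are returned unchanged in a one-item list.
    """
    term = term.strip()
    match = BRACKET_RE.match(term)
    if not match:
        return [term]
    stem, suffixes = match.groups()
    return [stem + s.strip() for s in suffixes.split(",") if s.strip()]


def is_usable(term, blocklist):
    """Apply the removal rules described above to a single term."""
    core = term.rstrip(".")  # a trailing period ("trans.") is harmless
    if len(core) <= 1:  # one-letter abbreviations
        return False
    if len(core.split()) > 1:  # phrases
        return False
    if any(ch in ".'\u2019" for ch in core):  # period or apostrophe in the middle
        return False
    if core.lower() in blocklist:  # real words that would cause trouble
        return False
    return True


def clean_terms(raw_terms, blocklist=frozenset()):
    """Expand brackets, drop unusable terms, and de-duplicate, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_terms:
        for term in expand_brackets(raw):
            core = term.rstrip(".").lower()
            if core in seen or not is_usable(term, blocklist):
                continue
            seen.add(core)
            cleaned.append(core)
    return cleaned


# Hypothetical sample input, loosely modeled on the examples in this comment.
RAW_TERMS = [
    "trans[lator, lation]", "trans.", "Trans.", "U.S.", "int'l",
    "J", "court of appeals", "cat", "assemb",
]
BLOCKLIST = frozenset({"cat"})  # assumed hand-picked list of risky real words

cleaned = clean_terms(RAW_TERMS, BLOCKLIST)
print(cleaned)
```

Deciding whether a cleaned group becomes a one-way mapping or a set of equivalent synonyms is a semantic judgment, so this sketch leaves that step to a human.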

All in all, I think it's a fine list. Definitely a conservative one, which lawyers like, but also one that should make lots of searches (especially those involving abbreviations) work better.
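
To make the difference between the two kinds of entries concrete, here is a rough Python sketch of how Solr-style synonym lines can be read: an explicit mapping (`a => b,c`) replaces the left-hand term with the right-hand terms, while a plain comma-separated line treats every term as equivalent. This only imitates the file format for illustration. It is not Solr's implementation and it ignores tokenization and the rest of the analysis chain; `parse_synonyms` and `expand_query` are hypothetical names.

```python
from collections import defaultdict


def parse_synonyms(lines, expand=True):
    """Build a {token: set of replacement tokens} table from synonym-file lines."""
    table = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        if "=>" in line:
            # Explicit mapping: left-hand terms are replaced by right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
        else:
            # Equivalent synonyms: with expand=True every term maps to all terms;
            # with expand=False every term maps to the first term only.
            terms = [t.strip().lower() for t in line.split(",") if t.strip()]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            table[source].update(targets)
    return dict(table)


def expand_query(query, table):
    """Return, for each whitespace-separated token, the sorted terms it would match."""
    return [sorted(table.get(token, {token})) for token in query.lower().split()]


# The two example entries quoted in this comment.
EXAMPLE_LINES = [
    "# explicit mapping: the expansions differ in meaning",
    "trans => translation,translator,transgender",
    "# equivalent synonyms: all essentially the same",
    "assemb,assembly,assemblyman,assemblywoman,assemblymember",
]

table = parse_synonyms(EXAMPLE_LINES)
for alternatives in expand_query("trans assembly bill", table):
    print(alternatives)
```

Note that the mapping is one-way in this sketch: a search for trans picks up translation, but a search for translation is left alone, which matches the conservative intent described above.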

mlissner commented 8 years ago

This is now deployed, and @morninj, your query is definitely improved: https://www.courtlistener.com/?q=Clapper+v.+Amnesty+Intern.+USA&type=o&order_by=score+desc&stat_Precedential=on

I'm loving it, actually. It's a subtle but big improvement.
