freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Issue with synonyms on search #4671

Closed by v-anne 1 week ago

v-anne commented 1 week ago

I attempted to search for this case with the following query.

q=&type=r&order_by=dateFiled%20desc&case_name=eeoc&assigned_to=joseph&court=lawd

It does not show up. I am confused about why, since I think cl/search/elasticsearch_files/synonyms_en.txt covers EEOC / Equal Employment Opportunity Commission as synonyms.

mlissner commented 1 week ago

Looks like we made the decision not to do magic in case name searches here: https://github.com/freelawproject/courtlistener/pull/4410

I don't know what that synonym file is used for, but @albertisfu will when he gets a sec to catch up (he's been on vacation for three weeks).

albertisfu commented 1 week ago

> Looks like we made the decision not to do magic in case name searches here: https://github.com/freelawproject/courtlistener/pull/4410

Yes, that's correct. We decided to use the exact version of the caseName field, which doesn't consider stemming or synonyms.

Synonyms are available for all the other text fields. For instance, this search matches synonyms in the document's plain text and description:

https://www.courtlistener.com/?q=eeoc&type=r&order_by=dateFiled%20desc&assigned_to=joseph&court=lawd
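
For anyone comparing the two searches, here is a small sketch that diffs the query string from the original report against the one suggested above. It uses only the Python standard library; the helper name `params` is made up for illustration and is not part of CourtListener.

```python
from urllib.parse import parse_qs

# Query strings copied from the comments in this thread.
original = "q=&type=r&order_by=dateFiled%20desc&case_name=eeoc&assigned_to=joseph&court=lawd"
suggested = "q=eeoc&type=r&order_by=dateFiled%20desc&assigned_to=joseph&court=lawd"


def params(query: str) -> dict[str, str]:
    """Parse a query string into a flat dict (hypothetical helper)."""
    # keep_blank_values=True preserves the empty q= in the original query.
    parsed = parse_qs(query, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


a, b = params(original), params(suggested)

# Print only the parameters that differ between the two searches.
for key in sorted(a.keys() | b.keys()):
    if a.get(key) != b.get(key):
        print(f"{key}: {a.get(key)!r} -> {b.get(key)!r}")
```

The only differences are that `eeoc` moves from the `case_name` filter, which uses the exact field, to the main `q` box, which searches the fields where synonyms apply.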

mlissner commented 1 week ago

Cool, thanks Alberto. Do you know what the synonym file in the CL codebase is used for? Is it just to make synonyms work in dev?

albertisfu commented 1 week ago

Yeah, the file is synonyms_en.txt, and yes, it is only used in development. To update synonyms in production, that file needs to be loaded into the ES cluster so that ES recognizes it.

Last time, we needed to create a new config map for the file and then restart each node in the cluster.

However, it seems possible to use Kibana to simply reload the analyzers and avoid restarting the nodes: https://github.com/freelawproject/courtlistener/issues/3089#issuecomment-1703399991
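
As a rough sketch of what a reload without restarts could look like: Elasticsearch exposes a `_reload_search_analyzers` endpoint, which may be what the linked comment refers to. It only picks up changes for synonym filters marked `updateable`, and those can only be used in search-time analyzers. Everything named below (the host, the index name, the filter and analyzer names) is an assumption for illustration and is not taken from the CourtListener configuration. The request is built but not sent.

```python
import json
from urllib.request import Request

# Hypothetical values; not taken from the CourtListener deployment.
ES_URL = "http://localhost:9200"
INDEX = "recap_example"

# Illustrative analysis settings. "synonyms_path" is resolved relative to the
# Elasticsearch config directory on each node, which is why the file has to be
# present on the nodes themselves and not just in the application repo.
ANALYSIS_SETTINGS = {
    "analysis": {
        "filter": {
            "english_synonyms": {  # hypothetical filter name
                "type": "synonym_graph",
                "synonyms_path": "synonyms_en.txt",
                "updateable": True,  # required for reloading without a restart
            }
        },
        "analyzer": {
            "search_with_synonyms": {  # hypothetical analyzer name
                "tokenizer": "standard",
                "filter": ["lowercase", "english_synonyms"],
            }
        },
    }
}


def reload_request(es_url: str, index: str) -> Request:
    """Build (but do not send) a POST to the reload-search-analyzers endpoint."""
    return Request(f"{es_url}/{index}/_reload_search_analyzers", method="POST")


req = reload_request(ES_URL, INDEX)
print(req.get_method(), req.full_url)
print(json.dumps(ANALYSIS_SETTINGS, indent=2))
```

The same POST can be issued from the Kibana Dev Tools console. Whether the production index already uses an updateable filter would need to be checked before relying on this.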