freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

'Related Case Law' section is not being shown on the Opinion page when ES is enabled. #4305

Open albertisfu opened 1 month ago

albertisfu commented 1 month ago

While working on #4211, I noticed that the related opinions in the 'Related Case Law' section on the Opinion detail page are not being shown in the ES version.

Example: https://www.courtlistener.com/opinion/1472349/roe-v-wade/?type=o&q=Roe%20v.%20Wade&type=o&order_by=score%20desc

If you look at it while logged off (Solr version), you can see the 'Related Case Law' sidebar with opinions. However, if you view the same opinion while logged in as staff (ES version), the related opinions are not shown.

I believe this could be an issue related to caching, or it might be that the more_like_this query isn't matching the right results.

@mlissner We could start by checking whether deleting the mlt-cluster-es cache entry for this cluster helps:

from django.core.cache import caches

cache = caches["db_cache"]
mlt_cache_key = f"mlt-cluster-es:{1472349}"
cache.delete(mlt_cache_key)

Additionally, we can run this query in Kibana to ensure it’s returning the correct results.

GET /opinion_index/_search?pretty

{
   "query":{
      "bool":{
         "should":[
            {
               "has_child":{
                  "type":"opinion",
                  "score_mode":"max",
                  "query":{
                     "bool":{
                        "filter":[
                           {
                              "terms":{
                                 "status.raw":[
                                    "Published"
                                 ]
                              }
                           }
                        ],
                        "should":[
                           {
                              "query_string":{
                                 "fields":[
                                    "court",
                                    "court_id",
                                    "citation",
                                    "judge",
                                    "caseNameFull",
                                    "caseName",
                                    "status",
                                    "suitNature",
                                    "attorney",
                                    "procedural_history",
                                    "posture",
                                    "syllabus",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber",
                                    "text^1.0",
                                    "type^1.0",
                                    "caseName^4.0",
                                    "docketNumber^2.0"
                                 ],
                                 "query":"related:1472349",
                                 "quote_field_suffix":".exact",
                                 "default_operator":"AND",
                                 "tie_breaker":0.3,
                                 "fuzziness":2
                              }
                           },
                           {
                              "query_string":{
                                 "fields":[
                                    "court",
                                    "court_id",
                                    "citation",
                                    "judge",
                                    "caseNameFull",
                                    "caseName",
                                    "status",
                                    "suitNature",
                                    "attorney",
                                    "procedural_history",
                                    "posture",
                                    "syllabus",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber",
                                    "text^1.0",
                                    "type^1.0",
                                    "caseName^4.0",
                                    "docketNumber^2.0"
                                 ],
                                 "query":"related:1472349",
                                 "quote_field_suffix":".exact",
                                 "default_operator":"AND",
                                 "type":"phrase",
                                 "fuzziness":2
                              }
                           },
                           {
                              "more_like_this":{
                                 "fields":[
                                    "court",
                                    "court_id",
                                    "citation",
                                    "judge",
                                    "caseNameFull",
                                    "caseName",
                                    "status",
                                    "suitNature",
                                    "attorney",
                                    "procedural_history",
                                    "posture",
                                    "syllabus",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber"
                                 ],
                                 "like":[
                                    {
                                       "_id":"o_1472349"
                                    }
                                 ],
                                 "min_term_freq":1,
                                 "max_query_terms":12
                              }
                           }
                        ],
                        "minimum_should_match":1
                     }
                  },
                  "inner_hits":{
                     "name":"filter_query_inner_opinion",
                     "size":21,
                     "_source":{
                        "excludes":[
                           "text"
                        ]
                     },
                     "highlight":{
                        "fields":{
                           "text":{
                              "type":"fvh",
                              "matched_fields":[
                                 "text",
                                 "text.exact"
                              ],
                              "fragment_size":100,
                              "no_match_size":500,
                              "number_of_fragments":1,
                              "pre_tags":[
                                 "<mark>"
                              ],
                              "post_tags":[
                                 "</mark>"
                              ]
                           }
                        }
                     }
                  }
               }
            },
            {
               "bool":{
                  "filter":[
                     {
                        "terms":{
                           "status.raw":[
                              "Published"
                           ]
                        }
                     },
                     {
                        "match":{
                           "cluster_child":"opinion_cluster"
                        }
                     }
                  ],
                  "should":[
                     {
                        "query_string":{
                           "fields":[
                              "court",
                              "court_id",
                              "citation",
                              "judge",
                              "caseNameFull",
                              "caseName",
                              "status",
                              "suitNature",
                              "attorney",
                              "procedural_history",
                              "posture",
                              "syllabus",
                              "type",
                              "text",
                              "caseName",
                              "docketNumber",
                              "type",
                              "text",
                              "caseName",
                              "docketNumber",
                              "caseName^4.0",
                              "docketNumber^2.0"
                           ],
                           "query":"related:1472349",
                           "quote_field_suffix":".exact",
                           "default_operator":"AND",
                           "tie_breaker":0.3,
                           "fuzziness":2
                        }
                     },
                     {
                        "query_string":{
                           "fields":[
                              "court",
                              "court_id",
                              "citation",
                              "judge",
                              "caseNameFull",
                              "caseName",
                              "status",
                              "suitNature",
                              "attorney",
                              "procedural_history",
                              "posture",
                              "syllabus",
                              "type",
                              "text",
                              "caseName",
                              "docketNumber",
                              "type",
                              "text",
                              "caseName",
                              "docketNumber",
                              "caseName^4.0",
                              "docketNumber^2.0"
                           ],
                           "query":"related:1472349",
                           "quote_field_suffix":".exact",
                           "default_operator":"AND",
                           "type":"phrase",
                           "fuzziness":2
                        }
                     }
                  ],
                  "minimum_should_match":1
               }
            }
         ]
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "highlight":{
      "fields":{
         "caseName":{
            "type":"fvh",
            "matched_fields":[
               "caseName",
               "caseName.exact"
            ],
            "fragment_size":0,
            "no_match_size":0,
            "number_of_fragments":0,
            "pre_tags":[
               "<mark>"
            ],
            "post_tags":[
               "</mark>"
            ]
         },
         "citation":{
            "type":"fvh",
            "matched_fields":[
               "citation",
               "citation.exact"
            ],
            "fragment_size":0,
            "no_match_size":0,
            "number_of_fragments":0,
            "pre_tags":[
               "<mark>"
            ],
            "post_tags":[
               "</mark>"
            ]
         },
         "court_citation_string":{
            "type":"fvh",
            "matched_fields":[
               "court_citation_string",
               "court_citation_string.exact"
            ],
            "fragment_size":0,
            "no_match_size":0,
            "number_of_fragments":0,
            "pre_tags":[
               "<mark>"
            ],
            "post_tags":[
               "</mark>"
            ]
         },
         "docketNumber":{
            "type":"fvh",
            "matched_fields":[
               "docketNumber",
               "docketNumber.exact"
            ],
            "fragment_size":0,
            "no_match_size":0,
            "number_of_fragments":0,
            "pre_tags":[
               "<mark>"
            ],
            "post_tags":[
               "</mark>"
            ]
         },
         "suitNature":{
            "type":"fvh",
            "matched_fields":[
               "suitNature",
               "suitNature.exact"
            ],
            "fragment_size":0,
            "no_match_size":0,
            "number_of_fragments":0,
            "pre_tags":[
               "<mark>"
            ],
            "post_tags":[
               "</mark>"
            ]
         }
      }
   },
   "size":5,
   "track_total_hits":false,
   "from":0,
   "_source":{
      "excludes":[

      ]
   }
}
mlissner commented 1 month ago

Clearing the cache didn't help.

The query you requested returned:

{
  "took": 51,
  "timed_out": false,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": []
  }
}

Hm... :/

albertisfu commented 1 month ago

It's strange that the query above didn't return any results.

I cloned cluster 1472349 from CL, along with 5 of its related clusters as shown by Solr. After indexing them into ES locally, the cluster displayed the related clusters using the query above, so something else might be happening.

Just to confirm, was the query executed in the opinion_index?
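If it was, a quick sanity check (a minimal sketch, assuming the standard elasticsearch-py client and that "opinion_index" resolves to the right index or alias) is to confirm that the child opinion document and its parent cluster are actually indexed:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host/auth as needed

# Both documents should exist for the has_child / more_like_this query to return anything.
print(es.exists(index="opinion_index", id="o_1472349"))  # child opinion document
print(es.exists(index="opinion_index", id="1472349"))    # parent cluster document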

I also simplified the query by removing everything that's not required. Can we test it, please?

GET /opinion_index/_search?pretty

{
   "query":{
      "bool":{
         "should":[
            {
               "has_child":{
                  "type":"opinion",
                  "score_mode":"max",
                  "query":{
                     "bool":{
                        "filter":[
                           {
                              "terms":{
                                 "status.raw":[
                                    "Published"
                                 ]
                              }
                           }
                        ],
                        "should":[
                           {
                              "more_like_this":{
                                 "fields":[
                                    "court",
                                    "court_id",
                                    "citation",
                                    "judge",
                                    "caseNameFull",
                                    "caseName",
                                    "status",
                                    "suitNature",
                                    "attorney",
                                    "procedural_history",
                                    "posture",
                                    "syllabus",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber"
                                 ],
                                 "like":[
                                    {
                                       "_id":"o_1472349"
                                    }
                                 ],
                                 "min_term_freq":1,
                                 "max_query_terms":12
                              }
                           }
                        ],
                        "minimum_should_match":1
                     }
                  },
                  "inner_hits":{
                     "name":"filter_query_inner_opinion",
                     "size":0,
                     "_source":{
                        "excludes":[
                           "text"
                        ]
                     }
                  }
               }
            }
         ]
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "size":5,
   "track_total_hits":false,
   "from":0,
   "_source":{
      "excludes":[

      ]
   }
}

This other version is optimized and also incorporates additional parameters for the MLT query that are used in the Solr version.

{
   "query":{
      "bool":{
         "should":[
            {
               "bool":{
                  "filter":[
                     {
                        "terms":{
                           "status.raw":[
                              "Published"
                           ]
                        }
                     },
                     {
                        "match":{
                           "cluster_child":"opinion"
                        }
                     },
                     {
                              "more_like_this":{
                                 "fields":[
                                    "court",
                                    "court_id",
                                    "citation",
                                    "judge",
                                    "caseNameFull",
                                    "caseName",
                                    "status",
                                    "suitNature",
                                    "attorney",
                                    "procedural_history",
                                    "posture",
                                    "syllabus",
                                    "type",
                                    "text",
                                    "caseName",
                                    "docketNumber"
                                 ],
                                 "like":[
                                    {
                                       "_id":"o_1472349"
                                    }
                                 ],
                                 "min_term_freq":1,
                                 "max_query_terms":10,
                                 "min_word_length": 3,
                                 "max_word_length": 0,
                                 "max_doc_freq": 1000
                              }
                    }
                  ]
               }
            }
         ]
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "size":5,
   "from":0,
   "_source":{
      "excludes":[

      ]
   }
}
mlissner commented 1 month ago

Hm, both of those queries gave the minimal non-response like before?

albertisfu commented 4 weeks ago

Yeah, I was now able to reproduce the issue locally. After indexing a couple of thousand documents, these queries stopped retrieving results. It seems related to document frequency and the terms being queried. Another problem I can see is that related queries can now be quite restrictive, since they include all the fields that in Solr were indexed into a single field containing all the content, so that can also be an issue. I'm testing query tuning, and I'll also test a field that concentrates all the document fields using the copy_to parameter and re-indexing.
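As a rough illustration of the copy_to idea (a minimal sketch with elasticsearch-dsl, not CourtListener's actual document classes; the combined_text field name and the reduced field list are hypothetical), each per-field value would also be copied into a single catch-all field that the more_like_this query can target, similar to Solr's single content field:

from elasticsearch_dsl import Document, Text

# Hedged sketch: only a few representative fields are shown, and a full
# reindex is required so copy_to can populate the new field for existing docs.
class OpinionSketch(Document):
    caseName = Text(copy_to="combined_text")
    syllabus = Text(copy_to="combined_text")
    text = Text(copy_to="combined_text")
    combined_text = Text()  # populated only via copy_to at index time

    class Index:
        name = "opinion_index"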

albertisfu commented 4 weeks ago

Ok, I did many tests around this issue. Initially, I thought the problem was directly related to field fragmentation in the MLT query, since the Solr version only looked up a single field. Now, the MLT fields are:

 "court",
  "court_id",
  "citation",
  "judge",
  "caseNameFull",
  "caseName",
  "status",
  "suitNature",
  "attorney",
  "procedural_history",
  "posture",
  "syllabus",
  "type",
  "text",
  "caseName",
  "docketNumber"

I added a field that combines all of these fields into a single one using the copy_to mapping parameter, so the new field can be populated by re-indexing. However, even after adding this new field and using it to perform the more_like_this query, the query continued to behave incorrectly: it matched many documents that seemed irrelevant, or didn't match any at all, depending on the settings used to select relevant terms.

Then I realized that the issue was directly related to the analyzer used to index the content in this field and to perform the MLT query. Using text_en_splitting_cl as the index-time analyzer removed duplicate terms, which impacted the selection of relevant terms, since one of the MLT settings depends on how frequently a term appears in the content.

Also, using the default analyzer for search (search_analyzer) introduced more noise by considering synonyms at search time.

After switching to the exact analyzers, the MLT query selected more relevant terms (which can be inspected with a profiling query), and the related results appear to be of better quality.

Then I realized that the problem with using the MLT query across many fields could also be related to the analyzers rather than to field fragmentation. So I tweaked the fields to use their exact versions and changed the query analyzer to "search_analyzer_exact". It was also necessary to add stop words directly to the query so they aren't selected as relevant terms, because the exact versions of the fields don't remove them.
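As a side note, a quick way to see how differently the two analyzers tokenize the same text (and therefore why term selection changes) is the _analyze API. A minimal sketch, assuming the elasticsearch-py client and that the analyzer names mentioned above exist on the index; the sample sentence is made up:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
sample = "The appellant contends that the statute is unconstitutional."

# Compare token output between the index-time analyzer and the exact search analyzer.
for analyzer in ("text_en_splitting_cl", "search_analyzer_exact"):
    resp = es.indices.analyze(index="opinion_index", body={"analyzer": analyzer, "text": sample})
    print(analyzer, [t["token"] for t in resp["tokens"]])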

With this exact-fields query, I got results very similar to the combined_fields approach, so the combined field might not be required after all. Still, we could test it against production data to compare result quality, and performance could be better with the combined_fields approach. I added more details in the PR if we want to test that approach too: https://github.com/freelawproject/courtlistener/pull/4316

For now, we can test this fixed query, which uses the right fields and analyzer and also includes stop words. We can analyze the quality of the results, and if they look okay and the query isn't too slow, we may be able to go with this approach without implementing the #4316 approach.

{
   "profile":true,
   "query":{
      "bool":{
         "should":[
            {
               "bool":{
                  "filter":[
                     {
                        "terms":{
                           "status.raw":[
                              "Published"
                           ]
                        }
                     },
                     {
                        "match":{
                           "cluster_child":"opinion"
                        }
                     },
                     {
                        "more_like_this":{
                           "like":[
                              {
                                 "_id":"o_1472349"
                              }
                           ],
                           "fields":[
                              "court.exact",
                              "court_id.exact",
                              "citation.exact",
                              "judge.exact",
                              "caseNameFull.exact",
                              "caseName.exact",
                              "status.exact",
                              "suitNature.exact",
                              "attorney.exact",
                              "procedural_history.exact",
                              "posture.exact",
                              "syllabus.exact",
                              "type.exact",
                              "text.exact",
                              "caseName.exact",
                              "docketNumber.exact"
                           ],
                           "min_term_freq":5,
                           "max_query_terms":10,
                           "min_word_length":3,
                           "max_word_length":0,
                           "max_doc_freq":1000,
                           "minimum_should_match":"30%",
                           "analyzer":"search_analyzer_exact",
                           "stop_words":[
                              "a",
                              "about",
                              "above",
                              "after",
                              "again",
                              "against",
                              "ain",
                              "all",
                              "am",
                              "an",
                              "and",
                              "any",
                              "are",
                              "aren",
                              "arent",
                              "as",
                              "at",
                              "be",
                              "because",
                              "been",
                              "before",
                              "being",
                              "below",
                              "between",
                              "both",
                              "but",
                              "by",
                              "can",
                              "couldn",
                              "couldnt",
                              "d",
                              "did",
                              "didn",
                              "didnt",
                              "do",
                              "does",
                              "doesn",
                              "doesnt",
                              "doing",
                              "don",
                              "dont",
                              "down",
                              "during",
                              "each",
                              "few",
                              "for",
                              "from",
                              "further",
                              "had",
                              "hadn",
                              "hadnt",
                              "has",
                              "hasn",
                              "hasnt",
                              "have",
                              "haven",
                              "havent",
                              "having",
                              "he",
                              "her",
                              "here",
                              "hers",
                              "herself",
                              "him",
                              "himself",
                              "his",
                              "how",
                              "i",
                              "if",
                              "in",
                              "into",
                              "is",
                              "isn",
                              "isnt",
                              "it",
                              "its",
                              "its",
                              "itself",
                              "just",
                              "ll",
                              "m",
                              "ma",
                              "me",
                              "mightn",
                              "mightnt",
                              "more",
                              "most",
                              "mustn",
                              "mustnt",
                              "my",
                              "myself",
                              "needn",
                              "neednt",
                              "no",
                              "nor",
                              "not",
                              "now",
                              "o",
                              "of",
                              "off",
                              "on",
                              "once",
                              "only",
                              "or",
                              "other",
                              "our",
                              "ours",
                              "ourselves",
                              "out",
                              "over",
                              "own",
                              "re",
                              "s",
                              "same",
                              "shan",
                              "shant",
                              "she",
                              "shes",
                              "should",
                              "shouldve",
                              "shouldn",
                              "shouldnt",
                              "so",
                              "some",
                              "such",
                              "t",
                              "than",
                              "that",
                              "thatll",
                              "the",
                              "their",
                              "theirs",
                              "them",
                              "themselves",
                              "then",
                              "there",
                              "these",
                              "they",
                              "this",
                              "those",
                              "through",
                              "to",
                              "too",
                              "under",
                              "until",
                              "up",
                              "ve",
                              "very",
                              "was",
                              "wasn",
                              "wasnt",
                              "we",
                              "were",
                              "weren",
                              "werent",
                              "what",
                              "when",
                              "where",
                              "which",
                              "while",
                              "who",
                              "whom",
                              "why",
                              "will",
                              "with",
                              "won",
                              "wont",
                              "wouldn",
                              "wouldnt",
                              "y",
                              "you",
                              "youd",
                              "youll",
                              "youre",
                              "youve",
                              "your",
                              "yours",
                              "yourself",
                              "yourselves"
                           ]
                        }
                     }
                  ]
               }
            }
         ]
      }
   },
   "sort":[
      {
         "_score":{
            "order":"desc"
         }
      }
   ],
   "size":5,
   "from":0,
   "_source":{
      "excludes":[
         "text"
      ]
   }
}
mlissner commented 4 weeks ago

Here's the response:

results1.json

Instant, but I haven't looked at the quality.

albertisfu commented 4 weeks ago

:/ Well, actually the response didn't return any results. The content in the response is the query profile, and it shows that nothing was matched:

"query": [
              {
                "type": "MatchNoDocsQuery",
                "description": """MatchNoDocsQuery("pure negative BooleanQuery")""",
                "time_in_nanos": 3315,
                "breakdown": {
                  "set_min_competitive_score_count": 0,
                  "match_count": 0,
                  "shallow_advance_count": 0,
                  "set_min_competitive_score": 0,
                  "next_doc": 0,
                  "match": 0,
                  "next_doc_count": 0,
                  "score_count": 0,
                  "compute_max_score_count": 0,
                  "compute_max_score": 0,
                  "advance": 0,
                  "advance_count": 0,
                  "count_weight_count": 0,
                  "score": 0,
                  "build_scorer_count": 11,
                  "create_weight": 550,
                  "shallow_advance": 0,
                  "count_weight": 0,
                  "create_weight_count": 1,
                  "build_scorer": 2765
                }
              }
            ]

I think the query requires more tuning based on the index size. We can try increasing "max_doc_freq": 1000 to 10000 or 100000, since the selected terms may appear in many more documents than that.
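For context on how restrictive that cutoff is, terms that appear in more than max_doc_freq documents are ignored by more_like_this, so it helps to compare it against the index size. A minimal sketch, assuming the elasticsearch-py client:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Terms appearing in more than max_doc_freq documents are skipped by MLT,
# so at production scale a cutoff of 1000 can exclude most useful terms.
total = es.count(index="opinion_index")["count"]
print(f"{total} documents indexed; compare this against max_doc_freq")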

And perhaps more debugging will be required to continue tuning the query. It'd be easier to debug this issue with access to run queries directly against the cluster; hopefully soon!

mlissner commented 4 weeks ago

Ah yes, my bad. I just increased max_doc_freq to 10000000, but continued getting no results. Let's put this on hold until we have access and you can debug more easily.