albertisfu opened this issue 3 months ago

While working on #4211, I noticed that the related opinions in the "Related Case Law" section of the Opinion detail page are not shown in the ES version.
Example: https://www.courtlistener.com/opinion/1472349/roe-v-wade/?type=o&q=Roe%20v.%20Wade&type=o&order_by=score%20desc
If you look at it while logged out (Solr version), you can see the "Related Case Law" sidebar with opinions. However, if you view the same opinion while logged in as staff (ES version), the related opinions are not shown.
I believe this could be a caching issue, or the more_like_this query might not be matching the right results.
@mlissner We could start by checking whether deleting the mlt-cluster-es cache entry for this cluster helps. Additionally, we can run this query in Kibana to make sure it returns the correct results:
GET /opinion_index/_search?pretty
Clearing the cache didn't help.
The query you requested returned:
{
  "took": 51,
  "timed_out": false,
  "_shards": {
    "total": 30,
    "successful": 30,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "max_score": null,
    "hits": []
  }
}
Hm... :/
It's strange that the query above didn't return any results. I cloned cluster 1472349 from CL, along with five of its related clusters shown by Solr. After indexing them into ES locally, the cluster displayed the related clusters using the query above, so something else might be happening.
Just to confirm, was the query executed against the opinion_index?
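One quick sanity check, since more_like_this with an {"_id": ...} entry in like pulls its terms from a document in the index being searched (unless an _index is given explicitly): confirm the seed document actually exists there, e.g.:

GET /opinion_index/_doc/o_1472349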
I also simplified the query by removing everything that's not required. Can we test it, please?
GET /opinion_index/_search?pretty
{
  "query": {
    "bool": {
      "should": [
        {
          "has_child": {
            "type": "opinion",
            "score_mode": "max",
            "query": {
              "bool": {
                "filter": [
                  {
                    "terms": {
                      "status.raw": ["Published"]
                    }
                  }
                ],
                "should": [
                  {
                    "more_like_this": {
                      "fields": [
                        "court",
                        "court_id",
                        "citation",
                        "judge",
                        "caseNameFull",
                        "caseName",
                        "status",
                        "suitNature",
                        "attorney",
                        "procedural_history",
                        "posture",
                        "syllabus",
                        "type",
                        "text",
                        "docketNumber"
                      ],
                      "like": [
                        { "_id": "o_1472349" }
                      ],
                      "min_term_freq": 1,
                      "max_query_terms": 12
                    }
                  }
                ],
                "minimum_should_match": 1
              }
            },
            "inner_hits": {
              "name": "filter_query_inner_opinion",
              "size": 0,
              "_source": {
                "excludes": ["text"]
              }
            }
          }
        }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "desc" } }
  ],
  "size": 5,
  "track_total_hits": false,
  "from": 0,
  "_source": {
    "excludes": []
  }
}
This other version is optimized and also incorporates additional parameters for the MLT query that are used in the Solr version:
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              {
                "terms": {
                  "status.raw": ["Published"]
                }
              },
              {
                "match": {
                  "cluster_child": "opinion"
                }
              },
              {
                "more_like_this": {
                  "fields": [
                    "court",
                    "court_id",
                    "citation",
                    "judge",
                    "caseNameFull",
                    "caseName",
                    "status",
                    "suitNature",
                    "attorney",
                    "procedural_history",
                    "posture",
                    "syllabus",
                    "type",
                    "text",
                    "docketNumber"
                  ],
                  "like": [
                    { "_id": "o_1472349" }
                  ],
                  "min_term_freq": 1,
                  "max_query_terms": 10,
                  "min_word_length": 3,
                  "max_word_length": 0,
                  "max_doc_freq": 1000
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "desc" } }
  ],
  "size": 5,
  "from": 0,
  "_source": {
    "excludes": []
  }
}
Hm, both of those queries gave the same minimal empty response as before?
Yeah, now I was able to reproduce the issue locally. After indexing a couple of thousand documents, these queries stopped retrieving results. It seems related to the document frequency of the terms queried. Another problem I can see is that the related queries can be quite restrictive now, since they include all the fields that in Solr were indexed within a single field containing all the content, so this can also be an issue. I'm testing tuning the query, and I'll also test with a field that concentrates all the document fields using the copy_to parameter and re-indexing.
OK, I did many tests around this issue. Initially, I thought the problem was directly related to field fragmentation in the MLT query, since the Solr version only looked up a single field. Now the MLT fields are: court, court_id, citation, judge, caseNameFull, caseName, status, suitNature, attorney, procedural_history, posture, syllabus, type, text, and docketNumber.
I added a field that combines all of these fields into a single one using the copy_to parameter, so the new field can be populated by re-indexing.
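For reference, a minimal sketch of the mapping shape this implies; the index name opinion_index_v2, the combined field name all_text, and the two sample source fields are illustrative placeholders rather than the actual names used in the PR:

// Hypothetical new index whose mapping funnels source fields into one combined field
PUT /opinion_index_v2
{
  "mappings": {
    "properties": {
      "all_text": {
        "type": "text",
        "analyzer": "text_en_splitting_cl"
      },
      "caseName": {
        "type": "text",
        "copy_to": "all_text"
      },
      "syllabus": {
        "type": "text",
        "copy_to": "all_text"
      }
    }
  }
}

// copy_to only fires at index time, so the combined field stays empty
// until documents are (re)indexed into the new index
POST /_reindex
{
  "source": { "index": "opinion_index" },
  "dest": { "index": "opinion_index_v2" }
}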
However, even after adding this new field and using it to perform the more_like_this query, the query continued to behave incorrectly: it matched many documents that seemed irrelevant, or didn't match any at all, depending on the settings used to select relevant terms.
Then I realized the issue was directly related to the analyzer used to index the content in this field and to perform the MLT query. Using text_en_splitting_cl as the index-time analyzer removed duplicate tokens, which impacted the selection of relevant terms, since one of the settings depends on how frequently a term appears in the content. Also, using the default search analyzer (search_analyzer) introduced more noise by expanding synonyms at search time. The two token streams can be compared directly, as sketched below.
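The _analyze API makes this kind of analyzer mismatch easy to confirm; the sample sentence below is arbitrary, and the analyzer names are the ones from our index settings:

GET /opinion_index/_analyze
{
  "analyzer": "text_en_splitting_cl",
  "text": "The court held that the judgment of the lower court stands."
}

GET /opinion_index/_analyze
{
  "analyzer": "search_analyzer_exact",
  "text": "The court held that the judgment of the lower court stands."
}

Comparing the two token lists shows whether one analyzer collapses duplicates or injects synonyms that the other doesn't.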
After using the exact analyzers, the MLT query selected more relevant terms (as can be inspected with a profiling query), and the related results seem more relevant, with better quality.
Then I realized that the problem with the MLT query across many fields could also be related to the analyzers rather than to field fragmentation. So I tweaked the fields to use their exact versions and changed the query analyzer to "search_analyzer_exact". It was also necessary to add stop words directly to the query so they aren't selected as relevant terms, because with the exact versions of the fields, stop words are no longer removed at index time.
With this query version, I got results very similar to the combined_fields approach, so perhaps the combined approach isn't required after all. We could still test it to compare result quality using production data, and performance could be better with the combined_fields approach. I added more details in the PR if we want to test that approach too:
https://github.com/freelawproject/courtlistener/pull/4316
For now, we can test this fixed query, which uses the right fields and analyzer and also includes stop words. We can analyze the quality of the results, and if they look okay and the query isn't too slow, maybe we can go with this approach without implementing #4316.
{
  "profile": true,
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [
              {
                "terms": {
                  "status.raw": ["Published"]
                }
              },
              {
                "match": {
                  "cluster_child": "opinion"
                }
              },
              {
                "more_like_this": {
                  "like": [
                    { "_id": "o_1472349" }
                  ],
                  "fields": [
                    "court.exact",
                    "court_id.exact",
                    "citation.exact",
                    "judge.exact",
                    "caseNameFull.exact",
                    "caseName.exact",
                    "status.exact",
                    "suitNature.exact",
                    "attorney.exact",
                    "procedural_history.exact",
                    "posture.exact",
                    "syllabus.exact",
                    "type.exact",
                    "text.exact",
                    "docketNumber.exact"
                  ],
                  "min_term_freq": 5,
                  "max_query_terms": 10,
                  "min_word_length": 3,
                  "max_word_length": 0,
                  "max_doc_freq": 1000,
                  "minimum_should_match": "30%",
                  "analyzer": "search_analyzer_exact",
                  "stop_words": [
                    "a", "about", "above", "after", "again", "against", "ain", "all", "am", "an",
                    "and", "any", "are", "aren", "arent", "as", "at", "be", "because", "been",
                    "before", "being", "below", "between", "both", "but", "by", "can", "couldn",
                    "couldnt", "d", "did", "didn", "didnt", "do", "does", "doesn", "doesnt",
                    "doing", "don", "dont", "down", "during", "each", "few", "for", "from",
                    "further", "had", "hadn", "hadnt", "has", "hasn", "hasnt", "have", "haven",
                    "havent", "having", "he", "her", "here", "hers", "herself", "him", "himself",
                    "his", "how", "i", "if", "in", "into", "is", "isn", "isnt", "it", "its",
                    "itself", "just", "ll", "m", "ma", "me", "mightn", "mightnt", "more", "most",
                    "mustn", "mustnt", "my", "myself", "needn", "neednt", "no", "nor", "not",
                    "now", "o", "of", "off", "on", "once", "only", "or", "other", "our", "ours",
                    "ourselves", "out", "over", "own", "re", "s", "same", "shan", "shant", "she",
                    "shes", "should", "shouldve", "shouldn", "shouldnt", "so", "some", "such",
                    "t", "than", "that", "thatll", "the", "their", "theirs", "them", "themselves",
                    "then", "there", "these", "they", "this", "those", "through", "to", "too",
                    "under", "until", "up", "ve", "very", "was", "wasn", "wasnt", "we", "were",
                    "weren", "werent", "what", "when", "where", "which", "while", "who", "whom",
                    "why", "will", "with", "won", "wont", "wouldn", "wouldnt", "y", "you",
                    "youd", "youll", "youre", "youve", "your", "yours", "yourself", "yourselves"
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    { "_score": { "order": "desc" } }
  ],
  "size": 5,
  "from": 0,
  "_source": {
    "excludes": ["text"]
  }
}
:/ Well, actually the response didn't return any results. The content in the response is the profile of the query, and it shows that nothing was queried:
"query": [
{
"type": "MatchNoDocsQuery",
"description": """MatchNoDocsQuery("pure negative BooleanQuery")""",
"time_in_nanos": 3315,
"breakdown": {
"set_min_competitive_score_count": 0,
"match_count": 0,
"shallow_advance_count": 0,
"set_min_competitive_score": 0,
"next_doc": 0,
"match": 0,
"next_doc_count": 0,
"score_count": 0,
"compute_max_score_count": 0,
"compute_max_score": 0,
"advance": 0,
"advance_count": 0,
"count_weight_count": 0,
"score": 0,
"build_scorer_count": 11,
"create_weight": 550,
"shallow_advance": 0,
"count_weight": 0,
"create_weight_count": 1,
"build_scorer": 2765
}
}
]
I think the query requires more tuning according to the index size. We could try increasing "max_doc_freq": 1000 to 10000 or 100000; maybe many documents match those terms, so they're all being discarded.
Perhaps more debugging will be required to continue tuning the query. It'd be easy to debug this issue with access to run queries directly against the cluster, hopefully soon!
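One way to check whether candidate terms are being discarded by the max_doc_freq cutoff is the _termvectors API, which reports each term's document frequency. A sketch, using text.exact as a sample field from the query above:

GET /opinion_index/_termvectors/o_1472349
{
  "fields": ["text.exact"],
  "term_statistics": true,
  "field_statistics": true,
  "positions": false,
  "offsets": false
}

The doc_freq reported for each term can then be compared against max_doc_freq to see how many terms actually survive the filter.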
Ah yes, my bad. I just increased max_doc_freq to 10000000 but continued getting no results. Let's put this on hold until we have access and you can debug more easily.
This one needs to be completed before we shut down Solr. However, we need to debug queries against the case law production index in the ES cluster.
Currently, access to Kibana and the ES endpoint for developers is not working. I asked Sergei, and he can't access them either, so something might be wrong with the Kibana/ES interface.
Didn't @flooie sort of fix this in his new opinion page that he's landing shortly?
I sent a message to Ramiro to see if he can get Kibana going again.
I put this one onto the next sprint for you, @albertisfu, so we can think about it then.