fix(elasticsearch): Remove size parameter from auxiliary queries

ERosendo commented 1 week ago

The test_multiple_alerts_email_hits_limit_per_alert test failed because our logic mistakenly included nested documents and failed to accurately distinguish docket-only hits when generating email alerts.

Upon inspecting the helper function responsible for processing individual alert hits, I noticed that the identification of docket-only hits relied on the results of auxiliary queries.

When comparing the main and auxiliary docket queries, I noticed some inconsistencies. Results were not always returned in the same order, and the auxiliary query occasionally omitted elements included in the main query. The auxiliary query was missing elements because the SCHEDULED_ALERT_HITS_LIMIT setting restricted the number of results and elements were ordered differently.

This PR addresses the issue by removing the size parameter from auxiliary queries. This ensures that these queries always return all matching IDs, regardless of their order, allowing for accurate identification of docket-only hits and cross-object hits.

sentry-io[bot] commented 1 week ago

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: cl/lib/elasticsearch_utils.py

Function	Unhandled Issue
`do_es_sweep_alert_query`	ApiError: N/A cl.lib.elasticsearch_utils in do_es... `Event Count:` 7

_{Did you find this useful? React with a 👍 or 👎}

mlissner commented 1 week ago

Thanks. This looks fine to me. From a process perspective, I set Alberto as the reviewer, but left it assigned to you. I'm thinking it's his job to review, but you should stay assigned as the one paying attention to it and being on top of it. Feel right?

mlissner commented 1 week ago

Another tweak, sorry. I'm putting both of you as assignees. This way it shows up under the reviewer's name in the By Assignee tab in the backlog. Figuring things out!

mlissner commented 1 week ago

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

albertisfu commented 1 week ago

I don't completely understand the fix/issue here, but would it be easier to just say, "You can only get XXX results per alert per day?" Feels like beyond some threshold, the alert system is being used in a really weird way. Maybe we just set that number and then document it somewhere?

Yes, we already have this limit defined by SCHEDULED_ALERT_HITS_LIMIT (currently set to 20 by default). I think we just need to document that this is the maximum number of cases an alert can display.

However, the issue here is slightly different. It's related to the the auxiliary queries we perform to classify alerts based on their hits: Docket-only, Cross-object, or RECAPDocument-only so that we can determine whether to display nested hits in the results. Additionally, this classification allows us to mark in their Redis SET if an alert has been triggered by a docket or RD. This ensures the alert is not triggered again by the same document in the future. So the issue here involves the ordering of documents in the results for these auxiliary queries when the scores are identical.

ERosendo commented 3 days ago

@albertisfu Sorry for all the back-and-forth on this PR. I tried your suggestion, but it didn't quite fix the issue.

In this test, we create four dockets, with only one having two recap documents. When I used the sorting rule you suggested (_score descending and timestamp descending), the result with the two child documents always came first.

I checked the explanation (Explain API) and saw that its score is the sum of the score from the child documents search and the score from matching the parent document itself. This explains why it ended up first.

However, when we run the auxiliary query, we can't replicate the same order using the same sorting rule. That's because the auxiliary query doesn't use a nested approach, and the scores are calculated differently. So, when we limit the results, we might unintentionally exclude documents that were included in the main query.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

albertisfu commented 3 days ago

Yeah you're totally right. The scores from the main query will never match those from the auxiliary queries due to nested documents in the main query, so that approach isn’t a solution.

Do you have any other ideas to avoid removing the limit from auxiliary queries?

This is indeed complicated, but I’ve got an idea. As I mentioned in a previous comment, I think the problem described is a corner case that might occur rarely, but it can still happen. Since the solution I’m proposing below would require performing an additional ES query, we could limit its application to only those cases where it’s necessary and continue using your solution for the majority of requests.

The described issue will arise only when any of the auxiliary queries return exactly 10,000 results, which is the default maximum number of results returned by ES. If we receive that many results from any of the auxiliary queries, it indicates the possibility of more results being excluded, which could cause problems when filtering the main query's results.

In such cases, if one of the auxiliary queries returns 10,000 results, we’d need to repeat that auxiliary query or both (if both returned 10,000 hits) and include an additional filter as follows:

If the auxiliary query returning 10,000 results is the docket-only query: Extract all the docket_ids from the main query results and use them as a filter to refine the docket-only query. The filter should look like:

Q("terms", docket_id=docket_ids_list)

If the auxiliary query returning 10,000 results is the RECAP document-only query: Extract all the RECAPDocument IDs from the main query results. To do this, retrieve all the IDs within child_docs from the main query results and use them as a filter to refine the RECAP document-only query. The filter should look like:

Q("terms", id=rd_ids_list)

Once the required auxiliary query (or both) is updated with the additional filters, we'd need to repeat the query. This will ensure that the auxiliary queries return accurate results, as they now consider the docket_ids or RECAPDocument ids from the main query. So results returned won't miss any document.

Let me know what do you think.

Thank you!

ERosendo commented 2 days ago

Let me know what do you think.

Sounds good! I'll start working on the new code

freelawproject / courtlistener

fix(elasticsearch): Remove size parameter from auxiliary queries #4690

🔍 Existing Issues For Review