freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Add "Create an Alert" for RECAP searches #612

Open anseljh opened 8 years ago

mlissner commented 8 years ago

I thought about this, but I haven't done it for the moment. The thing that's slowing me down is that most of the PACER alert systems (like Docket Alarm) will go and check a docket for you on some sort of regular basis. I'm afraid that if we create an alert system, people will expect that kind of service. I don't think this is hard though, since we already have alerts for two object types (oral args and opinions).

I also want to create alerts for dockets themselves. This would use the same system as the regular search, just filtered to a query like docket_id:23378. (When we do that, we should add alerts for cases, so you can get an alert any time a case is cited -- this functionality already exists, but it should be a simple button on every opinion page.)

mlissner commented 6 years ago

So this turns out to have two complicated problems:

  1. Content is often added to the RECAP Archive long after it is published by the court. Some people could find this really valuable. For example, if you want to know whenever there's a document mentioning "The Onion Router", you might not care if the document is new or old or what.

  2. The way we keep Solr in sync with the database involves touching a lot of fields. Sometimes when we get a docket, we update the entire thing in Solr because we don't know which fields on the new docket we got are new. Because Solr is completely denormalized, every document in Solr has a copy of every field. For example, these might represent two documents in Solr:

    • case_id: 1, case_name: Walter v. New Mexico, document_number: 1, description: initial complaint
    • case_id: 1, case_name: Walter v. New Mexico, document_number: 2, description: response

    And then we might get an update to the docket as HTML from PACER. Well, shoot, it's possible that the case name changed, so we assume that every document for this docket in Solr needs to be updated, and that's what we do.

I think what we can do, to fix both of these problems, is to only do alerts for the text of documents. What we can do then is limit our alerts to documents that got text since the last time the alerts ran — in other words, only trigger on PDF text. We don't have to think about the problem that the docket name might have changed, and we don't have to think about the gazillions of docket entry descriptions that we otherwise would be searching against.

It also solves problem 1 because all of those files could be kept in a sidecar index or even in a little database table that could be a lot smaller.

This limits our alerts a bit, yep, but it isn't horrible and it simplifies them a bunch.

Apologies if this is a bit like confused rambling. Working through this is busting my brain a bit.

mlissner commented 6 years ago

I pondered this one some more. My first solution was:

  1. Only include an item in alerts when it gets its PDF.

That was weak, but solved part of the problem. New idea:

  1. Include an item in alerts when it gets its PDF, AND
  2. Include an item in alerts when it is first created.

That will suck when the following happens:

  1. We get the RSS feed contents. This creates a new item and populates the short description field.
  2. Alerts are run. Some are triggered, some not.
  3. We get the docket, which includes the full docket entry description. This does not trigger alerts since it's not a new item.

In other words, whenever we create the item with a subset of the fields, we won't trigger on those fields until we get the PDF associated with that document. This is...not great, but it's also not terrible. Some fields will be left out of alerts part of the time.


Another solution is to keep a solr index containing the diff of new content whenever we get it. There's probably a way to identify which fields are new when we get something and to only store those into Solr. This way, alerts would only trigger on new content, not on old, and we'd have a weird Solr index with a bunch of partial objects.

For example, say we had three fields representing pizza toppings:

  1. The first time we get data on a certain pizza, we learn that it has cheese. Great. We add the following to the Solr index:

    pizza: 1, topping1: cheese, topping2: '', topping3: '',
  2. A moment later, we learn that it also has mushrooms. Great, we update our regular Solr index and our regular database, and we add another item to our search Solr index:

    pizza: 1, topping1: '', topping2: mushrooms, topping3: ''
  3. A moment later, we get a great upload about the pizza that tells us all of its toppings. We compare this to our database, and it turns out we already knew about toppings 1 & 2, but topping 3 is news, so we add this to our Solr index:

    pizza: 1, topping1: '', topping2: '', topping3: onions

In other words, this index only keeps track of new information since the last time the item was updated. It's a diff index, if you will.

Next, we run alerts, and we search for "Any pizza with onions". We run it against all of the fields, topping1, topping2, and topping3, and we learn that pizza #1 is a match! Great. We return that result.

This seems like it could work, assuming that doing the diffs as I describe them here isn't too terribly difficult.

I don't love it because it's complicated, but nothing is ever easy.
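To make the diff-index idea concrete, here is a minimal Python sketch of the field-level diff it would rely on; the field names and the surrounding indexing plumbing are hypothetical.

    def diff_fields(old: dict, new: dict) -> dict:
        """Return only the fields whose values are new or changed since `old`."""
        return {field: value for field, value in new.items() if old.get(field) != value}

    # With the pizza example above:
    stored = {"pizza": 1, "topping1": "cheese", "topping2": "mushrooms", "topping3": ""}
    incoming = {"pizza": 1, "topping1": "cheese", "topping2": "mushrooms", "topping3": "onions"}
    diff_fields(stored, incoming)  # {'topping3': 'onions'} -- only the new data goes to the diff index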

mlissner commented 5 years ago

Another idea that could solve this neatly. We keep a table of the documents that are triggered for a given alert. So in essence, you only can get an alert for a document once, no matter how many times it gets new pieces of data.

I think this is the solution to this issue I've been looking for.

mlissner commented 5 years ago

Here's a service selling these alerts and saying they're "economical" at about $10 each / month:

https://www.courtalert.com/Business-Development-Realtime-Federal-Complaints.asp

We should really get this figured out.

mlissner commented 4 years ago
  1. Keep a Solr index just for alerts. It contains one day's worth of results and is cleared each morning at 2am. That'll be about XXX items.

  2. Keep a redis or DB list containing all of the items that have been triggered for a given alert. Use this for two purposes:

    1. To avoid sending somebody an alert for the same item more than once.
    2. To load paginated alert results on CourtListener, in case an alert has more than 20 hits, say, and can't be sent in its entirety via email.
  3. Only allow alerts to be created if they don't trigger too many results. If it's more than about 50/day, there's just not much point in an alert. Handle this in the UI and trust users not to work around it.


Features:

anseljh commented 4 years ago

@mlissner why's this one closed? Did you decide not to do it, or...? I'm not quite following the comments.

mlissner commented 4 years ago

This issue is open and the top one I'm working on when I can find time.

anseljh commented 4 years ago

Ha. You're right. I'm a doofus.

mlissner commented 1 year ago

Lots more discussion on this today. A few things to note and reiterate:

  1. We have to remember that documents and dockets come to us piece by piece. This means that we cannot just get a new docket or document and send an alert for it. Instead, we need to only send alerts once per docket and once per document. Imagine an alert for "Foo", and a docket that comes in with the case name "foo v. bar." Cool, we send an alert. Then we get the parties for the docket, which includes "Foo" again, so we shouldn't send an alert for the docket again. Later, we get docket entry description, which mentions the term "Foo". We do send an alert for that. Later we get the document text to go along with the docket entry. We do not send an alert for that.

  2. Parent-child searches are not possible in Elastic Search percolators. This is unfortunate because we need to use percolators to trigger alerts on the RECAP and Opinions databases. To solve this, we discussed two possible ways forward:

    • We create three percolator indexes. One for dockets, one for documents, and a third for parties. Then we split up alerts to only save the relevant fields into the correct index. When we get a new document, we percolate it against all of the percolators, and send an alert if they all hit. Unfortunately, this has a few problems:

      • If somebody has an alert like docket.case_name=FOO AND document.description=BAR, we can split that up into the two percolators and, if they both hit, send the alert email. But if a user has docket.case_name=FOO OR document.description=BAR, we should send the alert if either percolator hits. The only way to know the difference between these queries is to parse them, which is a bad idea.
      • Somebody else might have an alert like docket.case_name=Foo. In that case, we only need to run against the docket percolator, and send alerts based on that. Kind of complicated to sort out.
    • A different approach is to flatten documents before percolating them against a single flattened percolator. This is OK, but it will have false positives because it loses structure. Imagine a query for Party=Mike AND Firm=Sibley; in English you could think of this as "Cases where Mike is represented by Sibley". In a parent-child index, this will work because parties have structure:

      docket : {
        parties: [
          {name: Mike, firm: sibley}
        ]
      }

      In a flattened index, you'll have false positives on dockets like:

      docket : {
        parties: [
          {name: Mike, firm: walworth},
          {name: Sam, firm: Sibley}
        ]
      }

      This is because the flattened version will be:

      docket : {
        parties: [Mike, Sam],
        firms: [Walworth, Sibley],
      }

      I think this is enough of an edge case that we can just document it, and it should be OK. The good news is it's false positives — we'll send too many alerts. ARE THERE CASES WHERE WE'D HAVE FALSE NEGATIVES?

      • [ ] Document false positives across parent-child relationships in a way humans can understand.

So the plan going forward is:

albertisfu commented 6 months ago

@mlissner Here is a summary of the different features and requirements of this project we've been discussing, a brief overview of the architecture we could use, and some questions so we can agree on the approach and start working on this project.

Since the percolator doesn't support join queries such as has_child or has_parent queries and also because we want to send Docket alerts and RECAPDocument alerts independently, we're going to create two percolator indices:

DocketDocumentPercolator

This index will include all the Docket fields that are currently part of the DocketDocument, including parties. Notice that the parties in this mapping are already flat since we're not using a child document to represent parties. So, false positives like the one in the example you described above (Party=Mike AND Firm=Sibley) are possible.

ESRECAPDocumentPercolator

This index will include the same fields as ESRECAPDocument, which are the RECAPDocument fields plus its parent Docket fields except for parties.

We won't need to perform an additional flattening process either for the percolator mapping or the documents ingested before percolating them.

Following this approach, when a new Docket is created/updated, it will be percolated into the DocketDocumentPercolator index to check if a query matches the Docket and trigger alerts containing only Dockets.

When a new RECAPDocument is added/updated, the ESRECAPDocument will be percolated into the ESRECAPDocumentPercolator index and trigger alerts containing only RECAPDocuments.
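For illustration only, here is a rough elasticsearch-dsl sketch of what one of these percolator indices could look like and how a new docket would be percolated against it. The index name and the handful of fields are placeholders, not the real DocketDocument mapping.

    from elasticsearch_dsl import Document, Keyword, Percolator, Search, Text

    class DocketDocumentPercolator(Document):
        # Each saved alert is stored as a query in a percolator field.
        percolator_query = Percolator()
        # Docket fields that incoming documents will be matched against.
        case_name = Text()
        docket_number = Keyword()
        party = Text()

        class Index:
            name = "docket_alerts_percolator"  # placeholder name

    # Percolate a newly ingested/updated docket against every saved alert query:
    search = Search(index="docket_alerts_percolator").query(
        "percolate",
        field="percolator_query",
        document={"case_name": "Andromeda T. Pearson", "docket_number": "2:16-cv-00501"},
    )
    matching_alerts = search.execute()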

However, here are the limitations I found with this approach.

Consider this query: https://www.courtlistener.com/?q=&type=r&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22&description=%22Notice%20of%20Motion%22

It returns 2 Dockets and 17 Docket entries in the frontend.

But the ESRECAPDocumentPercolator will only be able to check RECAPDocuments, which is how the V4 rd type returns results: https://www.courtlistener.com/api/rest/v4/search/?q=&type=rd&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22&description=%22Notice%20of%20Motion%22

It matches the 17 RECAPDocuments. This is possible because the case_name is indexed within every RECAPDocument.

The problem is the following: Imagine a Docket with 10,000 RECAPDocuments.

The initial Docket case_name is "Andromeda T." The RECAPDocuments that belong to this Docket have different descriptions.

Then we receive an upload that updates the Docket case_name to: "Andromeda T. Pearson".

We'll percolate the Docket into DocketDocumentPercolator and if there is a query like:

https://www.courtlistener.com/?q=&type=r&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22

It'll be matched and trigger the alert; that's good.

However, now all the RECAPDocuments have also been updated with the new Docket case_name.

So, it's possible that many of those RECAPDocuments now can match the query:

case_name="Andromeda T. Pearson" + description="Notice of Motion"

https://www.courtlistener.com/?q=&type=r&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22&description=%22Notice%20of%20Motion%22

To catch these matches, we'd need to percolate every RECAPDocument that belongs to the Docket to see which of them match the query, which means percolating 10,000 RECAPDocuments. I think we could use the percolator query that references the document from the original index to avoid transforming the documents to JSON and percolating from the application.

from elasticsearch_dsl import Q

# Percolate an already-indexed document by referencing its index and ID,
# instead of serializing it to JSON in the application.
Q(
    "percolate",
    field="percolator_query",
    index=document_index,
    id=document_id,
)

But this query only allows percolating one document at a time. So we'll need to repeat this query 10,000 times in the example scenario (and there can be worse scenarios with many more RDs that would need to be percolated). We can also use the multi-search API to execute many requests at once. I'll need to measure the performance of this approach, but I think it won't be as performant as we want, and it could use a lot of resources, considering that update_by_query requests to update child documents based on a parent document change are pretty common.
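As a sketch of that batching idea, assuming elasticsearch-dsl's MultiSearch and the percolator field above (the index names and chunking strategy are placeholders), it could look roughly like this:

    from elasticsearch_dsl import MultiSearch, Q, Search

    def percolate_indexed_documents(rd_ids, percolator_index, documents_index):
        """Percolate already-indexed RECAPDocuments by ID in a single msearch round trip."""
        ms = MultiSearch(index=percolator_index)
        for rd_id in rd_ids:  # in practice, very large ID lists would be chunked
            ms = ms.add(
                Search().query(
                    Q("percolate", field="percolator_query", index=documents_index, id=rd_id)
                )
            )
        return ms.execute()  # one response (the matched alert queries) per document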

Another limitation involves party filters. The query

https://www.courtlistener.com/?q=id%3A374880858&type=r&order_by=score%20desc&party_name=KENNETH%20E.%20WEST

can match a RECAPDocument by its ID where its parent Docket contains the party "KENNETH E. WEST". But regarding alerts, if the user saves this query as an alert, it won't ever get a hit.

This problem could be solved only by applying a join query, but unfortunately, they’re not supported by the percolator.

So maybe we could just inform users that an Alert involving a query with a party filter won't work the same way as in the frontend or the API. Basically, if they want results consistent with the ones they could get in the frontend or the API, they should only mix party filters with other Docket fields and avoid using a query string, since a query string could also match RECAPDocument fields.

Or we could also identify whether an Alert query has a party field and alert the user during its creation about its limitations or avoid creating it.

Changes in the UI:

We’ll need to have a UI where users can save RECAP Search Alerts. During the creation, they can decide if they want to match Dockets or RECAPDocuments or both.

Something like:

(Screenshot of a proposed alert-creation UI omitted.)

So we could create one or two alerts with the same query, one for the Docket alert and/or one for the RECAPDocument alert.

The query version we’ll store in the percolator will be the one specific to the document type, excluding all the join queries. We already have these queries that are used to get the Docket and RECAPDocument counts separately. So these queries can be indexed to their percolator index, either DocketDocumentPercolator or RECAPDocumentAlertPercolator.

Avoid triggering duplicate alerts.

We need to avoid an alert being triggered more than once by the same Docket.

To do that, we planned to use a bloom filter to keep track of the alerts that have been sent so they're not triggered more than once.

However, I think the bloom filter is possibly not the right approach.

We could have a global bloom filter to store Docket-Alert pairs so we can know when a pair has already been triggered and avoid triggering it again. The problem with this global filter is that it'll grow too fast, since a new element will be added for every alert that is triggered.

So it'd be better to have one bloom filter for each Alert in the database so it can store the docket_ids that have triggered that alert. In that way, we'd have an equal number of bloom filters to Alerts but they'll be small.

But the problem I see with the bloom filter is false positives. We'll add the docket_id to the filter once the Docket has triggered that alert; a false positive (the filter reporting that an ID is present when it actually is not) would lead us to skip alerts that should be sent.

Since false negatives are not possible, there is no possibility of duplicate alerts, which is good. But I recall we discussed that it's more important to not miss any alerts.

We could reduce the probability of getting a false positive by selecting a big bloom filter size and a good hash function, but possibly it's better to just use a SET.

So the alternative approach is just to create a Redis SET for each alert and store each docket_id that triggered the alert:

alert_1: (400, 5600, 232355, 434343, etc.)

Adding new elements to the set or checking if an ID is already in the set can be done in constant time.

SISMEMBER alert_1 400
True

So if an ID is already in the SET, we just omit sending the alert.
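A small redis-py sketch of that check; the key naming is illustrative:

    import redis

    r = redis.Redis()

    def should_send_alert(alert_id: int, docket_id: int) -> bool:
        """Send the alert only the first time a given docket triggers it."""
        # SADD returns 1 if the member was newly added, 0 if it was already in the set,
        # so the membership check and the insert happen in one O(1) operation.
        return r.sadd(f"alert_{alert_id}", docket_id) == 1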

Grouping alerts.

Another requirement is grouping RT alerts whenever possible, according to #3102. As described in that issue, I think the only way to achieve this if we end up using the percolator (with the inverse query approach this won't be required) is to add a wait before sending the alerts, so if more alerts are matched during the waiting time, we can send them in a single email.

Let me know what you think.

mlissner commented 6 months ago

About percolators and parent-child queries...

This is a real bummer and you're right that it comes with a bunch of tradeoffs. From a design perspective, I want this to be as seamless as possible. Ideally, people do a query in the front end, create an alert, and it works with some minor tradeoffs or imperfections. I'm afraid that where we're headed is:

If that's where we land, I think we're in trouble, so we have some work ahead of us to sort this out.

I did a little research on the parent-child percolator, and one person said they could use nested queries against the percolator. Is that a crazy idea?

The other alternative is to abandon the percolator approach and use the inverse query approach.

This is really not performant, and I want organizations to be able to make 10,000 alerts each, creating millions of alerts. If each alert takes 1s, it'll never work, or at best it'll take a huge number of servers. It's also a bummer that it's not actually real time.

About changes in the UI...

I'd like to avoid users even thinking about dockets vs. documents when they make alerts, but it could be the solution we need. Maybe instead of one button to create alerts, we offer two:

  • Create docket alert
  • Create document alert

If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly.

I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)

Q: Can we robustly identify when somebody is making a cross-object query?

On not sending dups...

Yeah, bloom filters would have been fun. Someday. Redis sets it is.

About grouping alerts...

Spec:

  • Five minute groups for emails. No wait for webhooks.

Thank you!

albertisfu commented 6 months ago

I did a little research on the parent-child percolator, and one person said they could use nested queries against the percolator. Is that a crazy idea?

In this case, we need to convert the parent-child queries to nested queries and evaluate if they match the same documents or if this conversion results in some false positives or false negatives.

However, this approach might still have some performance issues to evaluate. For instance, to percolate a document against the nested-query percolator, we need a document structured as a parent plus nested children, which means creating a JSON document in memory representing the Docket with its RECAPDocuments. This would be fine for small cases, but the JSON object can be massive for large dockets with thousands of RECAPDocuments. Then, we need to send this document to the percolator and hope it's performant enough. We'll need to do this because we won't be able to reference the already indexed document to percolate it: the Dockets and RECAPDocuments currently indexed in ES have a different structure than what a nested-query percolator requires.

I'll do some tests around this idea to measure its performance.

I'd like to avoid users even thinking about dockets vs. documents when they make alerts, but it could be the solution we need. Maybe instead of one button to create alerts, we offer two:

Create docket alert / Create document alert. If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly.

Great, I like the idea of offering two different buttons to create these alerts. I'll propose some ideas about where and how we can place these buttons in the UI instead of the current bell icon.

I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)

Does this mean that, if we can't find a good solution to percolate the original frontend query without the problems described above, we'll end up offering Docket Alerts that only match Docket fields and Document Alerts that only match RECAPDocument fields?

Q: Can we robustly identify when somebody is making a cross-object query?

I'm afraid this is not possible. We can robustly identify whether the query contains combined parent and child filters, or, within a query string, whether the user is using advanced syntax:

For instance: https://www.courtlistener.com/?q=docketNumber%3A%222%3A16-cv-00501%22&type=r&order_by=score%20desc&case_name=Bank%20of%20America&description=Expedited%20Motion

The problem is that we cannot identify a cross-object query in simple query strings. For instance: q: Bank of America Expedited Motion

https://www.courtlistener.com/?q=Bank%20of%20America%20Expedited%20Motion&type=r&order_by=score%20desc

This query can match the string within some Docket fields or RECAPDocument fields. For example, it can match a docket with part of the string in the case_name (and other parent fields) and also match RECAPDocuments with the whole or partial string within the plain_text description, etc.

In cases like these, it's impossible to know (without performing the actual query) whether the query can match only Dockets, only RECAPDocuments, or both.

Spec:

Five minute groups for emails. No wait for webhooks.

Great!

Thanks for your answers and suggestions to explore.

mlissner commented 5 months ago

but the JSON object can be massive for large dockets with thousands of RECAPDocuments

At first, I was thinking that if you got changes to a docket, you could just percolate only the docket info, without any documents at all, and that if you got changes to a document, you could just nest that one document within the docket.

But now I'm realizing that if you have a query like:

docket_name: foo
plaintext: bar

You might get this information today:

docket.case_name: baz
document.text: bar

You wouldn't send an alert, because the docket_name doesn't match. But tomorrow the name might be updated to foo, and you'd want to send the alert.

I think that implies that:

  1. We can do docket-only and document-only alerts reliably using nested queries
  2. We can do cross-object alerts reliably when the new data is a document (just nest it in the docket and do the query).
  3. But we can NOT do cross-object alerts reliably when the docket is updated unless we create huge nested objects to percolate.

If that's right, I think we're getting close to a solution.
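A minimal sketch of case 2 from the list above: when a new document arrives, nest just that one document inside its docket's fields and percolate the combined object. The index name, field names, and percolator field here are assumptions.

    from elasticsearch_dsl import Search

    def percolate_new_document(docket, rd):
        """Percolate a single new RECAPDocument nested inside its parent docket."""
        percolate_doc = {
            "case_name": docket.case_name,
            "docket_number": docket.docket_number,
            "documents": [
                {
                    "description": rd.description,
                    "plain_text": rd.plain_text,
                    "document_number": rd.document_number,
                }
            ],
        }
        search = Search(index="recap_alerts_percolator").query(  # placeholder index name
            "percolate", field="percolator_query", document=percolate_doc
        )
        return search.execute()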

There is one other strategy that we can use here, which is to create a new index each day, and to use that for a sweep (so many sweeps, lately!). The idea here is that querying 500M items is really hard and slow. The only thing you really need to query is the new stuff of the day. So, what you do is:

  1. All day long, you add new/changed content to two indexes, the regular one and a daily one.
  2. At midnight, you run all your cross-object queries against the tiny one.
  3. If you get alerts, you check whether those were sent out earlier in the day.
  4. If not, you send alerts.
  5. You empty the tiny one and start over the next day.

If we do that in addition to the nested queries, we'd be sure to get everything, and we'd have a somewhat performant solution, since we'd only be querying against a couple hundred thousand items.

Most alerts would be real time. Even some cross-object ones would be, and the corner case would be covered.
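A rough sketch of how that nightly sweep could run, assuming a 7.x-style elasticsearch-py client, cross-object alert queries stored in the database, and the Redis set described earlier for de-duplication; the helper names here are hypothetical.

    def run_nightly_sweep(es, daily_index, cross_object_alerts, already_sent, send_alert_email):
        """Run each cross-object alert query against the small daily index and
        send only the hits that weren't already alerted earlier in the day."""
        for alert in cross_object_alerts:
            response = es.search(index=daily_index, body={"query": alert.query})
            new_hits = [
                hit for hit in response["hits"]["hits"]
                if not already_sent(alert.id, hit["_id"])
            ]
            if new_hits:
                send_alert_email(alert, new_hits)

        # Empty the daily index so tomorrow starts fresh.
        es.delete_by_query(index=daily_index, body={"query": {"match_all": {}}})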

What do you think?

This means that if we can't find a good solution to percolate the original frontend query without the problems described above, we'll end up offering Docket Alerts that only match Docket fields and Document Alerts that only match RECAPDocument fields?

Yes. Kind of lame though.

In cases like these, it's impossible to know (without performing the actual query) whether the query can match only Dockets, only RECAPDocuments, or both.

So that would be considered a cross-object query, because it queries across more than one object type.

albertisfu commented 5 months ago

You wouldn't send an alert, because the docket_name doesn't match. But tomorrow the name might be updated to foo, and you'd want to send the alert.

Yeah, exactly. The problem is directly related to docket field updates that can impact cross-object queries.

I think that implies that:

We can do docket-only and document-only alerts reliably using nested queries We can do cross-object alerts reliably when the new data is a document (just nest it in the docket and do the query). But we can NOT do cross-object alerts reliably when the docket is updated unless we create huge nested objects to percolate. If that's right, I think we're getting close to a solution.

Yeah, the nesting you have in mind (nesting the document into the docket and percolating it) is to allow us to match any cross-object query, including parties, correct?

Because most of the docket fields (except parties) are indexed into each RECAPDocument, a plain query will be enough to match most queries except for those that include party filters. If so, I agree the nested query seems the right solution.

But we can NOT do cross-object alerts reliably when the docket is updated unless we create huge nested objects to percolate.

Yeah, that's correct.

There is one other strategy that we can use here, which is to create a new index each day, and to use that for a sweep (so many sweeps, lately!). The idea here is that querying 500M items is really hard and slow. The only thing you really need to query is the new stuff of the day. So, what you do is:

This is a pretty good idea!

Just some questions:

At midnight, you run all your cross-object queries against the tiny one.

So we'll need to categorize the alerts into two types: cross-object alerts and non-cross-object alerts. Cross-object alerts will be all the queries that include either:

If you get alerts, you check if those sent out earlier in the day.

Got it. I think we can use the same set in Redis proposed to avoid duplicates. So we'll have one set per alert that will store either Docket IDs or RECAPDocument IDs. This way, it won't matter if the alert was triggered today or in previous days; it won't be triggered again, avoiding duplicates for both the normal process and the midnight sweep.

One question here is how are we going to tag/schedule alerts sent at midnight. We have four alert rates:

In the percolator approach in OA, we do the following:

We trigger webhooks in real-time for all the rates.

I think we can do the same for alerts that are matched in real-time by the percolator.

But what would happen, for instance, for RT cross-object alerts that were missed during the day? Once they hit at midnight, will we group all the missed alerts during the day for a user and send a single email? What would we call that email? Because it is not a real-time email anymore, nor is it a daily email, as the alerts don't belong to the daily rate.

If the missed alerts belong to the daily rate, maybe we could execute the midnight sweep and see if some of the daily alerts had hits, then append those hits to the scheduled hits during the day via the percolator and send a single daily email.

For weekly and monthly rates, I think it can work similarly. Use the midnight sweep to store and schedule the hits according to the rate so they can be sent every week or month alongside the ones scheduled by the percolator.

Webhooks

Regarding webhooks, once missed hits are matched at midnight, should we send all of their related webhooks at once, regardless of their rate? Which rate should we put in the payload for these?

mlissner commented 5 months ago

Yeah, the nesting you have in mind (nesting the document into the docket and percolating it) is to allow us to match any cross-object query, including parties, correct?

Yes.

This is a pretty good idea!

I've been thinking about this for years, but I was hoping not to have to do this, so hadn't mentioned it. But here we are. :)

So we'll need to categorize the alerts into two types: cross-object alerts and non-cross-object alerts.

Yeah, I think so, but if we do a sloppy job that says some docket-only or document-only alerts are actually cross-object, that'd be fine, right? We'd run an extra query, but wouldn't send extra alerts. So long as we err in that direction, we should be fine?

One question here is how are we going to tag/schedule alerts sent at midnight.

Pretty simple. We run our sweep, and send an email with the sweep results. We put extra words in the subject and body to explain what it's about. We continue doing everything with the daily, weekly, and monthly alerts same as before.

Regarding webhooks, once missed hits are matched at midnight, should we send all of their related webhooks at once, regardless of their rate?

Sure, or you can send them in separate payloads. Whatever is easier. I assume it's easier to keep these processes separate.

Which rate should we put in the payload for these?

Real time, and then we document the situation by saying:

"Sometimes cross-object real time alerts will arrive at the end of the day. This is because blah, blah..."

What else??? :)

albertisfu commented 5 months ago

Yeah, I think so, but if we do a sloppy job that says some docket-only or document-only alerts are actually cross-object, that'd be fine, right? We'd run an extra query, but wouldn't send extra alerts. So long as we err in that direction, we should be fine?

Hmm, I think in that scenario we'd miss alerts.

If we mistakenly tag cross-object queries as docket-only or document-only, those queries won't run at midnight, leading to missed hits.

On the other hand, if we mistakenly tag docket-only or document-only queries as cross-object, we'll run extra queries, but we won't send duplicates.

So, we should be careful when categorizing the queries or run the sweep over all the queries.

Pretty simple. We run our sweep, and send an email with the sweep results. We put extra words in the subject and body to explain what it's about.

Perfect, this is for the RT rate, right?

We continue doing everything with the daily, weekly, and monthly alerts same as before.

Got it. So, in this case, to continue doing everything for the daily rate as before, we'd just need to ensure the normal daily send is triggered after the midnight sweep so those hits can be included in the daily send. For the weekly and monthly rates, if we want to include the results of that day as well, they should also run after the midnight sweep. We just need to confirm if the sending time is okay because if the midnight sweep runs at 12:00 and takes 15 minutes to complete, we'll need to send the Daily, Weekly, or Monthly emails after 12:15. If that's not okay, they can be included the next day for the daily rate or the next week or month, for the other rates.

mlissner commented 5 months ago

Yeah, we want to err on the side of saying something is cross-object if we have any doubt. I agree.

Perfect, this is for the RT rate, right?

Yes, exactly.

We just need to confirm if the sending time is okay because if the midnight sweep runs at 12:00 and takes 15 minutes to complete, we'll need to send the Daily, Weekly, or Monthly emails after 12:15

Yeah, that's fine. Nobody cares if their daily/weekly/monthly alerts are exactly at midnight.

I'd suggest making this one command that does both the sweep and the daily/monthly/weekly alerts, so that it does one task, then the other without having to schedule things and hope the sweep is done before the other one triggers.
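One way to express that as a single Django management command; the command names called here are hypothetical:

    from django.core.management import call_command
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = "Run the cross-object sweep, then send the scheduled search alerts."

        def handle(self, *args, **options):
            # Running these sequentially guarantees the sweep finishes before
            # the daily/weekly/monthly sends pick up its results.
            call_command("recap_alerts_sweep")      # hypothetical command name
            call_command("send_scheduled_alerts")   # hypothetical command name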

albertisfu commented 5 months ago

Excellent! I think we now have a good plan to work on.

Thank you!

mlissner commented 5 months ago

And you. Epic!

Would it make sense to do two PRs? One for regular alerts and one for the sweep?

albertisfu commented 5 months ago

Yeah, I agree, two PRs make sense for the project!

albertisfu commented 5 months ago

@mlissner working on adding the Percolator index for RECAP, I have a couple of new questions that can impact the Percolator and the sweep index design. We plan to percolate RD documents nested within a Docket document to trigger alerts for RECAPDocuments reliably or percolate only Docket documents without any nested RD for triggering Docket-only alerts.

Maybe instead of one button to create alerts, we offer two: Create docket alert / Create document alert. If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly. I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)

Considering we'll solve the issue related to cross-object queries on document updates by using the daily midnight sweep, should we still divide alerts into two types?

Alternatively, we could have only one type of alert, "RECAP", and we could send either only the Dockets that matched, the RDs that matched (showing their docket fields), or a combination of Dockets + RDs that matched (which can be grouped):

For example, consider the following scenario (screenshot omitted).

Does it make sense for users' needs to still divide alerts into Docket and RD types? I think it would still make sense to split alerts into two types if users want to know which type of object triggered the hit in the alert they're receiving.

The following question is also a bit related but regarding the alert structure and also the percolator design.

What would be the structure of the emails if we end up offering two types of alerts to users? Or in case we go with only one alert type for RECAP.

  • Create docket alert
  • Create document alert

Using nested queries (or a plain approach I'm experimenting with) and the midnight sweep, we'll be able to send alerts for non-cross-object and cross-object queries.

I imagine the email for a document alert (RD) like this:

(Screenshot of a mockup document alert email omitted.)

In this case, imagine the RECAPDocuments don't contain the search query "United States" in any of their fields. But they will match because "United States" is within the Docket case name. Is this correct, even if users created the alert for the document alert type? I think that's correct because it follows the behavior in the frontend where RECAPDocuments can be matched by Docket fields, which are indexed within the RECAPDocument, so RDs can be matched by only docket fields.

If this is the expected behavior, it will be important to show the Docket fields similarly to the frontend, so users can understand why those documents are being matched even if the keywords don't appear directly within the RDs.

Or should the behavior be that only RECAPDocuments with fields that match the query will be included in the alert, for instance:

(Screenshot of a mockup alert email showing only the matching RECAPDocuments omitted.)

Depending on the expected behavior, the design of the daily sweep index will change. If RDs can be matched by Docket fields, we could simply mirror the current RECAP search index. If the second option is preferred, we would need to switch to an index with a nested documents approach and expect RDs to be matched independently of Docket fields.

And the Docket-only alert (in case we still need to split alerts) could look as follows:

(Screenshot of a mockup docket-only alert email omitted.)

The main difference is that it will include only Dockets without any entries.

mlissner commented 5 months ago

should we still divide alerts into two types?

No. If we can avoid the two alert types, we really should. That was just an idea if we couldn't find a better way forward.

Does it make sense for user needs to still divide alerts for Docket and RDs? I think it would still make sense to split alerts in two types if users want to know which type of object triggered the hit in the alert they're receiving.

I think the emails should try to match the search results as much as possible. So when there's a docket result, it just shows dockets, when it's a document result, it shows the nested document inside the correct docket.

To the user, it should be seamless and they shouldn't think about documents vs. dockets when making or receiving alerts (just like they don't when doing a query).

In this case, imagine the RECAPDocuments don't contain the search query "United States" in any of their fields. But they will match because "United States" is within the Docket case name. Is this correct, even if users created the alert for the document alert type? I think that's correct because it follows the behavior in the frontend where RECAPDocuments can be matched by Docket fields, which are indexed within the RECAPDocument, so RDs can be matched by only docket fields.

I don't think that's ideal, but if it matches the front end, it's OK. Ideally, the email would just have a docket if it only matched on docket fields (and the front end too, I guess).

Or should the behavior be that only RECAPDocuments with fields that match the query will be included in the alert

That is better, yes.

Depending on the expected behavior, the design of the daily sweep index will change.

I think this just depends on how hard it is. We'd like to go for the ideal, correct solution at first. How much more time would you estimate it would take? If it's just a little bit, then let's go for it. If it's more than a few days, maybe it's better to do it as an enhancement down the road?

albertisfu commented 5 months ago

I think the emails should try to match the search results as much as possible. So when there's a docket result, it just shows dockets, when it's a document result, it shows the nested document inside the correct docket. To the user, it should be seamless and they shouldn't think about documents vs. dockets when making or receiving alerts (just like they don't when doing a query).

Got it. Yeah, I agree, this seems like the better approach.

I don't think that's ideal, but if it matches the front end, it's OK. Ideally, the email would just have a docket if it only matched on docket fields (and the front end too, I guess).

Yeah, this is how the frontend currently behaves. However, I don't think it's an issue in the frontend because the documents matched by docket fields don't affect the meaning of the search; they're just "extra documents." However, in alerts, I can see how it could be confusing because users might think those documents are directly related to the keywords in the query when the only relation is that they belong to the docket.

I think this just depends on how hard it is. We'd like to go for the ideal, correct solution at first. How much more time would you estimate it would take? If it's just a little bit, then let's go for it. If it's more than a few days, maybe it's better to do it as an enhancement down the road?

Well, going for the correct solution, which involves only matching RECAPDocuments by their own fields while still matching cross-object queries when they should match, implies we'd need a nested mapping for both the percolator and the daily sweep index. Additionally, we'd need to apply a kind of grouping when alerts are matched by docket-only or document-only queries that belong to the same case, so they are shown in the same entry in the alert. I estimate this could take about ~2 extra days.

One of the things we should take care of when doing this is ensuring that this new approach follows the results in the frontend as closely as possible without missing anything, except for matching RECAPDocuments by Docket fields. I'll be doing some tests around this to confirm that. Also, I can see that using the nested approach in the daily sweep index means that if many documents are added/updated during a day, the whole Docket document and all its nested child documents will be reindexed every time another document is added or updated. I expect this number to be a maximum of around a hundred documents (considering they're the documents for the day), so it shouldn't represent a performance issue.

mlissner commented 5 months ago

Great. If it's only two days, let's go for it.

many documents are added/updated during a day, the whole Docket document and all its nested child documents will be reindexed every time another document is added or updated.

I don't understand what you mean here. Can you explain for me?

albertisfu commented 5 months ago

Great. If it's only two days, let's go for it.

Perfect!, already working on it.

I don't understand what you mean here. Can you explain for me?

Sure, I meant that a nested document will look something like this:

{
   "case_name":"Lorem ipsum",
   "docket_number":"21-564",
   "documents":[
      {
         "description":"Test description",
         "plain_text":"Test plain",
         "document_number":1
      },
      {
         "description":"Test description",
         "plain_text":"Test plain",
         "document_number":2
      }
   ]
}

The first issue is related to the number of documents nested within the parent document. The more nested documents there are, the more memory is required to handle them within the cluster. The documentation states that the default limit is 10,000 to prevent performance issues: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#_limits_on_nested_mappings_and_objects

Since we'll only add documents created or modified during the day, I expect the number of nested documents in a Docket to not be too large and to always remain below 10,000.

The other issue concerns indexing and updates. A document with a nested field is treated as a single unit, so in order to change a parent field or add/update a nested document, Elasticsearch requires performing a complete reindexing of the document. Thus, if a Docket contains too many documents for the day, and we continue adding/updating it, the cluster internally performs a full reindex of this document every time it is changed.
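For reference, a sketch of what a nested mapping like this could look like in elasticsearch-dsl; the class, field, and index names are illustrative rather than the actual CourtListener mappings.

    from elasticsearch_dsl import Document, InnerDoc, Keyword, Nested, Text

    class RECAPDocumentInner(InnerDoc):
        description = Text()
        plain_text = Text()
        document_number = Keyword()

    class DailySweepDocket(Document):
        case_name = Text()
        docket_number = Keyword()
        # Each nested document is stored as a separate hidden Lucene document,
        # but the docket and all its children are reindexed as one unit on update.
        documents = Nested(RECAPDocumentInner)

        class Index:
            name = "recap_sweep"  # illustrative name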

I think we have two options to handle this process:

mlissner commented 5 months ago

Since we'll only add documents created or modified during the day, I expect the number of nested documents in a Docket to not be too large and to always remain below 10,000.

Yes, that's a very safe assumption. The worst case is bankruptcy cases, which can have something like 100 docs in a day, but even that's not common.


For indexing performance, it sounds like there are three options:

  1. Do the simple thing and just index stuff as it comes in.

  2. Index stuff using painless scripts.

  3. Do it as a batch at the end of the day.

Number 1 is least performant, but simplest. Number 2 saves some bandwidth, but doesn't help the cluster ("internally the cluster will perform the complete reindex for each of these requests"). Number 3 will save the elastic cluster some effort at the cost of the database and batching everything at the end.

My vote is for number 1 because we should avoid doing premature optimizations, and it seems simplest. I also always prefer processes that spread performance over the day instead of doing big pulls all at once, which also favors number 1.

So I'd suggest we go that direction, and if it isn't fast enough we can upgrade to a better solution?

albertisfu commented 5 months ago

My vote is for number 1 because we should avoid doing premature optimizations, and it seems simplest. I also always prefer processes that spread performance over the day instead of doing big pulls all at once, which also favors number 1.

So I'd suggest we go that direction, and if it isn't fast enough we can upgrade to a better solution?

Got it. Yeah, I agree, option 1 is the simpler solution, and we can perform optimizations if they're required. Just a note about option 1 that I noticed will be required: every time we get a Docket or RD add/update during the day, we'll need to create a JSON document holding the updated state of the case (Docket fields + RDs) for that day. Therefore, a database query that selects the RECAPDocuments created or modified during the day will be required every time we get a new add/update, to build the correct JSON; otherwise, we'll end up indexing the whole case in the daily sweep index. But I expect this query to be performant since we have indexes on the date_created and date_modified columns.
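A hedged sketch of that query; the model import path and the docket_entry relation follow the CourtListener schema as I understand it, so treat them as assumptions:

    from django.db.models import Q
    from django.utils import timezone

    from cl.search.models import RECAPDocument  # assumed import path

    def documents_changed_today(docket):
        """RECAPDocuments of a docket that were created or modified today."""
        today = timezone.now().date()
        return RECAPDocument.objects.filter(
            docket_entry__docket=docket
        ).filter(
            Q(date_created__date=today) | Q(date_modified__date=today)
        )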

albertisfu commented 5 months ago

I thought we had already solved all the important issues here and had a solid plan, but while working on it, more problems and questions have surfaced.

I created in https://github.com/freelawproject/courtlistener/pull/4127 the RECAP Search Alerts sweep index based on the nested approach and also created a compatible nested query approach and tested them.

I found the following:

Most of the tests, those that involved docket-only text queries, RECAPDocument-only text queries, or any combination of filters (docket-only, RECAPDocument-only, or combined fields in filters), worked well, with no difference from the parent-child approach used in the frontend and the API.

However, tests related to cross-object fields text queries are failing.

One of the reasons we decided to try the nested index approach was to avoid sending false-positive alerts when ingesting RECAPDocuments: in the regular index, docket fields are indexed into each RECAPDocument, so ingesting a document could trigger alerts that involve only docket fields (which should be triggered only by a docket ingestion).

In fact, the nested index approach helps to prevent the problem described above. When using a nested query, it can only reach fields in the child documents, and the parent query component can only reach parent fields. However, this feature of nested documents is also causing cross-object text queries not to work.

For instance, consider the following case document:

case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
       document_number:1, description: “Motion Ipsum”
       document_number:2, description: “Hearing Ipsum”

Now consider the query: q=”Motion Ipsum America”

In the current RECAP Search this query will return:

case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
      document_number:1, description: “Motion Ipsum”

This is possible because parent fields like case_name are indexed within each RECAPDocument, and the has_child query is structured so that every term in the query is looked into all the searchable fields. Therefore, we’re able to return the right match.

This also allows fielded text queries to work properly: q=”document_number:2 AND docket_number:23-54547”

Will return:

case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
       document_number:2, description: “Hearing Ipsum”

However, I found that this type of cross-object query is failing in the nested index approach because, in the nested approach, the document looks like this:

{
   "case_name": "America v. Lorem",
   "docket_number": "23-54547",
   "documents": [
      {
         "document_number": 1,
         "description": "Motion Ipsum"
      },
      {
         "document_number": 2,
         "description": "Hearing Ipsum"
      }
   ]
}

Parent fields are not indexed into each nested document.

So a query like: q=”Motion Ipsum America”

Looks like:

bool:
  should:
    nested:  # Child query.
      query: "Motion Ipsum America"
      fields:
        - documents.short_description
        - documents.plain_text
        - documents.document_type
        - documents.description^2.0
    query_string:  # Parent query.
      query: "Motion Ipsum America"
      fields:
        - case_name_full
        - suitNature
        - cause
        - juryDemand
        - assignedTo
        - referredTo
        - court
        - court_id
        - court_citation_string
        - chapter
        - trustee_str
        - caseName^4.0
        - docketNumber^3.0

So the whole phrase "Motion Ipsum America" is not found in any of the child documents or parent documents within their local fields.

It is also not possible to query a parent field from the nested query or, conversely, a nested field within the parent query context.

The solution would be the same as we used in the parent-child approach: index parent fields into each nested document. However, this brings us back to where we began, as we wouldn't be able to avoid triggering alerts for docket-only queries when ingesting any RECAPDocument that contains the docket fields indexed.

In brief:

So the proposed solution and its trade-offs are explained in the following tables:

Sweep index:

  Document ingested: Docket
    • Docket-only fields query: Alert triggered.
    • RECAPDocument-only fields query: Alert not triggered, because it just doesn't match.
    • Cross-object queries: We won't be able to trigger an alert because no RECAPDocument was ingested during the day, so cross-object queries won't return results.

  Document ingested: RECAPDocument
    • Docket-only fields query: Alert triggered. In this case, the problem is that when indexing the RD, the Docket is also indexed, and the docket-only query will be matched even if the docket was not ingested during the day. Possible workaround: when ingesting RDs, avoid indexing their parent Docket; only index Dockets if they're created or updated during the day. During the midnight sweep, perform two independent plain queries, one targeting Dockets and one targeting RDs. If a match is found in the Docket plain query, it indicates that a Docket ingested today matched the docket-only fields query, so trigger the alert. If a match is found in the RD plain query, check whether any of the HL (highlighted) fields belong to a Docket; if so, avoid sending the alert. This will help prevent sending duplicate alerts for the same case.
    • RECAPDocument-only fields query: Alert triggered.
    • Cross-object queries: Trigger the alert if matched. When indexing a new RECAPDocument, the Docket fields are also indexed/updated within the RD, making it possible to trigger these alerts.

  Document ingested: Both Docket and RECAPDocument indexed during the day
    • Docket-only fields query: Alert triggered.
    • RECAPDocument-only fields query: Alert triggered.
    • Cross-object queries: Trigger the alert if the Docket and its related RECAPDocuments ingested during the day match the query.

Percolator:

  Document ingested: Docket
    • Docket-only fields query: Alert triggered.
    • RECAPDocument-only fields query: Alert not triggered, because it just doesn't match.
    • Cross-object queries: Alerts won't be triggered because only docket fields are being percolated. This scenario will be partially handled by the sweep index, for RECAPDocuments that belong to the case, are also indexed during the day, and match the cross-object query.

  Document ingested: RECAPDocument
    • Docket-only fields query: Using HL filtering: alert not triggered. Not using HL filtering: alert triggered.
    • RECAPDocument-only fields query: Alert triggered.
    • Cross-object queries: Alert triggered if matched; this is possible because Docket fields are indexed within each RECAPDocument.

In summary, the proposed solution considering the trade-offs above will be as follows:

Create the sweep index using the same structure as the regular search index for RECAP.

However, in this approach, we’ll still have a partial issue regarding Docket indexing and cross-object queries.

We’ll be able to trigger alerts for cross-object queries via the sweep index but only for those RECAPDocuments indexed or updated (independently) during the day.

For instance, consider the following example:

q=case_name:"Lorem Ipsum" AND description:Motion

Original case:

case_name: “Lorem”
  document_1.description: Motion to…
  document_2.description: Motion to…

During the day, the Docket is updated to case_name: “Lorem Ipsum” and goes to the sweep index.

Also during the day, we get an upload for document_1 and its description is updated to: “Motion to hear…” and the document goes to the sweep index.

document_2 is not indexed into the sweep index because it didn’t receive an update during the day.

At midnight, the sweep index runs the query case_name:"Lorem Ipsum" AND description:Motion, and it matches the Docket and document_1 and sends the alert.

Final questions and considerations:

mlissner commented 5 months ago

Thanks for all the details, and shoot, I guess it's back to plan A.

Using the highlighting to do alert filtering is a great and novel idea. Nice one. Let's do that.

However, I wonder if document_2 should also be included in the alert because, after the Docket updated its case_name, this document also matches the query.

You're right, it should be included in this case and we can't document our way out of it, so when this is the case, we'll just have to do the batch updating. A few thoughts:

  1. Can we just update the sweep index once at the end of the day? That'd prevent us from ingesting entire dockets into the sweep index multiple times throughout the day, if a docket changes multiple times.

  2. Is there an API to pull data from one index to another? Feels like the kind of thing Elastic would have, and a way to make this perform better?

albertisfu commented 5 months ago

Can we just update the sweep index once at the end of the day? That'd prevent us from ingesting entire dockets into the sweep index multiple times throughout the day, if a docket changes multiple times.

Yeah, I think this is better. Just collect all the dockets that changed during the day and index them at the end of the day into the sweep index along with all their child documents.

Is there an API to pull data from one index to another? Feels like the kind of thing Elastic would have, and a way to make this perform better?

Sure, I think this is a perfect task for the Reindex API. We have used it in the past to migrate an entire index, but it's possible to use it with a query that selects which documents should be moved. We could just select dockets with a date_modified greater than the current day and their child documents too. For that, I think we can use a painless script. I'll do some tests to confirm the process.
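A rough sketch of that reindex call with the Python client (7.x-style body argument); the index names and the date field used in the query are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # Copy only documents modified today from the main RECAP index into the sweep index.
    es.reindex(
        body={
            "source": {
                "index": "recap_index",  # placeholder
                "query": {"range": {"date_modified": {"gte": "now/d"}}},
            },
            "dest": {"index": "recap_sweep"},  # placeholder
        },
        wait_for_completion=False,
    )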

Thanks!

albertisfu commented 5 months ago

An update/question here.

We're going to use a Redis SET to avoid triggering an alert more than once for the same document.

For instance, if alert_1 is triggered by a Docket ID 400:

The SET will be updated as: alert_1: (5600, 232355, 434343, 400)

So, if the alert is triggered again by the same Docket ID 400, it won't be sent again.

However, I noticed that we'll need to keep track of the RD IDs because RD-only alerts or cross-object queries can also be triggered by RDs.

So, it is possible that an alert is triggered by an RD in the case, and then it can also be triggered by a different RD in the same case. If we only store the Docket ID that triggered the alert, we won't be able to trigger the alert for different RDs in the same case.

Therefore, I'm thinking of updating the SET to store Docket or RD IDs, so it'll look like this:

alert_1: (d_5600, d_232355, d_434343, d_400, rd_543235, rd_300, rd_2000)

or holding a SET for each alert:

d_alert_1: (5600, 232355, 434343, 400)
r_alert_1: (543235, 300, 2000)

This way, we can keep track of the Dockets or RDs that triggered the alert independently.

Does that sound right to you? Can alerts be triggered by different RDs in the same case?

mlissner commented 5 months ago

Yes, this is exactly right. I think two keys per alert looks tidier, but I'd suggest something more like alert_hits:1.d and alert_hits:1.rd, etc.?

albertisfu commented 2 months ago

Following up on the question raised during the RECAP Search Alerts architecture review regarding the Percolator's lack of support for parent-child queries and the possibility of contributing to a solution.

According to https://github.com/elastic/elasticsearch/issues/2960#issuecomment-65052242 the main issue they describe with adding support for parent-child queries is the need to store documents in memory to percolate them one by one.

The approach they seem to be considering involves percolating a parent document. Since the document can only trigger queries involving parent fields, it would be necessary to retrieve all child documents belonging to the parent (from the main documents index), store them in memory, and percolate each one individually to match has_child queries.

This approach would be resource intensive, especially regarding memory, and would not scale well, particularly with parent documents that have a high cardinality of child documents.

mlissner commented 1 month ago

This is now in beta. We're working on the pricing for it and experimenting with it.