anseljh opened this issue 8 years ago
So this turns out to have two complicated problems:
Content is often added to the RECAP Archive long after it is published by the court. Some people could find this really valuable. For example, if you want to know whenever there's a document mentioning "The Onion Router", you might not care if the document is new or old or what.
The way we keep Solr in sync with the database involves touching a lot of fields. Sometimes when we get a docket, we update the entire thing in Solr because we don't know which fields on the new docket we got are new. Because Solr is completely denormalized, every document in Solr has a copy of every field. For example, these might represent two documents in Solr:
And then we might get an update to the docket as HTML from PACER. Well, shoot, it's possible that the case name changed, so we assume that every document for this docket in Solr needs to be updated, and that's what we do.
I think what we can do, to fix both of these problems, is to only do alerts for the text of documents. What we can do then is limit our alerts to documents that got text since the last time the alerts ran — in other words, only trigger on PDF text. We don't have to think about the problem that the docket name might have changed, and we don't have to think about the gazillions of docket entry descriptions that we otherwise would be searching against.
It also solves problem 1 because all of those files could be kept in a sidecar index or even in a little database table that could be a lot smaller.
This limits our alerts a bit, yep, but it isn't horrible and it simplifies them a bunch.
Apologies if this is a bit like confused rambling. Working through this is busting my brain a bit.
I pondered this one some more. My first solution was:
That was weak, but solved part of the problem. New idea:
That will suck when the following happens:
In other words, whenever we create the item with a subset of the fields, we won't trigger on those fields until we get the PDF associated with that document. This is...not great, but it's also not terrible. Some fields will be left out of alerts part of the time.
Another solution is to keep a solr index containing the diff of new content whenever we get it. There's probably a way to identify which fields are new when we get something and to only store those into Solr. This way, alerts would only trigger on new content, not on old, and we'd have a weird Solr index with a bunch of partial objects.
For example, say we had three fields representing pizza toppings:
The first time we get data on a certain pizza, we learn that it has cheese. Great. We add the following to the Solr index:
pizza: 1, topping1: cheese, topping2: '', topping3: '',
A moment later, we learn that it also has mushrooms. Great, we update our regular Solr index and our regular database, and we add another item to our search Solr index:
pizza: 1, topping1: '', topping2: mushrooms, topping3: ''
A moment later, we get a great upload about the pizza that tells us all of its toppings. We compare this to our database, and it turns out we already knew about toppings 1 & 2, but topping 3 is news, so we add this to our Solr index:
pizza: 1, topping1: '', topping2: '', topping3: onions
In other words, this index only keeps track of new information since the last time the item was updated. It's a diff index, if you will.
Next, we run alerts, and we search for "Any pizza with onions". We run it against all of the fields (topping1, topping2, and topping3), and we learn that pizza #1 is a match! Great. We return that result.
This seems like it could work, assuming that doing the diffs as I describe them here isn't too terribly difficult.
I don't love it because it's complicated, but nothing is ever easy.
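To make the diff idea concrete, here's a minimal sketch, assuming records are plain dicts with the hypothetical pizza fields from the example above:

```python
# Hypothetical sketch of the "diff index" idea: given the record we already
# have and a fresh upload, keep only the fields whose values are new, and
# index that partial object into the alerts-only index.

def compute_diff(old: dict, new: dict) -> dict:
    """Return only the fields of `new` that changed or were added."""
    return {
        field: value
        for field, value in new.items()
        if value and old.get(field) != value
    }

old = {"pizza": 1, "topping1": "cheese", "topping2": "mushrooms", "topping3": ""}
new = {"pizza": 1, "topping1": "cheese", "topping2": "mushrooms", "topping3": "onions"}

partial = compute_diff(old, new)
partial["pizza"] = new["pizza"]  # keep the ID so alerts can reference the item
# partial == {"topping3": "onions", "pizza": 1}, i.e. only the new information
```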
Another idea that could solve this neatly: we keep a table of the documents that have been triggered for a given alert. In essence, you can only get an alert for a document once, no matter how many times it gets new pieces of data.
I think this is the solution to this issue I've been looking for.
Here's a service selling these alerts and saying they're "economical" at about $10 each / month:
https://www.courtalert.com/Business-Development-Realtime-Federal-Complaints.asp
We should really get this figured out.
Keep a Solr index just for alerts. It contains one day's worth of results and is cleared each morning at 2am. That'll be about XXX items.
Keep a redis or DB list containing all of the items that have been triggered for a given alert. Use this for two purposes:
Only allow alerts to be created if they don't trigger too many results. If it's more than about 50/day, there's just not much point in an alert. Handle this in the UI and trust users not to work around it.
Features:
Alerts must support real time and daily rates. (Weekly and monthly alerts are possible, but probably not worth the performance hits.)
Must support hundreds of thousands of new items per day and hundreds of results for an alert at a time, so that we can handle big data loads.
@mlissner why's this one closed? Did you decide not to do it, or...? I'm not quite following the comments.
This issue is open and the top one I'm working on when I can find time.
Ha. You're right. I'm a doofus.
Lots more discussion on this today. A few things to note and reiterate:
We have to remember that documents and dockets come to us piece by piece. This means that we cannot just get a new docket or document and send an alert for it. Instead, we need to only send alerts once per docket and once per document. Imagine an alert for "Foo", and a docket that comes in with the case name "foo v. bar." Cool, we send an alert. Then we get the parties for the docket, which includes "Foo" again, so we shouldn't send an alert for the docket again. Later, we get docket entry description, which mentions the term "Foo". We do send an alert for that. Later we get the document text to go along with the docket entry. We do not send an alert for that.
Parent-child searches are not possible in Elasticsearch percolators. This is unfortunate because we need to use percolators to trigger alerts on the RECAP and Opinions databases. To solve this, we discussed two possible ways forward:
We create three percolator indexes: one for dockets, one for documents, and a third for parties. Then we split up alerts to only save the relevant fields into the correct index. When we get a new document, we percolate it against all of the percolators, and send an alert if they all hit. Unfortunately, this has a few problems:

If a user has `docket.case_name=FOO AND document.description=BAR`, we can split that up into the two percolators and, if they both hit, send the alert email. But if a user has `docket.case_name=FOO OR document.description=BAR`, we should send the alert if both or either percolator hits. The only way to know the difference between these queries is to parse them, which is a bad idea.

Some queries will only use fields from one object type, like `docket.case_name=Foo`. In that case, we only need to run against the docket percolator, and send alerts based on that. Kind of complicated to sort out.

A different approach is to flatten documents before percolating them against a single flattened percolator. This is OK, but it will have false positives because it loses structure. Imagine a query for `Party=Mike AND Firm=Sibley`; in English you could think of this as "Cases where Mike is represented by Sibley". In a parent-child index, this will work because parties have structure:
docket : {
parties: [
{name: Mike, firm: sibley}
]
}
In a flattened index, you'll have false positives on dockets like:
docket : {
parties: [
{name: Mike, firm: walworth},
{name: Sam, firm: Sibley}
]
}
This is because the flattened version will be:
docket : {
parties: [Mike, Sam],
firms: [Walworth, Sibley],
}
I think this is enough of an edge case that we can just document it, and it should be OK. The good news is it's false positives — we'll send too many alerts. ARE THERE CASES WHERE WE'D HAVE FALSE NEGATIVES?
So the plan going forward is:
@mlissner Here is a summary of the different features and requirements of this project we've been discussing, a brief overview of the architecture we could use, and some questions so we can agree on the approach and start working on this project.
Since the percolator doesn't support join queries such as `has_child` or `has_parent`, and because we want to send `Docket` alerts and `RECAPDocument` alerts independently, we're going to create two percolator indices:

`DocketDocumentPercolator`
This index will include all the Docket fields that are currently part of the `DocketDocument` mapping, including parties. Notice that the parties in this mapping are already flat, since we're not using a child document to represent parties. So, false positives like the query you described in an example above (`Party=Mike AND Firm=Sibley`) are possible.

`ESRECAPDocumentPercolator`
This index will include the same fields as `ESRECAPDocument`: the `RECAPDocument` fields plus its parent Docket fields, except for parties.
We won't need to perform an additional flattening process either for the percolator mapping or the documents ingested before percolating them.
Following this approach, when a new Docket is created or updated, it will be percolated against the `DocketDocumentPercolator` index to check whether any query matches the Docket, triggering alerts that contain only Dockets. When a new RECAPDocument is added or updated, the `ESRECAPDocument` will be percolated against the `ESRECAPDocumentPercolator` index, triggering alerts that contain only RECAPDocuments.
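For illustration, registering an alert in one of these percolator indices could look roughly like this with elasticsearch-dsl (index name, mapped fields, and the query are placeholders, not our real mappings):

```python
from elasticsearch_dsl import Document, Percolator, Text

class DocketDocumentPercolator(Document):
    """Stores saved alert queries; incoming Dockets are percolated against them."""

    percolator_query = Percolator()
    # The index also needs the same field mappings as the documents being
    # percolated so the stored queries can be parsed, e.g.:
    case_name = Text()

    class Index:
        name = "docket_percolator"

# Registering an alert means indexing its query into the percolator index:
alert = DocketDocumentPercolator(
    percolator_query={"match": {"case_name": "Andromeda T. Pearson"}}
)
alert.save()
```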
However, here are the limitations I found with this approach.
Consider this query: https://www.courtlistener.com/?q=&type=r&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22&description=%22Notice%20of%20Motion%22
It returns 2 Dockets and 17 Docket entries in the frontend.
But the `ESRECAPDocumentPercolator` will only be able to check RECAPDocuments, which mirrors how the V4 `rd` type returns results:
https://www.courtlistener.com/api/rest/v4/search/?q=&type=rd&order_by=score%20desc&case_name=%22Andromeda%20T.%20Pearson%22&description=%22Notice%20of%20Motion%22
It matches the 17 RECAPDocuments. This is possible because the `case_name` is indexed within every RECAPDocument.
The problem is the following: Imagine a Docket with 10,000 RECAPDocuments.
Initial Docket case_name: "Andromeda T."
RECAPDocuments that belong to this Docket have different descriptions.
Then we receive an upload that updates the Docket case_name to "Andromeda T. Pearson".
We'll percolate the Docket into the `DocketDocumentPercolator`, and if there is a query like:
It'll be matched and trigger the alert. That's good.
However, now all the RECAPDocuments have also been updated with the new Docket `case_name`.
So, it's possible that many of those RECAPDocuments now match the query:
case_name="Andromeda T. Pearson" + description="Notice of Motion"
To check that, we'd need to percolate every `RECAPDocument` that belongs to the `Docket` to see which of them match the query, which means percolating 10,000 RECAPDocuments. I think we could use the percolator query that references the document in the original index, to avoid serializing the documents to JSON and percolating them from the application:
from elasticsearch_dsl import Q

# Percolate a document that is already indexed, referencing it by index and
# ID so we don't have to serialize it in the application:
Q(
    "percolate",
    field="percolator_query",
    index=document_index,
    id=document_id,
)
But this query only allows percolating one document at a time, so we'd need to repeat it 10,000 times in the example scenario (and there can be worse scenarios with many more RDs to percolate). We could also use the multi-search API to execute many requests at once. I'll need to measure the performance of this approach, but I don't think it will be as performant as we want, and it could use a lot of resources, considering that `update_by_query` requests that update child documents after a parent document change are pretty common.
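As a sketch of the multi-search idea using elasticsearch-dsl's MultiSearch (index and field names assumed from the snippet above), batching the per-document percolate queries into one request could look like:

```python
from elasticsearch_dsl import MultiSearch, Q, Search

def percolate_documents(document_ids, document_index, percolator_index):
    """Percolate many already-indexed documents in a single round trip."""
    ms = MultiSearch(index=percolator_index)
    for doc_id in document_ids:
        percolate = Q(
            "percolate",
            field="percolator_query",
            index=document_index,
            id=doc_id,
        )
        ms = ms.add(Search().query(percolate))
    # One HTTP request; one response per document, each listing the saved
    # alert queries that the document matched.
    return ms.execute()
```

This cuts the request overhead, but Elasticsearch still evaluates one percolation per document, which is the cost being weighed here.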
One alternative is to just document this limitation and inform users.
For instance, we could say that RECAPDocument alerts can match Docket fields only when the document is created, or when it is updated via one of the indexed Docket entry fields or a RECAPDocument field. That way, we'd percolate the document only when it is created or when its DocketEntry changes, which means just a few percolator queries for the RDs that belong to that DocketEntry. But we'll miss alerts if a Docket field is updated and the change spreads to its RECAPDocuments, which can be considered a false negative.
The other alternative is to abandon the percolator approach and use the inverse query approach, where all the RT alert queries are executed every few minutes and any matches are triggered. We'll need to evaluate which one (running alert queries, or percolating hundreds or thousands of documents on parent document updates) is more performant and less resource-intensive, and select the more convenient approach.
Another limitation of the percolator approach involves queries that combine party fields and `RECAPDocument` fields, which can also produce a false negative. For instance, this query:
Can match a `RECAPDocument` by its ID and where its parent Docket contains the party "KENNETH E. WEST".
But regarding alerts, if the user saves this alert, it won’t ever get a hit.
This problem could be solved only by applying a join query, but unfortunately, they’re not supported by the percolator.
So maybe we could just inform users that an Alert involving a query with a party filter won't work the way it does in the frontend or the API. Basically, if they want results consistent with what they'd get in the frontend or the API, they should only mix party filters with other Docket fields and avoid using a query string, since a query string could also match `RECAPDocument` fields.
Or we could also identify whether an Alert query has a party field and alert the user during its creation about its limitations or avoid creating it.
We’ll need a UI where users can save RECAP Search Alerts. During creation, they can decide whether they want to match `Dockets`, `RECAPDocuments`, or both.
Something like:
So we could create one or two alerts with the same query, one for the Docket alert and/or one for the RECAPDocument alert.
The query version we’ll store in the percolator will be the one specific to the document type, excluding all the join queries. We already have these queries; they're used to get the Docket and RECAPDocument counts separately. So these queries can be indexed into their percolator index, either `DocketDocumentPercolator` or `RECAPDocumentAlertPercolator`.
We need to avoid an alert being triggered more than once by the same Docket.
For example, imagine an alert with the query `nature_of_suit:copyright`. If this field is not available when the Docket is created, the alert won't be triggered. Later, the Docket is updated with the `nature_of_suit:copyright` value, so the alert is matched and triggered. To keep it from triggering again on further updates, we planned to use a bloom filter that keeps track of the alerts that have been sent, so they're not triggered more than once.
However, I think the bloom filter is possibly not the right approach.
We could have a global bloom filter that stores Docket-Alert pairs, so we know when an alert has already been triggered and avoid triggering it again. The problem with a global filter is that it'll grow too fast, since new elements are added for every alert that is triggered.
So it'd be better to have one bloom filter per Alert in the database, storing the `docket_ids` that have triggered that alert. That way, we'd have as many bloom filters as Alerts, but each would be small.
But the problem I see with the bloom filter is:
The problem is false positives, because we'll store the `docket_id` in the filter once the Docket has triggered the alert. A false positive, where the filter claims an ID is present when it actually is not, would lead us to skip sending alerts that should be sent.
Since false negatives are not possible, there is no possibility of duplicate alerts, which is good. But I recall we discussed that it's more important not to miss any alerts.
We could reduce the probability of getting a false positive by selecting a big bloom filter size and a good hash function, but possibly it's better to just use a SET.
So the alternative approach is to create a Redis SET for each alert and store each `docket_id` that triggered the alert:
alert_1: (400, 5600, 232355, 434343, etc.)
Adding new elements to the set or checking if an ID is already in the set can be done in constant time.
SISMEMBER alert_1 400
True
So if an ID is already in the SET, we just omit sending the alert.
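For illustration, that per-alert SET could look like this with redis-py (the key naming is just an assumption):

```python
import redis

r = redis.Redis()

def should_send_alert(alert_id: int, docket_id: int) -> bool:
    """Send each alert at most once per docket, using one Redis SET per alert."""
    key = f"alert_{alert_id}"
    # SADD returns 1 if the ID was newly added and 0 if it was already a
    # member, so the membership check and insert happen in one atomic call.
    return r.sadd(key, docket_id) == 1

if should_send_alert(1, 400):
    ...  # send the email / webhook
```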
Another requirement is grouping RT alerts whenever possible, per #3102. As described in that issue, I think the only way to achieve this if we end up using the percolator (with the inverse query approach this won't be required) is to add a wait before sending the alerts, so that if more alerts are matched during the waiting time, we can send them in a single email.
Let me know what you think.
This is a real bummer and you're right that it comes with a bunch of tradeoffs. From a design perspective, I want this to be as seamless as possible. Ideally, people do a query in the front end, create an alert, and it works with some minor tradeoffs or imperfections. I'm afraid that where we're headed is:
If that's where we land, I think we're in trouble, so we have some work ahead of us to sort this out.
I did a little research on the parent-child percolator, and one person said they could use nested queries against the percolator. Is that a crazy idea?
The other alternative is to abandon the percolator approach and use the inverse query approach
This is really not performant, and I want organizations to be able to make 10,000 alerts each, creating millions of alerts. If each alert takes 1s, it'll never work, or at best it'll take a huge number of servers. It's also a bummer that it's not actually real time.
I'd like to avoid users even thinking about dockets vs. documents when they make alerts, but it could be the solution we need. Maybe instead of one button to create alerts, we offer two:
If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly.
I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)
Q: Can we robustly identify when somebody is making a cross-object query?
Yeah, bloom filters would have been fun. Someday. Redis sets it is.
Spec:
Thank you!
I did a little research on the parent-child percolator, and one person said they could use nested queries against the percolator. Is that a crazy idea?
In this case, we need to convert the parent-child queries to nested queries and evaluate if they match the same documents or if this conversion results in some false positives or false negatives.
However, this approach might still have some performance issues to evaluate. For instance, to percolate a document against the nested-query percolator, we need a document structured as parent-nested-child, which means creating a JSON document in memory representing the `Docket` with its `RECAPDocuments`. This would be fine for small cases, but the JSON object can be massive for large dockets with thousands of `RECAPDocuments`. Then we need to send this document to the percolator and hope it's performant enough. We have to build it in memory because we can't reference the already indexed documents: the current Dockets and RECAPDocuments indexed in ES have a different structure than what a nested-query percolator requires.
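For reference, the percolate query can also take the document inline rather than by reference, which is what this approach would need. A rough sketch (index and field names illustrative):

```python
from elasticsearch_dsl import Q, Search

# Build the parent-nested-child document in memory. For a large docket, the
# "documents" list could hold thousands of entries, which is the memory
# concern described above.
nested_doc = {
    "case_name": "Lorem v. Ipsum",
    "docket_number": "21-1234",
    "documents": [
        {"document_number": 1, "description": "Motion to dismiss"},
    ],
}

s = Search(index="nested_percolator").query(
    Q("percolate", field="percolator_query", document=nested_doc)
)
responses = s.execute()
```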
I'll do some tests around this idea to measure its performance.
I'd like to avoid users even thinking about dockets vs. documents when they make alerts, but it could be the solution we need. Maybe instead of one button to create alerts, we offer two:
Create docket alert
Create document alert
If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly.
Great, I like the idea of offering two different buttons to create these alerts. I'll propose some ideas about where and how we can place these buttons in the UI instead of the current bell icon.
I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)
This means that if we can't find a good solution to percolate the original frontend query without the problems described above, we'll end up offering Docket Alerts that only match Docket fields and Document Alerts that only match RECAPDocument fields?
Q: Can we robustly identify when somebody is making a cross-object query?
I'm afraid this is not possible. While we can robustly identify whether the query contains combined parent or child filters, or even within a string query if the user is using advanced syntax:
The problem is that we cannot identify a cross-object query in simple query strings. For instance:
q: Bank of America Expedited Motion
This query can match the string within some Docket fields or RECAPDocument fields. For example, it can match a docket with part of the string in the `case_name` (and other parent fields), and also match RECAPDocuments with the whole or part of the string within the `plain_text`, `description`, etc.
In cases like these, it's impossible to know (without performing the actual query) whether the query can match only Dockets, only RECAPDocuments, or both.
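Still, a conservative classifier might be useful: prove a query is docket-only or document-only when we can, and treat everything else, including plain query strings, as cross-object. A rough sketch with made-up field lists:

```python
import re

# Illustrative field lists; the real ones would mirror the index mappings.
DOCKET_FIELDS = {"case_name", "docket_number", "nature_of_suit", "party"}
DOCUMENT_FIELDS = {"description", "plain_text", "document_number"}

# A fielded term like case_name:foo or case_name:"foo bar".
FIELDED_TERM = re.compile(r'\b(\w+):(?:"[^"]*"|\S+)')

def classify_query(query_string: str, filters: set) -> str:
    """Classify an alert as docket-only, document-only, or cross-object.

    Anything we can't prove to be single-object is treated as cross-object,
    which costs an extra sweep query but never misses an alert.
    """
    fielded = {m.group(1) for m in FIELDED_TERM.finditer(query_string)}
    free_text = FIELDED_TERM.sub("", query_string)
    if re.sub(r"\b(AND|OR|NOT)\b", "", free_text).strip():
        # Free text can match fields on either object type.
        return "cross-object"
    fields = filters | fielded
    if fields and fields <= DOCKET_FIELDS:
        return "docket-only"
    if fields and fields <= DOCUMENT_FIELDS:
        return "document-only"
    return "cross-object"
```

A classifier like this can only err toward cross-object, which, as discussed later in the thread, is the safe direction.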
Spec:
Five minute groups for emails. No wait for webhooks.
Great!
Thanks for your answers and suggestions to explore.
but the JSON object can be massive for large dockets with thousands of RECAPDocuments
At first, I was thinking that if you got changes to a docket, you could just percolate only the docket info, without any documents at all, and that if you got changes to a document, you could just nest that one document within the docket.
But now I'm realizing that if you have a query like:
docket_name: foo
plaintext: bar
You might get this information today:
docket.case_name: baz
document.text: bar
You wouldn't send an alert, because the docket_name doesn't match. But tomorrow the name might be updated to `foo`, and you'd want to send the alert.
I think that implies that:
If that's right, I think we're getting close to a solution.
There is one other strategy that we can use here, which is to create a new index each day, and to use that for a sweep (so many sweeps, lately!). The idea here is that querying 500M items is really hard and slow. The only thing you really need to query is the new stuff of the day. So, what you do is:
If we do that in addition to the nested queries, we'd be sure to get everything, and we'd have a somewhat performant solution, since we'd only be querying against a couple hundred thousand items.
Most alerts would be real time. Some cross-object ones even would be, and the corner case would be covered.
What do you think?
This means that if we can't find a good solution to percolate the original frontend query without the problems described above, we'll end up offering Docket Alerts that only match Docket fields and Document Alerts that only match RECAPDocument fields?
Yes. Kind of lame though.
In cases like these, it's impossible to know (without performing the actual query) whether the query can match only Dockets, only RECAPDocuments, or both.
So that would be considered a cross-object query, because it queries across more than one object type.
You wouldn't send an alert, because the docket_name doesn't match. But tomorrow the name might be updated to foo, and you'd want to send the alert.
Yeah, exactly. The problem is directly related to docket field updates that can impact cross-object queries.
I think that implies that:
We can do docket-only and document-only alerts reliably using nested queries.
We can do cross-object alerts reliably when the new data is a document (just nest it in the docket and do the query).
But we can NOT do cross-object alerts reliably when the docket is updated, unless we create huge nested objects to percolate.
If that's right, I think we're getting close to a solution.
Yeah, the nesting you have in mind (nesting the document into the docket and percolating it) is to allow us to match any cross-object query, including parties, correct?
Because most of the docket fields (except parties) are indexed into each `RECAPDocument`, a plain query will be enough to match most queries, except those that include party filters. If so, I agree the nested query seems like the right solution.
But we can NOT do cross-object alerts reliably when the docket is updated unless we create huge nested objects to percolate.
Yeah, that's correct.
There is one other strategy that we can use here, which is to create a new index each day, and to use that for a sweep (so many sweeps, lately!). The idea here is that querying 500M items is really hard and slow. The only thing you really need to query is the new stuff of the day. So, what you do is:
This is a pretty good idea!
Just some questions:
At midnight, you run all your cross-object queries against the tiny one.
So we'll need to categorize the alerts into two types: cross-object alerts and non-cross-object alerts.
Cross-object alerts will be all the queries that include either:
q=aaa
If you get alerts, you check whether they were sent out earlier in the day.
Got it. I think we can use the same Redis set proposed above to avoid duplicates. So we'll have one set per alert that stores either `Docket` IDs or `RECAPDocument` IDs. This way, it won't matter whether the alert was triggered today or on a previous day; it won't be triggered again, avoiding duplicates for both the normal process and the midnight sweep.
One question here is how are we going to tag/schedule alerts sent at midnight. We have four alert rates:
In the percolator approach in OA, we do the following:
We trigger webhooks in real-time for all the rates.
I think we can do the same for alerts that are matched in real-time by the percolator.
But what would happen, for instance, to RT cross-object alerts that were missed during the day? Once they hit at midnight, will we group all of a user's missed alerts from the day and send a single email? What would we call that email? It's not a real-time email anymore, nor is it a daily email, since the alerts don't belong to the daily rate.
If the missed alerts belong to the daily rate, maybe we could execute the midnight sweep and see if some of the daily alerts had hits, then append those hits to the scheduled hits during the day via the percolator and send a single daily email.
For weekly and monthly rates, I think it can work similarly. Use the midnight sweep to store and schedule the hits according to the rate so they can be sent every week or month alongside the ones scheduled by the percolator.
Webhooks
Regarding webhooks, once missed hits are matched at midnight, should we send all of their related webhooks at once, regardless of their rate? Which rate should we put in the payload for these?
Yeah, the nesting you have in mind (nesting the document into the docket and percolating it) is to allow us to match any cross-object query, including parties, correct?
Yes.
This is a pretty good idea!
I've been thinking about this for years, but I was hoping not to have to do this, so hadn't mentioned it. But here we are. :)
So we'll need to categorize the alerts into two types: cross-object alerts and non-cross-object alerts.
Yeah, I think so, but if we do a sloppy job that says some docket-only or document-only alerts are actually cross-object, that'd be fine, right? We'd run an extra query, but wouldn't send extra alerts. So long as we err in that direction, we should be fine?
One question here is how are we going to tag/schedule alerts sent at midnight.
Pretty simple. We run our sweep, and send an email with the sweep results. We put extra words in the subject and body to explain what it's about. We continue doing everything with the daily, weekly, and monthly alerts same as before.
Regarding webhooks, once missed hits are matched at midnight, should we send all of their related webhooks at once, regardless of their rate?
Sure, or you can send them in separate payloads. Whatever is easier. I assume it's easier to keep these processes separate.
Which rate should we put in the payload for these?
Real time, and then we document the situation by saying:
"Sometimes cross-object real time alerts will arrive at the end of the day. This is because blah, blah..."
What else??? :)
Yeah, I think so, but if we do a sloppy job that says some docket-only or document-only alerts are actually cross-object, that'd be fine, right? We'd run an extra query, but wouldn't send extra alerts. So long as we err in that direction, we should be fine?
Hmm, I think in that scenario we'd miss alerts.
If we mistakenly tag cross-object queries as docket-only or document-only, those queries won't run at midnight, leading to missed hits.
On the other hand, if we mistakenly tag docket-only or document-only queries as cross-object, we'll run extra queries, but we won't send duplicates.
So, we should be careful when categorizing the queries or run the sweep over all the queries.
Pretty simple. We run our sweep, and send an email with the sweep results. We put extra words in the subject and body to explain what it's about.
Perfect, this is for the RT rate, right?
We continue doing everything with the daily, weekly, and monthly alerts same as before.
Got it. So, in this case, to continue doing everything for the daily rate as before, we'd just need to ensure the normal daily send is triggered after the midnight sweep so those hits can be included in the daily send. For the weekly and monthly rates, if we want to include the results of that day as well, they should also run after the midnight sweep. We just need to confirm if the sending time is okay because if the midnight sweep runs at 12:00 and takes 15 minutes to complete, we'll need to send the Daily, Weekly, or Monthly emails after 12:15. If that's not okay, they can be included the next day for the daily rate or the next week or month, for the other rates.
Yeah, we want to err on the side of saying something is cross-object if we have any doubt. I agree.
Perfect, this is for the RT rate, right?
Yes, exactly.
We just need to confirm if the sending time is okay because if the midnight sweep runs at 12:00 and takes 15 minutes to complete, we'll need to send the Daily, Weekly, or Monthly emails after 12:15
Yeah, that's fine. Nobody cares if their daily/weekly/monthly alerts are exactly at midnight.
I'd suggest making this one command that does both the sweep and the daily/monthly/weekly alerts, so that it does one task, then the other without having to schedule things and hope the sweep is done before the other one triggers.
Excellent! I think we now have a good plan to work on.
Thank you!
And you. Epic!
Would it make sense to do two PRs? One for regular alerts and one for the sweep?
Yeah, I agree, two PRs make sense for the project!
@mlissner working on adding the Percolator index for RECAP, I have a couple of new questions that can impact the Percolator and the sweep index design. We plan to percolate RD documents nested within a Docket document to trigger alerts for RECAPDocuments reliably or percolate only Docket documents without any nested RD for triggering Docket-only alerts.
Maybe instead of one button to create alerts, we offer two:
Create docket alert
Create document alert
If we do that, I bet most of our users would be satisfied, and it'd be clear that cross-object queries are going to work poorly. I think if we do this, we remove the docket-related stuff from the RECAP percolator (no case name, etc.)
Considering we'll solve the issue related to cross-object queries on document updates by using the daily midnight sweep, should we still divide alerts into two types?
Alternatively, we could have only one type of alert, "RECAP", and we could send either only the Dockets that matched, the RDs that matched (showing their docket fields), or a combination of matched Dockets + RDs (which can be grouped):
For example, consider the following scenario:
Does it still make sense, for user needs, to divide alerts into Docket and RD alerts? I think it would still make sense to split alerts into two types if users want to know which type of object triggered the hit in the alert they're receiving.
The following question is also a bit related but regarding the alert structure and also the percolator design.
- Create docket alert
- Create document alert
Using nested queries (or a plain approach I'm experimenting with) and the midnight sweep, we'll be able to send alerts for non-cross-object and cross-object queries.
I imagine the email for a document alert (RD) like this:
In this case, imagine the `RECAPDocuments` don't contain the search query "United States" in any of their fields, but they match because "United States" is within the Docket case name. Is this correct, even if users created the alert as a document alert? I think it is, because it follows the behavior in the frontend, where `RECAPDocuments` can be matched by Docket fields: those fields are indexed within each RECAPDocument, so RDs can be matched by docket fields alone.
If this is the expected behavior, it will be important to show the Docket fields similarly to the frontend, so users can understand why those documents are being matched even when the keywords don't appear directly within the RDs.
Or should the behavior be that only `RECAPDocuments` with fields that match the query will be included in the alert, for instance:
Depending on the expected behavior, the design of the daily sweep index will change. If RDs can be matched by Docket fields, we could simply mirror the current RECAP search index. If the second option is preferred, we would need to switch to an index with a nested documents approach and expect RDs to be matched independently of Docket fields.
And the Docket-only alert (in case we still need to split alerts) could look as follows:
The main difference is that it will include only Dockets without any entries.
should we still divide alerts into two types?
No. If we can avoid the two alert types, we really should. That was just an idea if we couldn't find a better way forward.
Does it make sense for user needs to still divide alerts for Docket and RDs? I think it would still make sense to split alerts in two types if users want to know which type of object triggered the hit in the alert they're receiving.
I think the emails should try to match the search results as much as possible. So when there's a docket result, it just shows dockets, when it's a document result, it shows the nested document inside the correct docket.
To the user, it should be seamless and they shouldn't think about documents vs. dockets when making or receiving alerts (just like they don't when doing a query).
In this case, imagine the RECAPDocuments don't contain the search query "United States" in any of their fields. But they will match because "United States" is within the Docket case name. Is this correct, even if users created the alert for the document alert type? I think that's correct because it follows the behavior in the frontend where RECAPDocuments can be matched by Docket fields, which are indexed within the RECAPDocument, so RDs can be matched by only docket fields.
I don't think that's ideal, but if it matches the front end, it's OK. Ideally, the email would just have a docket if it only matched on docket fields (and the front end too, I guess).
Or should the behavior be that only RECAPDocuments with fields that match the query will be included in the alert
That is better, yes.
Depending on the expected behavior, the design of the daily sweep index will change.
I think this just depends on how hard it is. We'd like to go for the ideal, correct solution at first. How much more time would you estimate it would take? If it's just a little bit, then let's go for it. If it's more than a few days, maybe it's better to do it as an enhancement down the road?
I think the emails should try to match the search results as much as possible. So when there's a docket result, it just shows dockets, when it's a document result, it shows the nested document inside the correct docket. To the user, it should be seamless and they shouldn't think about documents vs. dockets when making or receiving alerts (just like they don't when doing a query).
Got it. Yeah, I agree, this seems like the better approach.
I don't think that's ideal, but if it matches the front end, it's OK. Ideally, the email would just have a docket if it only matched on docket fields (and the front end too, I guess).
Yeah, this is how the frontend currently behaves. However, I don't think it's an issue in the frontend because the documents matched by docket fields don't affect the meaning of the search; they're just "extra documents." However, in alerts, I can see how it could be confusing because users might think those documents are directly related to the keywords in the query when the only relation is that they belong to the docket.
I think this just depends on how hard it is. We'd like to go for the ideal, correct solution at first. How much more time would you estimate it would take? If it's just a little bit, then let's go for it. If it's more than a few days, maybe it's better to do it as an enhancement down the road?
Well, going for the correct solution, which involves matching `RECAPDocuments` only by their own fields while still matching cross-object queries when they should match, implies we'd need nested mappings for both the percolator and the daily sweep index. Additionally, we'd need to apply a kind of grouping when matching alerts for docket-only or document-only queries that belong to the same case, so they're shown in the same entry in the alert. I estimate this could take about ~2 extra days.
One of the things we should take care of when doing this is ensuring that the new approach follows the frontend results as closely as possible without missing anything, except for matching `RECAPDocuments` by `Docket` fields. I'll be doing some tests around this to confirm. Also, using the nested approach in the daily sweep index means that if many documents are added/updated during a day, the whole Docket document and all its nested child documents get reindexed every time another document is added or updated. I expect this number to be at most around a hundred documents (since they're only the day's documents), so it shouldn't be a performance issue.
Great. If it's only two days, let's go for it.
if many documents are added/updated during a day, the whole Docket document and all its nested child documents get reindexed every time another document is added or updated.
I don't understand what you mean here. Can you explain for me?
Great. If it's only two days, let's go for it.
Perfect! Already working on it.
I don't understand what you mean here. Can you explain for me?
Sure, I meant that a nested document will look something like this:
{
"case_name":"Lorem ipsum",
"docket_number":"21-564",
"documents":[
{
"description":"Test description",
"plain_text":"Test plain",
"document_number":1
},
{
"description":"Test description",
"plain_text":"Test plain",
"document_number":2
}
]
}
The first issue is related to the number of documents nested within the parent document. The more nested documents there are, the more memory is required to handle them within the cluster. The documentation states that the default limit is 10,000 to prevent performance issues: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#_limits_on_nested_mappings_and_objects
Since we'll only add documents created or modified during the day, I expect the number of nested documents in a Docket to not be too large and to always remain below 10,000.
The other issue concerns indexing and updates. A document with a nested field is treated as a single unit, so in order to change a parent field or add/update a nested document, Elasticsearch requires performing a complete reindexing of the document. Thus, if a Docket contains too many documents for the day, and we continue adding/updating it, the cluster internally performs a full reindex of this document every time it is changed.
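For reference, a sweep index shaped like the JSON above could be declared with elasticsearch-dsl roughly as follows (names and fields are illustrative, not our real mappings):

```python
from elasticsearch_dsl import Date, Document, InnerDoc, Integer, Nested, Text

class RECAPDocumentInner(InnerDoc):
    document_number = Integer()
    description = Text()
    plain_text = Text()

class DocketSweep(Document):
    """One ES document per docket; the day's RDs ride along as nested docs."""

    case_name = Text()
    docket_number = Text()
    date_modified = Date()
    documents = Nested(RECAPDocumentInner)

    class Index:
        name = "recap_sweep"

# Because "documents" is a nested field, adding or changing any child forces
# Elasticsearch to reindex the whole docket document internally, which is the
# update cost described above.
```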
I think we have two options to handle this process:
The first option is to use Painless scripts to add or update nested documents. This would allow us to only send a request to Elasticsearch with the data that should be added or changed. However, internally the cluster will perform the complete reindex for each of these requests.
The second option I'm considering, which could be simpler and more reasonable in terms of resource usage, is to avoid indexing/updating the daily sweep index throughout the day. Instead, as part of the daily send process, the first step could be to query the database for all the unique cases whose `Dockets` or `RECAPDocuments` were added or changed during the day, and index them in a single request per case, including the parent data and all of its `RECAPDocuments` added or modified during the day. We could use the `date_created` and `date_modified` fields to determine which content needs to be indexed into the daily sweep index.
The downside of this approach is that if too much content is added or changed during the day, the time it takes to index everything could be substantial and delay the sending of alerts. For instance, considering that indexing nested documents is significantly more expensive than indexing documents separately, we can estimate an indexing rate of around 100 such documents per second. That's 360,000 documents per hour, which I think is enough for a regular day. However, if we have special content imports or days with more uploads than usual, this indexing could extend for several hours before we can send the alerts.
Since we'll only add documents created or modified during the day, I expect the number of nested documents in a Docket to not be too large and to always remain below 10,000.
Yes, that's a very safe assumption. The worst case are bankruptcy cases, which can have something like 100 docs in a day, but that's still not common.
For indexing performance, it sounds like there are three options:
Do the simple thing and just index stuff as it comes in.
Index stuff using painless scripts.
Do it as a batch at the end of the day.
Number 1 is least performant, but simplest. Number 2 saves some bandwidth, but doesn't help the cluster ("internally the cluster will perform the complete reindex for each of these requests"). Number 3 will save the elastic cluster some effort at the cost of the database and batching everything at the end.
My vote is for number 1 because we should avoid doing premature optimizations, and it seems simplest. I also always prefer processes that spread performance over the day instead of doing big pulls all at once, which also favors number 1.
So I'd suggest we go that direction, and if it isn't fast enough we can upgrade to a better solution?
My vote is for number 1 because we should avoid doing premature optimizations, and it seems simplest. I also always prefer processes that spread performance over the day instead of doing big pulls all at once, which also favors number 1.
So I'd suggest we go that direction, and if it isn't fast enough we can upgrade to a better solution?
Got it. Yeah, I agree, option 1 is the simpler solution, and we can optimize later if required. Just one note about option 1 that I noticed will be required: every time we get a Docket or RD add/update during the day, we'll need to create a JSON document holding the updated state of the case (Docket fields + RDs) for that day. Therefore, a database query that filters the `RECAPDocuments` created or modified during the day will be required on every add/update to build the correct JSON; otherwise, we'd end up indexing the whole case into the daily sweep index. But I expect this query to be performant, since we have indexes on the `date_created` and `date_modified` columns.
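That filter could be a simple date lookup. A hypothetical Django ORM sketch (the model path and the docket_entry relation are assumptions based on the discussion, not verified):

```python
from datetime import date

from django.db.models import Q

from cl.search.models import RECAPDocument  # import path assumed for illustration

def rds_changed_today(docket_id: int):
    """Only the RDs touched today go into the case's sweep-index JSON."""
    today = date.today()
    return RECAPDocument.objects.filter(
        docket_entry__docket_id=docket_id,  # relation name assumed
    ).filter(
        Q(date_created__date=today) | Q(date_modified__date=today)
    )
```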
I thought we had already solved all the important issues here and had a solid plan, but while working on it, more problems and questions have surfaced.
I created in https://github.com/freelawproject/courtlistener/pull/4127 the RECAP Search Alerts sweep index based on the nested approach and also created a compatible nested query approach and tested them.
I found the following:
Most of the tests involving docket-only field text queries, RECAP-only field text queries, or any combination of filters (docket-only, RECAPDocument-only, or combined fields in filters) worked well, with no difference from the parent-child approach used in the frontend and the API.
However, tests related to cross-object text queries are failing.
One of the reasons we decided to try the nested index approach was to avoid sending false positive alerts when ingesting RECAPDocuments whose docket could trigger alerts involving only docket fields (which should be triggered only by a docket ingestion). Since those fields are indexed into each RECAPDocument in the regular index, documents could trigger alerts in those cases.
The nested index approach does prevent the problem described above: a nested query can only reach fields in the child documents, and the parent query component can only reach parent fields. However, this same property of nested documents is what causes cross-object text queries not to work.
For instance, consider the following case document:
case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
document_number:1, description: “Motion Ipsum”
document_number:2, description: “Hearing Ipsum”
Now consider the query:
q=”Motion Ipsum America”
In the current RECAP Search this query will return:
case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
document_number:1, description: “Motion Ipsum”
This is possible because parent fields like `case_name` are indexed within each `RECAPDocument`, and the `has_child` query is structured so that every term in the query is looked up across all the searchable fields. Therefore, we're able to return the right match.
This also allows fielded text queries to work properly:
q=”document_number:2 AND docket_number:23-54547”
Will return:
case_name: “America v. Lorem”
docket_number: “23-54547”
documents:
document_number:2, description: “Hearing Ipsum”
However, I found that this type of cross-object query is failing in the nested index approach because, in the nested approach, the document looks like this:
{
    "case_name": "America v. Lorem",
    "docket_number": "23-54547",
    "documents": [
        {
            "document_number": 1,
            "description": "Motion Ipsum"
        },
        {
            "document_number": 2,
            "description": "Hearing Ipsum"
        }
    ]
}
Parent fields are not indexed into each nested document.
So a query like:
q=”Motion Ipsum America”
Looks like:
bool:
should:
nested: # Child query.
query: "Motion Ipsum America"
fields:
"documents.short_description",
"documents.plain_text",
"documents.document_type",
"documents.description^2.0"
query_string: # Parent query
query: "Motion Ipsum America"
fields:
"case_name_full",
"suitNature",
"cause",
"juryDemand",
"assignedTo",
"referredTo",
"court",
"court_id",
"court_citation_string",
"chapter",
"trustee_str",
"caseName^4.0",
"docketNumber^3.0"
So the whole phrase "Motion Ipsum America" is not found in any of the child documents or parent documents within their local fields.
It is also not possible to query a parent field from the nested query or, conversely, a nested field within the parent query context.
The solution would be the same one we used in the parent-child approach: index parent fields into each nested document. However, that brings us back to where we began, since we'd again be unable to avoid triggering alerts for docket-only queries when ingesting any RECAPDocument, because the docket fields would be indexed within it.
In brief:
The nested index approach doesn't offer extra benefits over the parent-child index approach, and it adds overhead for indexing/updating documents.
We still need to determine how to solve the issue of false positive alerts. Here is the alternative solution I've thought of so far:

We could use the same parent-child approach, with parent fields indexed into each RECAPDocument, for the sweep index. To handle the problem of docket-only query alerts being triggered by RD ingestion, we could filter out the alerts before sending them. We can achieve this with highlighting: since highlights return the list of fields matched by the query, we can check whether any of those fields belong directly to an RD. If so, the alert should be sent; otherwise, we can skip it.

To do this, we'll need to enable highlighting for all the fields that can be used in a query (required to fully support filtering of advanced fielded queries). I just need to confirm that all fields can be properly highlighted, including IDs and keyword fields, since term vector highlighting is only supported on text fields; other fields should use plain highlighting.
So the proposed solution and its trade-offs are explained in the following tables:
Sweep index:
| | Docket-only fields query | RECAPDocument-only fields query | Cross-object queries |
|---|---|---|---|
| Document ingested: Docket | Alert triggered. | Alert not triggered, because it just doesn't match. | We won't be able to trigger an alert because no RECAPDocument was ingested during the day, so cross-object queries won't return results. |
| Document ingested: RECAPDocument | Alert triggered. The problem here is that when indexing the RD, the Docket is also indexed, so the docket-only query is matched even if the docket was not ingested during the day. Possible workaround: when ingesting RDs, avoid indexing their parent Docket; only index Dockets if they're created or updated during the day. During the midnight sweep, perform two independent plain queries, one targeting Dockets and one targeting RDs. If a match is found in the Docket query, a Docket ingested today matched the docket-only fields query, so trigger the alert. If a match is found in the RD query, check whether any of the highlighted fields belong to a Docket; if so, avoid sending the alert. This helps prevent sending duplicate alerts for the same case. | Alert triggered. | Trigger the alert if matched. When indexing a new RECAPDocument, the Docket fields are also indexed/updated within the RD, making it possible to trigger these alerts. |
| Documents ingested: both a Docket and a RECAPDocument indexed during the day | Alert triggered. | Alert triggered. | Trigger the alert if the Docket and its related RECAPDocuments ingested during the day match the query. |
Percolator:
| | Docket-only fields query | RECAPDocument-only fields query | Cross-object queries |
|---|---|---|---|
| Document ingested: Docket | Alert triggered. | Alert not triggered, because it just doesn't match. | Alerts won't be triggered because only docket fields are being percolated. This scenario will be partially handled by the sweep index, for RECAPDocuments that belong to the case, were also indexed during the day, and match the cross-object query. |
| Document ingested: RECAPDocument | Using HL filtering: alert not triggered. Without HL filtering: alert triggered. | Alert triggered. | Alert triggered if matched; this is possible because Docket fields are indexed within each RECAPDocument. |
In summary, the proposed solution considering the trade-offs above will be as follows:
Create the sweep index using the same structure as the regular search index for RECAP.
However, in this approach we'll still have a partial issue regarding Docket indexing and cross-object queries. We'll be able to trigger alerts for cross-object queries via the sweep index, but only for those `RECAPDocuments` indexed or updated (independently) during the day.
For instance, consider the following example:
q=case_name:"Lorem Ipsum" AND description:Motion
Original case:
case_name: “Lorem”
document_1.description: Motion to…
document_2.description: Motion to…
During the day, the Docket is updated to case_name: "Lorem Ipsum" and goes into the sweep index.
Also during the day, we get an upload for document_1, and its description is updated to "Motion to hear…", so the document goes into the sweep index.
document_2 is not indexed into the sweep index because it didn't receive an update during the day.
At midnight, the sweep index runs the query case_name:"Lorem Ipsum" AND description:Motion, which matches the Docket and document_1, and the alert is sent.
Final questions and considerations:
However, I wonder whether document_2 should also be included in the alert, because after the Docket updated its case_name, that document also matches the query. If so, I think the only workaround is that every time the Docket is updated during the day, all the RECAPDocuments that belong to the case are also indexed into the sweep index. This could be a performance issue for large dockets, similar to percolating a big document (including all RECAPDocuments), with the difference that indexing all the RECAPDocuments of a case can be split into batches for large dockets.
But if cross-object queries should only match a combination of a `Docket` and independent `RECAPDocuments` indexed/updated during the same day, we'd only need to index that day's objects into the sweep index. This approach fills the percolator alert gaps that can happen when a `RECAPDocument` is created/updated first (and no alert is triggered) and then its parent `Docket` is updated (considering that, with the update, the `RECAPDocument` + `Docket` combination can now trigger an alert). Such an alert can be triggered in the midnight sweep.
If the above is correct, we'll need to document that cross-object queries will only match a combination of a `Docket` and `RECAPDocuments` indexed/updated during the same day.
To solve the problem of docket-only queries being triggered by the percolator when ingesting `RECAPDocuments`, we could use the highlighting filtering described above. Or we could just send the alert every time a `RECAPDocument` or a `Docket` matches it, without worrying about whether the matched fields were docket-only or RECAP-only fields.
Thanks for all the details, and shoot, I guess it's back to plan A.
Using the highlighting to do alert filtering is a great and novel idea. Nice one. Let's do that.
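A sketch of what that filter could look like on a search hit, assuming elasticsearch-dsl responses and an illustrative list of RD fields:

```python
# Illustrative set of fields that belong directly to a RECAPDocument.
RD_FIELDS = {"description", "plain_text", "short_description", "document_type"}

def rd_matched_directly(hit) -> bool:
    """True if any highlighted field is an RD field rather than an inherited
    docket field, i.e. the alert was genuinely triggered by this RD."""
    highlights = getattr(hit.meta, "highlight", None)
    if highlights is None:
        return False
    return any(field in RD_FIELDS for field in highlights.to_dict())

# Usage: request highlighting on every searchable field when running the
# sweep/percolator queries, then drop hits where only docket fields lit up.
```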
However, I wonder whether document_2 should also be included in the alert, because after the Docket updated its case_name, that document also matches the query.
You're right, it should be included in this case and we can't document our way out of it, so when this is the case, we'll just have to do the batch updating. A few thoughts:
Can we just update the sweep index once at the end of the day? That'd prevent us from ingesting entire dockets into the sweep index multiple times throughout the day if a docket changes multiple times.
Is there an API to pull data from one index to another? It feels like the kind of thing Elastic would have, and a way to make this perform better.
Can we just update the sweep index once at the end of the day? That'd prevent us from ingesting entire dockets into the sweep index multiple times throughout the day if a docket changes multiple times.
Yeah, I think this is better. Just collect all the dockets that changed during the day and index them at the end of the day into the sweep index along with all their child documents.
Is there an API pull data from one index to another? Feels like the kind of thing Elastic would have and a way to make this perform better?
Sure, I think this is a perfect task for the Reindex API. We have used it in the past to migrate an entire index, but it's also possible to use it with a query that selects which documents should be moved. We could select dockets with a `date_modified` within the current day, and their child documents too. For that, I think we can use a Painless script. I'll do some tests to confirm the process.
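A rough sketch of that call with the low-level Python client (index names and the date math are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Copy today's dockets from the main RECAP index into the sweep index; the
# child documents would need a similar pass or a Painless script, as noted.
es.reindex(
    body={
        "source": {
            "index": "recap",
            "query": {"range": {"date_modified": {"gte": "now/d"}}},
        },
        "dest": {"index": "recap_sweep"},
    },
    wait_for_completion=False,  # run as an async task on heavy days
)
```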
Thanks!
An update/question here.
We're going to use a Redis SET to avoid triggering an alert more than once for the same document.
For instance, if alert_1 is triggered by Docket ID 400:
The SET will be updated as:
alert_1: (5600, 232355, 434343, 400)
So, if the alert is triggered again by the same Docket ID 400, it won't be sent again.
However, I noticed that we'll also need to keep track of RD IDs, because RD-only alerts and cross-object queries can be triggered by RDs.
So it's possible for an alert to be triggered by an RD in a case, and then by a different RD in the same case. If we only store the Docket ID that triggered the alert, we won't be able to trigger the alert for different RDs in the same case.
Therefore, I'm thinking of updating the SET to store either Docket or RD IDs, so it'd look like this:
alert_1: (d_5600, d_232355, d_434343, d_400, rd_543235, rd_300, rd_2000)
or holding one SET per ID type for each alert:
d_alert_1: (5600, 232355, 434343, 400)
r_alert_1: (543235, 300, 2000)
This way, we can keep track of the Dockets and RDs that triggered the alert independently.
Does that sound right to you? Alerts can be triggered by different RDs in the same case?
Yes, this is exactly right. I think two keys per alert looks tidier, but I'd suggest something more like `alert_hits:1.d` and `alert_hits:1.rd`, etc.?
Following up on the question raised during the RECAP Search Alerts architecture review regarding the Percolator's lack of support for parent-child queries and the possibility of contributing to a solution.
According to https://github.com/elastic/elasticsearch/issues/2960#issuecomment-65052242 the main issue they describe with adding support for parent-child queries is the need to store documents in memory to percolate them one by one.
The approach they seem to be considering involves percolating a parent document. Since that document can only trigger queries involving parent fields, it would be necessary to retrieve all the child documents belonging to the parent (from the main documents index), hold them in memory, and percolate each one individually to match `has_child` queries.
This approach would be resource-intensive, especially in terms of memory, and would not scale well, particularly for parent documents with a high cardinality of child documents.
This is now in beta. We're working on the pricing for it and experimenting with it.
I thought about this, but I haven't done it for the moment. The thing that's slowing me down is that most of the PACER alert systems (like Docket Alarm) will go and check a docket for you on some sort of regular basis. I'm afraid that if we create an alert system, people will expect that kind of service. I don't think this is hard though, since we already have alerts for two object types (oral args and opinions).
I also want to create alerts for dockets themselves. This would use the same system as the regular search, just filtered to a query like `docket_id:23378`. (When we do that, we should add alerts for cases, so you can get an alert any time a case is cited. This functionality already exists, but it should be a simple button on every opinion page.)