Data integrity issues related to 'ConflictError' and 'ConnectionTimeout' errors in Elasticsearch.

@mlissner After we solved the ConnectionErrors on Elastic UBQ requests and some issues related to data integrity, we have gotten a few ConnectionTimeout and ConflictErrors.

Before deciding to increase the request timeout and adding ConnectionTimeout and ConflictErrors to the new retry policy, which spreads the retries according to the volume of documents to update, I reviewed all the errors for each issue. This was to confirm that the requests triggering these errors did not involve data integrity issues, similar to those we had detected previously, where updates from different sources were triggered when the data varied.

COURTLISTENER-5V4 COURTLISTENER-5V5

These are my findings, where the events seem to be data integrity issues.

Pendergrass v. Raffensperger (N.D. Ga. 2021)

The issue here is that the case query report parses the case name as: Pendergrass et al v. Raffensperger Screenshot 2023-12-29 at 19 36 33

While the Docket report as: Pendergrass v. Raffensperger - Restricted Filer Robert Allensworth, see Order 244 Screenshot 2023-12-29 at 19 37 01

61688278_case_query_gand.txt 61688278_docket_gand.txt

In this case, which version name would be correct?

The Bankruptcy Link - Adversary Proceeding (Bankr. S.D.N.Y. 2008)

The parsed case_name from case query is: Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al Screenshot 2023-12-29 at 19 43 06

While in the docket report is parsed as : The Bankruptcy Link - Adversary Proceeding Screenshot 2023-12-29 at 19 43 33

4326756_case_query_nysb.txt 4326756_docket_nysb.txt

I'm not sure if it's possible in this case to obtain a standardized name from both sources, so that uploads from these two sources don't trigger unnecessary update tasks.

The Litigation Practice Group P.C. (Bankr. C.D. Cal. 2023)

The problem here is the assigned_to_str field:

From the docket report: "assigned_to_str": "Scott C. Clarkson", Screenshot 2023-12-29 at 19 50 42

From the case query page: "assigned_to_str": "Scott C Clarkson" Screenshot 2023-12-29 at 19 50 22

In both sources, the value is Scott C Clarkson, but the parser in the docket report adds a . after C. In this case, what would be the correct value?

67051161_docket_cacb.txt 67051161_case_query_cacb.txt

The Litigation Practice Group P.C. - Adversary Proceeding (Bankr. C.D. Cal. 2023)

This case is a combination of the previous errors: both the assigned_to_str and the case_name changed.

From docket report:

"assigned_to_str": "Scott C. Clarkson"
"case_name": "The Litigation Practice Group P.C. - Adversary Proceeding"

Screenshot 2023-12-29 at 20 08 55

From case query:

"assigned_to_str": "Scott C Clarkson",
"case_name": "Marshack v. Diab",

Screenshot 2023-12-29 at 20 08 20

67487044_case_query_cacb.txt 67487044_docket_cacb.txt

The New York Times Company v. Microsoft Corporation (S.D.N.Y. 2023)

I'm not sure if this one is okay. The docket was updated from:

assigned_to_str: "" Screenshot 2023-12-29 at 20 12 34

assigned_to_str: "Judge Unassigned" Screenshot 2023-12-29 at 20 12 21

I'm not sure if Judge Unassigned is important in the context of the docket or if it's possible to return None when this string is detected.

Also, we got an additional event from this docket, related to the case_name. As can be seen in the screenshots, the case name has changed: Docket 1: The New York Times Company v. MICROSOFT CORPORATION Docket 2: The New York Times Company v. Microsoft Corporation RSS Feed: The New York Times Company v. Microsoft Corporation

The case_name is the same, but the letter case changed from MICROSOFT CORPORATION to Microsoft Corporation. The model field tracker detects this change as an update, which triggers an Elastic update.

Should we ignore these letter case variations on the tracker field? Or is it correct to trigger the update?

68117049_docket_nysd_2.txt 68117049_docket_nysd_1.txt

Fieldwood Energy LLC and The Official Committee of Unsecured Creditors (Bankr. S.D. Tex. 2020)

This one is also related to the case_name parsing from the docket report and another source:

Case name from the docket report: Fieldwood Energy LLC Screenshot 2023-12-29 at 20 35 41

Case name from the Claim register: Fieldwood Energy LLC and The Official Committee of Unsecured Creditors

Screenshot 2023-12-29 at 20 35 57

txsb_17411371_claim.txt txsb_17411371_docket.txt

Sklar Exploration Company, LLC (Bankr. D. Colo. 2020)

Also related to the case_name parsing.

From the RSS Feed: Sklar Exploration Company, LLC and Sklarco, LLC Screenshot 2023-12-29 at 20 38 24

From the Docket Report: Sklar Exploration Company, LLC Screenshot 2023-12-29 at 20 37 48

cob_17035263_docket.txt

Let me know what you think about these issues, so we can decide what to do in each case, including whether there's something we can address in Juriscraper or if we should ignore some of them in the field tracker.

Additionally, there were some events that were triggered. However, after checking all possible sources through which a docket can be updated (RSS, RECAP Email, Docket reports, Case query pages, RECAP Fetch), I didn't find any data variation that could explain why the task was triggered. Therefore, I'm wondering if it would be possible to examine the pghistory Docket table to trace if the value changed.

These cases are:

In re Roundup Products Liability Litigation (N.D. Cal. 2016)

Value that changed according to the task: docket_number
Around: Dec 24, 8:32 AM UTC; Dec 26, 10:42 AM UTC

In Re: Terrorist Attacks on September 11, 2001 (S.D.N.Y. 2023)

Value that changed according to the task: docket_number
Around: Dec 22, 8:47 AM; Dec 24, 8:32 AM; Dec 25, 8:21 AM

Pendergrass v. Raffensperger (N.D. Ga. 2021)

The issue here is that the case query report parses the case name as: Pendergrass et al v. Raffensperger

While the Docket report as: Pendergrass v. Raffensperger - Restricted Filer Robert Allensworth, see Order 244

In this case, which version name would be correct?

Just "Pendergrass et al v. Raffensperger" is correct, but is there anything we can do about that?

The Bankruptcy Link - Adversary Proceeding (Bankr. S.D.N.Y. 2008)

The parsed case_name from case query is: Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al

While in the docket report is parsed as : The Bankruptcy Link - Adversary Proceeding

I'm not sure if it's possible in this case to obtain a standardized name from both sources, so that uploads from these two sources don't trigger unnecessary update tasks.

Yeah, the full name is correct. No idea why the docket report is so messed up.

The Litigation Practice Group P.C. (Bankr. C.D. Cal. 2023)

The problem here is the assigned_to_str field:

From the docket report: "assigned_to_str": "Scott C. Clarkson"

From the case query page: "assigned_to_str": "Scott C Clarkson"

In both sources, the value is Scott C Clarkson, but the parser in the docket report adds a . after C. In this case, what would be the correct value?

It's interesting we add that period. I guess we should either do that everywhere or nowhere. It's nice that we add it, if we can always do so without making mistakes (which...maybe we can?).

The Litigation Practice Group P.C. - Adversary Proceeding (Bankr. C.D. Cal. 2023)

This case is a combination of the previous errors: both the assigned_to_str and the case_name changed.

From docket report:
"assigned_to_str": "Scott C. Clarkson"
"case_name": "The Litigation Practice Group P.C. - Adversary Proceeding"
From case query:
"assigned_to_str": "Scott C Clarkson",
"case_name": "Marshack v. Diab",

Yeah, same thoughts as above.

The New York Times Company v. Microsoft Corporation (S.D.N.Y. 2023)

I'm not sure if this one is okay. The docket was updated from:

assigned_to_str: ""

assigned_to_str: "Judge Unassigned"

I'm not sure if Judge Unassigned is important in the context of the docket or if it's possible to return None when this string is detected.

If we can get the value "Judge Unassigned" from all sources, we should use that, since it provides meaning to our users. If we can't get that value consistently, then it's OK to normalize to a null value, if we have to choose something.

Also, we got an additional event from this docket, related to the case_name. As can be seen in the screenshots, the case name has changed: Docket 1: The New York Times Company v. MICROSOFT CORPORATION Docket 2: The New York Times Company v. Microsoft Corporation RSS Feed: The New York Times Company v. Microsoft Corporation

The case_name is the same, but the letter case changed from MICROSOFT CORPORATION to Microsoft Corporation. The model field tracker detects this change as an update, which triggers an Elastic update.

Should we ignore these letter case variations on the tracker field? Or is it correct to trigger the update?

Well, Elastic doesn't care much about case variations, so I guess it's fine and quite reasonable to ignore them. That said, I think we're probably the ones doing the normalization.

Fieldwood Energy LLC and The Official Committee of Unsecured Creditors (Bankr. S.D. Tex. 2020)

This one is also related to the case_name parsing from the docket report and another source:

Case name from the docket report: Fieldwood Energy LLC

Case name from the Claim register: Fieldwood Energy LLC and The Official Committee of Unsecured Creditors

The longer name is better, but we can't trust that as our heuristic, I'm afraid.

Sklar Exploration Company, LLC (Bankr. D. Colo. 2020)

From the RSS Feed: Sklar Exploration Company, LLC and Sklarco, LLC

From the Docket Report: Sklar Exploration Company, LLC

Again, the longer one is best.

Let me know what you think about these issues, so we can decide what to do in each case, including whether there's something we can address in Juriscraper or if we should ignore some of them in the field tracker.

Generally, I think we can ignore the difference based on uppercase vs lower case, and it'd be nice to normalize judge names more consistently, but the rest seem hard to ignore.

It feels a bit weird to be looking at data issues in this way though. Shouldn't we be able to just push whatever is in the DB into Elastic even if different parsers have different values?

I'm wondering if it would be possible to examine the pghistory Docket table to trace if the value changed.

Bill can help with this while I'm out, but, again, I feel like delving into data tweaks at this level is useful to make sure the tracker is working properly, but feels like we've gone astray from having Elastic just mirror the DB?

Thanks for your answers.

It feels a bit weird to be looking at data issues in this way though. Shouldn't we be able to just push whatever is in the DB into Elastic even if different parsers have different values?

I feel like delving into data tweaks at this level is useful to make sure the tracker is working properly, but feels like we've gone astray from having Elastic just mirror the DB?

Yeah, this is correct. We are already mirroring the database and updating ES with the latest value, regardless of the value or its source. The reason I tried to debug this way is that, from previous experience with these errors, it is very rare for enough genuine docket field updates to happen in a short period of time to trigger a ConflictError, so these are opportunities to detect data differences from sources and assess if they are correct or if they can be fixed. Keep in mind that once we add ConflictError and ConnectionTimeout to the new retry policy, these errors are going to be harder to detect.

So about the cases:

Just "Pendergrass et al v. Raffensperger" is correct, but is there anything we can do about that?

<br>Pendergrass et al v. Raffensperger et al <font color="red"> - Restricted Filer Robert Allensworth, see Order #244</font>

In this case, it seems possible to use in the Docket Report the same parsing method we're using for the Case Query page, so that it ignores the content within the font tag. That way we can obtain the same string from both sources in these cases, e.g. "Pendergrass et al v. Raffensperger".
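As a rough illustration of the idea, here is a minimal sketch of dropping the content of the font tag before extracting the case name. The helper name `clean_case_name` and the regex-based approach are assumptions made for this example; this is not Juriscraper's actual parsing code.

```python
import re

# Hypothetical helper for illustration only; not Juriscraper's real parser.
# Matches a whole <font ...>...</font> element, including its text content.
FONT_ELEMENT_RE = re.compile(r"<font\b[^>]*>.*?</font>", re.IGNORECASE | re.DOTALL)
# Matches any remaining tag, such as <br>.
ANY_TAG_RE = re.compile(r"<[^>]+>")


def clean_case_name(raw_html: str) -> str:
    """Return the case name with any <font> annotation removed."""
    without_font = FONT_ELEMENT_RE.sub("", raw_html)
    without_tags = ANY_TAG_RE.sub("", without_font)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(without_tags.split())


raw = (
    '<br>Pendergrass et al v. Raffensperger et al <font color="red"> - '
    "Restricted Filer Robert Allensworth, see Order #244</font>"
)
print(clean_case_name(raw))
```

The sketch only removes the court's red annotation. Whether the trailing "et al" should also be dropped to match the Case Query page is a separate decision.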

Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al
The Bankruptcy Link - Adversary Proceeding
Yeah, the full name is correct. No idea why the docket report is so messed up.

Yes, it appears that the full name is not available in the docket report as it is on the 'Case Query' page. The Juriscraper helper _get_case_name employs various alternatives to compose the case name. However, in this instance, it seems it's not retrieving the correct name. There is code to obtain the case_name using the format 'Plaintiff v. Defendant'. In this way, the case name would be the same as on the Case Query page. But in this case, _get_case_name is not functioning as expected and is instead using previous conditions to retrieve the case_name.

Scott C. Clarkson
It's interesting we add that period. I guess we should either do that everywhere or nowhere. It's nice that we add it, if we can always do so without making mistakes (which...maybe we can?).

Got it. Yeah, we can also add the period to the assigned_to_str from the Case Query page, just as we do in the Docket report.
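A minimal sketch of what that normalization could look like, assuming a hypothetical helper `add_initial_periods` that appends a period to any standalone single-letter initial. The existing Docket report logic may differ from this.

```python
import re

# Hypothetical helper for illustration; the real Docket report logic may differ.
# A standalone initial is a single capital letter surrounded by whitespace
# (or at the end of the string), so "C." and "O'Brien" are left untouched.
STANDALONE_INITIAL_RE = re.compile(r"(?<!\S)([A-Z])(?=\s|$)")


def add_initial_periods(judge_name: str) -> str:
    """Add a period after single-letter initials that lack one."""
    return STANDALONE_INITIAL_RE.sub(r"\1.", judge_name)


print(add_initial_periods("Scott C Clarkson"))
print(add_initial_periods("Scott C. Clarkson"))
```

Because the function is idempotent, it could be applied to the value from every source without changing names that already carry the period.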

Docket Report:
"assigned_to_str": "Scott C. Clarkson"
"case_name": "The Litigation Practice Group P.C. - Adversary Proceeding"

Case Query:
"assigned_to_str": "Scott C Clarkson"
"case_name": "Marshack v. Diab"

Yeah, same thoughts as above.

Yes, the solution of adding the period to assigned_to_str is the same here, and the case_name issue is the same as described above. The case name from the Case Query is composed as Plaintiff v. Defendant, but here it is a bit more complicated because the case query name only uses the last names, while the names available in the docket report are full names:

Plaintiff: Richard A. Marshack, Chapter 11 Trustee
Defendant:  Tony Diab

assigned_to_str: ""
assigned_to_str: "Judge Unassigned"
If we can get the value "Judge Unassigned" from all sources, we should use that, since it provides meaning to our users. If we can't get that value consistently, then it's OK to normalize to a null value, if we have to choose something.

Great, I checked a Case Query page around the time when Judge Unassigned was set in the docket report, and at that time, the value was not present on the Case Query page. Currently, this docket has a judge assigned, and it is no longer listed as Judge Unassigned. Therefore, this doesn't appear to be an issue related to triggering unnecessary updates.

The New York Times Company v. MICROSOFT CORPORATION
The New York Times Company v. Microsoft Corporation
Well, Elastic doesn't care much about case variations, so I guess it's fine and quite reasonable to ignore them. That said, I think we're probably the ones doing the normalization.

Got it. So, in this case, should the case normalization occur in Juriscraper? This way, when the values are entered into the database, they are not detected as changes. Or should we store the values as we receive them from Juriscraper, and simply perform the normalization before comparing the values in the field tracker? Just to ensure that no update is triggered in case of variations.
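To make the second option concrete, here is a minimal sketch of a comparison that ignores letter case before deciding whether a field changed. The function name `is_meaningful_change` is hypothetical and this is not the field tracker's actual API; it only shows the comparison step.

```python
from typing import Optional


def is_meaningful_change(old: Optional[str], new: Optional[str]) -> bool:
    """Return True only if the values differ by more than letter case.

    Hypothetical helper: the stored value is left untouched, and only the
    comparison is case-insensitive.
    """
    return (old or "").casefold() != (new or "").casefold()


old_name = "The New York Times Company v. MICROSOFT CORPORATION"
new_name = "The New York Times Company v. Microsoft Corporation"
print(is_meaningful_change(old_name, new_name))
print(is_meaningful_change("", "Judge Unassigned"))
```

With this approach the database would still store whatever Juriscraper returns, and only the decision to trigger an update would skip case-only differences.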

`Fieldwood Energy LLC`
`Fieldwood Energy LLC and The Official Committee of Unsecured Creditors`
The longer name is better, but we can't trust that as our heuristic, I'm afraid.

Yeah, in this case the longer name is only available in the Claim register. The shorter one comes from the Docket Report, which currently takes the name from the Debtor field. The other part of the name (The Official Committee of Unsecured Creditors) is available in the docket report as the Creditor committee; in this case there is no Defendant. So maybe one option is to use the first Creditor committee when there is no Defendant available in the report?

Screenshot 2024-01-02 at 10 09 31

From the RSS Feed: Sklar Exploration Company, LLC and Sklarco, LLC
From the Docket Report: Sklar Exploration Company, LLC

Again, the longer one is best.

Upon reviewing the docket report, I noticed it differs from previous cases. Here, the second name, Sklarco, LLC, is also a debtor, and there is also a Creditor Committee named The Official Committee of Unsecured Creditors; there is no defendant. This complicates things because if we want to improve the parsing in Juriscraper, how would we know whether to choose a second Debtor to compose the name, or use the Creditor Committee? Is there a reliable method to decide which field to use for composing the longer name?

In brief:

I can work on ignoring uppercase vs lowercase differences in case names; just confirm whether we should normalize the case in Juriscraper or only at the field tracker level.
I can also work on adding the period to the judge names in all the sources, such as the case query page.
Also, the first case name "Pendergrass et al v. Raffensperger" seems like a fix we can do in Juriscraper, ignoring the part of the name that's within the font tag.
About the differences in bankruptcy names, it seems that we could improve the name composing in Juriscraper; however, it seems a bit complicated to determine which fields to use to get the right name.

Let me know what you think.

these are opportunities to detect data differences from sources and assess if they are correct or if they can be fixed.

Ok, lovely, thank you, but for these parser bugs, if you can file them as three separate bugs (except the field tracker, which falls on your plate), @grossir can do the coding and you can review.

@flooie, these should be small and fairly easy bugs that Gianfranco can do to get into the RECAP code. They'd help with our Elastic indexing and data quality.

Responses to your thoughts....

I can work on ignoring uppercase vs lowercase differences in case names; just confirm whether we should normalize the case in Juriscraper or only at the field tracker level.

Juriscraper should be consistent, and I think it's reasonable for it to always do the titlecase thing.
We should also not bother updating case changes into elastic, so if it's possible to do that, we should (maybe it's really hard, in which case, I'm guessing it's not worth it).

I can also work on adding the period to the judge names in all the sources, such as the case query page.

Cool.

Also, the first case name "Pendergrass et al v. Raffensperger" seems like a fix we can do in Juriscraper, ignoring the part of the name that's within the font tag.

Great.

About the differences in bankruptcy names, it seems that we could improve the name composing in Juriscraper, however it seems a bit complicated to determine which fields to use to get the right name.

I agree. I don't know the best answer here, and we're doing OK. I'd leave it be until somebody wise in bankruptcy helps.

Thank you all!

Great! I've created the 3 Juriscraper issues, related to this one.

https://github.com/freelawproject/juriscraper/issues/846 https://github.com/freelawproject/juriscraper/issues/845 https://github.com/freelawproject/juriscraper/issues/844

@grossir let me know if you have any questions.

I'll work on checking the field tracker is working properly on docket_number changes using the pghistory table.

We continue experiencing ConflictErrors as you can see in COURTLISTENER-5V5

I downloaded and triaged the events from the last two months (since the last time we checked them) to review if there could be any other data integrity issues that these errors can help us detect.

I processed the events to remove duplicates from the same instance: all_events_unique.txt

In that list we can see events like:

"ESRECAPDocument-17096272": [
        "27b5d25a8f30454c9732e18abb001b2b",
        [
            "assigned_to_str"
        ]
    ]

Which indicates the parent document being updated, the Sentry event ID, and the fields being updated.

Most of the events are related to the following fields: assigned_to_str case_name
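For reference, a tally like this can be produced with a short script. This is a minimal sketch that assumes the deduplicated events are loaded as a mapping with the structure shown above (document key, then Sentry event ID and list of fields). The second entry in the sample data is made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Assumed structure: {document_key: (sentry_event_id, [updated_fields])}
Events = Dict[str, Tuple[str, List[str]]]


def count_updated_fields(events: Events) -> Counter:
    """Count how many events touched each field."""
    counts: Counter = Counter()
    for _event_id, fields in events.values():
        counts.update(fields)
    return counts


sample_events: Events = {
    "ESRECAPDocument-17096272": (
        "27b5d25a8f30454c9732e18abb001b2b",
        ["assigned_to_str"],
    ),
    # Made-up entry, for illustration only.
    "ESRECAPDocument-00000000": (
        "00000000000000000000000000000000",
        ["case_name", "assigned_to_str"],
    ),
}

for field, total in count_updated_fields(sample_events).most_common():
    print(field, total)
```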

I didn't review them in detail, assuming they're related to the same issues described in this issue earlier and that we have created Juriscraper issues to fix them.

There are other events that only update one field at a time. I reviewed some of them, and they seem ok since the data being added appears to be new.

However, there were some that caught my attention because they updated many fields at a time, so I reviewed them in detail:

https://www.courtlistener.com/docket/18615350/cred-inc-adversary-proceeding/

The current content in the docket belongs to this upload:

Screenshot 2024-02-22 at 15 41 55

But previously, the same docket received this upload, which has different docket_number, date_filed, and case_name:

Screenshot 2024-02-22 at 15 42 27

https://www.courtlistener.com/docket/63219183/herbert-l-whitehead-iii/

The current content in the docket belongs to this upload:

Screenshot 2024-02-22 at 15 46 08

The same docket has also received the following uploads, which seem to be from different cases.

Screenshot 2024-02-22 at 15 46 57

Screenshot 2024-02-22 at 15 47 22

https://www.courtlistener.com/docket/64946857/brittany-n-prichett/

The current content in the docket belongs to this upload:

Screenshot 2024-02-22 at 15 49 58

The case has also received uploads from case query pages that contain different data:

Screenshot 2024-02-22 at 15 50 32

https://www.courtlistener.com/docket/67479778/oconally-v-united-states-department-of-education/

Current content from:

Screenshot 2024-02-22 at 15 55 04

But it has also received uploads from:

Screenshot 2024-02-22 at 15 57 30

The difference in uploads is leading to field changes in the database, which triggered the ConflictError because multiple uploads were sent almost simultaneously.

Is it possible that the pacer_case_id is the same for these cases, which seem different, or are they related somehow?

My original thinking about the ConflictError was that, by addressing the root issues of these data changes, we would gradually see fewer events of this type. Thus, we could continue to investigate just the remaining issues that persist to uncover more data integrity problems or simply discard them if they're unrelated to the data. However, since we still have some unresolved issues, the volume of ConflictErrors has not decreased, making it hard to examine every event in detail and determine whether it's ok or something that should be fixed downstream.

So, my question is whether we should continue to trigger and review these errors, or should we shift the exception to the new retry policy, which adjusts the retries according to the volume of documents to update. If we do that, we'll likely see fewer errors of this type, but we might overlook some other data integrity issues that haven't been discovered yet.

These errors look like they're coming from the RECAP extension, not from Elastic, right?

Correct, the errors described here came from the extension, similar to the previous data integrity issues that originated in Juriscraper.

We're only taking advantage of the ES ConflictError to discover these issues since they are triggered by data changes.

I think I'd say we should let the errors keep rolling in, but that they shouldn't stand in the way of our re-index. Eventually, we should try to look upstream and see what's causing these, if we can.

Does that sound right to you?

Yeah, that sounds good. Most of the errors triggered here do not prevent the data from being up to date in Elasticsearch. This is because if a ConflictError occurs, it indicates that another UBQ request was triggered almost simultaneously and succeeded. Only in rare cases, where the data from two simultaneous requests differ, might we miss something.

These are the parsing differences described in #3143 related to FreeOpinionReport

Opinions Report:

Screenshot 2024-03-07 at 15 48 37

Docket Report: Screenshot 2024-03-07 at 15 39 41

For instance, in the case of 2:23-cv-08966-DSF-E, the docket_number in both the Opinions Report and the Docket Report is the same: 2:23-cv-08966-DSF-E. However, the docket number returned by the FreeOpinionReport is 2:23-cv-08966-DSF-E, while the one returned by the DocketReport is 2:23-cv-08966.

Another field that differs is nature_of_suit. For instance, the one from the FreeOpinionReport is Copyright, while the one returned by the DocketReport is 820 Copyright. Not sure whether it is possible to standardize this field or not.

Yeah, let's normalize the docket number, that's an easy fix.

For nature of suit, how about: If there is a value in the field, don't update it. It never changes.

That'd take care of this source of issues, right?
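A minimal sketch of both ideas, under stated assumptions: the helper names are hypothetical, and the docket number is normalized by stripping the trailing judge initials so that it matches the shorter form the DocketReport returns in the example above. The regex assumes the usual district court pattern (office:year-type-number) and leaves anything else unchanged.

```python
import re
from typing import Optional

# Assumed pattern: office digit(s), two-digit year, case type, sequence number.
DOCKET_CORE_RE = re.compile(r"^\s*(\d+:\d{2}-[A-Za-z]+-\d+)")


def normalize_docket_number(docket_number: str) -> str:
    """Drop trailing judge initials, e.g. '-DSF-E' (hypothetical helper)."""
    match = DOCKET_CORE_RE.match(docket_number)
    return match.group(1) if match else docket_number.strip()


def merge_nature_of_suit(existing: Optional[str], incoming: Optional[str]) -> str:
    """Keep the stored value if there is one; otherwise accept the new one."""
    return existing if existing else (incoming or "")


print(normalize_docket_number("2:23-cv-08966-DSF-E"))
print(merge_nature_of_suit("820 Copyright", "Copyright"))
```

The nature of suit rule means whichever source arrives first wins, which is acceptable only because the value is not expected to change.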

That'd take care of this source of issues, right?

Yeah, I've already created the related issues for these fixes so you can prioritize them accordingly.
