Open albertisfu opened 11 months ago
Additionally, there were some events that were triggered. However, after checking all possible sources through which a docket can be updated (RSS, RECAP Email, Docket reports, Case query pages, RECAP Fetch), I didn't find any data variation that could explain why the task was triggered. Therefore, I'm wondering if it would be possible to examine the pghistory Docket table to trace whether the value changed.
These cases are:
In re Roundup Products Liability Litigation (N.D. Cal. 2016)
Value that changed according to the task: docket_number
Around:
Dec 24, 8:32 AM UTC
Dec 26, 10:42 AM UTC
In Re: Terrorist Attacks on September 11, 2001 (S.D.N.Y. 2023)
Value that changed according to the task: docket_number
Around:
Dec 22, 8:47 AM
Dec 24, 8:32 AM
Dec 25, 8:21 AM
Pendergrass v. Raffensperger (N.D. Ga. 2021)
The issue here is that the case query report parses the case name as:
Pendergrass et al v. Raffensperger
While the Docket report as:
Pendergrass v. Raffensperger - Restricted Filer Robert Allensworth, see Order 244
In this case, which version name would be correct?
Just "Pendergrass et al v. Raffensperger" is correct, but is there anything we can do about that?
The Bankruptcy Link - Adversary Proceeding (Bankr. S.D.N.Y. 2008)
The parsed `case_name` from the case query is:
Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al
While the docket report parses it as:
The Bankruptcy Link - Adversary Proceeding
I'm not sure if it's possible in this case to obtain a standardized name from both sources, so that uploads from these two sources don't trigger unnecessary update tasks.
Yeah, the full name is correct. No idea why the docket report is so messed up.
The Litigation Practice Group P.C. (Bankr. C.D. Cal. 2023)
The problem here is the `assigned_to_str` field:
From the docket report:
"assigned_to_str": "Scott C. Clarkson"
From the case query page:
"assigned_to_str": "Scott C Clarkson"
In both sources, the value is `Scott C Clarkson`, but the parser in the docket report adds a `.` after `C`.
In this case, what would be the correct value?
It's interesting we add that period. I guess we should either do that everywhere or nowhere. It's nice that we add it, if we can always do so without making mistakes (which...maybe we can?).
The Litigation Practice Group P.C. - Adversary Proceeding (Bankr. C.D. Cal. 2023)
This case is a combination of the previous errors: both the `assigned_to_str` and the `case_name` changed.
From the docket report:
"assigned_to_str": "Scott C. Clarkson", "case_name": "The Litigation Practice Group P.C. - Adversary Proceeding"
From the case query:
"assigned_to_str": "Scott C Clarkson", "case_name": "Marshack v. Diab"
Yeah, same thoughts as above.
The New York Times Company v. Microsoft Corporation (S.D.N.Y. 2023)
I'm not sure if this one is okay. The docket was updated from:
assigned_to_str: ""
assigned_to_str: "Judge Unassigned"
I'm not sure if `Judge Unassigned` is important in the context of the docket or if it's possible to return `None` when this string is detected.
If we can get the value "Judge Unassigned" from all sources, we should use that, since it provides meaning to our users. If we can't get that value consistently, then it's OK to normalize to a null value, if we have to choose something.
We also got an additional event for this docket, related to the `case_name`. As can be seen in the screenshots, the case name has changed:
Docket 1: The New York Times Company v. MICROSOFT CORPORATION
Docket 2: The New York Times Company v. Microsoft Corporation
RSS Feed: The New York Times Company v. Microsoft Corporation
The `case_name` is the same, but the letter case changed from `MICROSOFT CORPORATION` to `Microsoft Corporation`. The model field tracker detects this change as an update, which triggers an Elastic update. Should we ignore these letter case variations in the field tracker? Or is it correct to trigger the update?
Well, Elastic doesn't care much about case variations, so I guess it's fine and quite reasonable to ignore them. That said, I think we're probably the ones doing the normalization.
Fieldwood Energy LLC and The Official Committee of Unsecured Creditors (Bankr. S.D. Tex. 2020)
This one is also related to the `case_name` parsing from the docket report and another source:
Case name from the docket report:
Fieldwood Energy LLC
Case name from the Claim register:
Fieldwood Energy LLC and The Official Committee of Unsecured Creditors
The longer name is better, but we can't trust that as our heuristic, I'm afraid.
Sklar Exploration Company, LLC (Bankr. D. Colo. 2020)
From the RSS Feed:
Sklar Exploration Company, LLC and Sklarco, LLC
From the Docket Report:
Sklar Exploration Company, LLC
Again, the longer one is best.
Let me know what you think about these issues, so we can decide what to do in each case, including whether there's something we can address in Juriscraper or if we should ignore some of them in the field tracker.
Generally, I think we can ignore the difference based on uppercase vs lower case, and it'd be nice to normalize judge names more consistently, but the rest seem hard to ignore.
It feels a bit weird to be looking at data issues in this way though. Shouldn't we be able to just push whatever is in the DB into Elastic even if different parsers have different values?
I'm wondering if it would be possible to examine the pghistory Docket table to trace if the value changed.
Bill can help with this while I'm out, but, again, I feel like delving into data tweaks at this level is useful to make sure the tracker is working properly, but feels like we've gone astray from having Elastic just mirror the DB?
Thanks for your answers.
It feels a bit weird to be looking at data issues in this way though. Shouldn't we be able to just push whatever is in the DB into Elastic even if different parsers have different values?
I feel like delving into data tweaks at this level is useful to make sure the tracker is working properly, but feels like we've gone astray from having Elastic just mirror the DB?
Yeah, this is correct. We are already mirroring the database and updating ES with the latest value, regardless of the value or its source. The reason I tried to debug this way is that, from previous experience with these errors, it is very rare for enough actual docket field updates to happen in a short period of time to trigger a `ConflictError`, so these are opportunities to detect data differences between sources and assess whether they are correct or can be fixed.
Before adding `ConflictError` and `ConnectionTimeout` to the new retry policy, consider that these errors will then be rarer and harder to detect.
So about the cases:
Just "Pendergrass et al v. Raffensperger" is correct, but is there anything we can do about that?
<br>Pendergrass et al v. Raffensperger et al <font color="red"> - Restricted Filer Robert Allensworth, see Order #244</font>
In this case, it seems possible to use in the Docket Report the same parsing method we're using for the Case Query page, so that it ignores the content within the `font` tag. That way we can obtain the same string from both sources in these cases, e.g. "Pendergrass et al v. Raffensperger".
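As a rough illustration of the idea, here is a minimal sketch that drops the `font` annotation from the raw cell before cleaning up the remaining markup. The helper name `strip_font_annotation` and both regexes are hypothetical, written only for this comment; this is not the code Juriscraper uses for the Case Query page.

```python
import re

# Hypothetical helper for illustration only; not Juriscraper's actual parser.
# Matches a <font ...>...</font> element including its inner text.
FONT_TAG_RE = re.compile(r"<font\b[^>]*>.*?</font>", re.IGNORECASE | re.DOTALL)
# Matches any remaining tag such as <br>.
ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_font_annotation(raw_html: str) -> str:
    """Return the case name without the text wrapped in <font> tags."""
    without_font = FONT_TAG_RE.sub("", raw_html)
    without_tags = ANY_TAG_RE.sub("", without_font)
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join(without_tags.split())


raw = (
    '<br>Pendergrass et al v. Raffensperger et al <font color="red"> '
    "- Restricted Filer Robert Allensworth, see Order #244</font>"
)
print(strip_font_annotation(raw))
```

A real fix would more likely work on the parsed HTML tree than on regexes, but the effect on the string should be the same: the restricted-filer note no longer ends up in the case name.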
Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al
The Bankruptcy Link - Adversary Proceeding
Yeah, the full name is correct. No idea why the docket report is so messed up.
Yes, it appears that the full name is not available in the docket report as it is on the Case Query page. The Juriscraper helper `_get_case_name` employs various alternatives to compose the case name. However, in this instance, it seems it's not retrieving the correct name. There is code to obtain the `case_name` using the format 'Plaintiff v. Defendant', which would make the case name the same as on the Case Query page. But in this case, `_get_case_name` is not functioning as expected and is instead using earlier conditions to retrieve the `case_name`.
Scott C. Clarkson
It's interesting we add that period. I guess we should either do that everywhere or nowhere. It's nice that we add it, if we can always do so without making mistakes (which...maybe we can?).
Got it, yeah, we can also add the period to the `assigned_to_str` from the Case Query page, just as we do in the Docket Report.
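A minimal sketch of what "add the period" could look like, assuming we only want to touch a single capital letter that stands alone between spaces (or at the ends of the string). The function name `add_initial_periods` and the regex are my own illustration, not the logic currently in the Docket Report parser.

```python
import re

# Hypothetical rule: a lone capital letter with whitespace (or the string
# boundary) on both sides is treated as an initial that lacks its period.
LONE_INITIAL_RE = re.compile(r"(?<!\S)([A-Z])(?!\S)")


def add_initial_periods(name: str) -> str:
    """Append a period to standalone single-letter initials in a name."""
    return LONE_INITIAL_RE.sub(r"\1.", name)


print(add_initial_periods("Scott C Clarkson"))   # initial gets a period
print(add_initial_periods("Scott C. Clarkson"))  # already has one, unchanged
```

Requiring whitespace on both sides keeps names such as "Mary O'Brien" untouched, which is the kind of mistake we would want to rule out before applying this to every source.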
Docket Report: "assigned_to_str": "Scott C. Clarkson", "case_name": "The Litigation Practice Group P.C. - Adversary Proceeding"
Case Query: "assigned_to_str": "Scott C Clarkson", "case_name": "Marshack v. Diab"
Yeah, same thoughts as above.
Yes, the solution of adding the period to `assigned_to_str` is the same here, and so is the approach to the `case_name` described above. The case name from the Case Query is composed as `Plaintiff v. Defendant`, but here it is a bit more complicated because the Case Query name only uses the last names, while the names available in the docket report are full names:
Plaintiff: Richard A. Marshack, Chapter 11 Trustee
Defendant: Tony Diab
assigned_to_str: "" assigned_to_str: "Judge Unassigned"
If we can get the value "Judge Unassigned" from all sources, we should use that, since it provides meaning to our users. If we can't get that value consistently, then it's OK to normalize to a null value, if we have to choose something.
Great, I checked a Case Query page from around the time when `Judge Unassigned` was set in the docket report, and at that time, the value was not present on the Case Query page. Currently, this docket has a judge assigned, and it is no longer listed as `Judge Unassigned`. Therefore, this doesn't appear to be an issue of triggering unnecessary updates.
The New York Times Company v. MICROSOFT CORPORATION The New York Times Company v. Microsoft Corporation
Well, Elastic doesn't care much about case variations, so I guess it's fine and quite reasonable to ignore them. That said, I think we're probably the ones doing the normalization.
Got it. So, in this case, should the case normalization occur in Juriscraper? This way, when the values are entered into the database, they are not detected as changes. Or should we store the values as we receive them from Juriscraper, and simply perform the normalization before comparing the values in the field tracker? Just to ensure that no update is triggered in case of variations.
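To make the second option concrete, here is a minimal sketch of a comparison that ignores letter case before deciding whether a field really changed. `is_meaningful_change` is a hypothetical helper; I'm not assuming anything about how the field tracker exposes old and new values.

```python
def is_meaningful_change(old, new) -> bool:
    """Return True when two field values differ by more than letter case.

    Hypothetical helper for illustration: string values are compared
    case-insensitively, any other type falls back to plain inequality.
    """
    if isinstance(old, str) and isinstance(new, str):
        return old.casefold() != new.casefold()
    return old != new


old_name = "The New York Times Company v. MICROSOFT CORPORATION"
new_name = "The New York Times Company v. Microsoft Corporation"
print(is_meaningful_change(old_name, new_name))
```

With this approach the database keeps whatever Juriscraper returned, and only the decision to trigger the Elastic update is affected.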
`Fieldwood Energy LLC`
`Fieldwood Energy LLC and The Official Committee of Unsecured Creditors`
The longer name is better, but we can't trust that as our heuristic, I'm afraid.
Yeah, in this case the longer name is only available in the Claim register, and the shorter one is in the Docket Report, which currently takes the name from the `Debtor` field. The other part of the name (The Official Committee of Unsecured Creditors) is available in the docket report as the `Creditor committee`; in this case there is no `Defendant`. So maybe one option is to use the first `Creditor committee` when there is no `Defendant` available in the report?
From the RSS Feed: Sklar Exploration Company, LLC and Sklarco, LLC From the Docket Report: Sklar Exploration Company, LLC
Again, the longer one is best.
Upon reviewing the docket report, I noticed it differs from the previous cases. Here, the second name, `Sklarco, LLC`, is also a debtor, and there is also a `Creditor Committee` named `The Official Committee of Unsecured Creditors`; there is no defendant. This complicates things because, if we want to improve the parsing in Juriscraper, how would we know whether to choose a second `Debtor` to compose the name, or use the `Creditor Committee`? Is there a reliable method to decide which field to use for composing the longer name?
In brief:
Let me know what you think.
these are opportunities to detect data differences from sources and assess if they are correct or if they can be fixed.
Ok, lovely, thank you, but for these parser bugs, if you can file them as three separate bugs (except the field tracker, which falls on your plate), @grossir can do the coding and you can review.
@flooie, these should be small and fairly easy bugs that Gianfranco can do to get into the RECAP code. They'd help with our Elastic indexing and data quality.
Responses to your thoughts....
I can work on ignoring uppercase vs. lowercase differences in case names; just confirm whether we should normalize the case in Juriscraper or only at the field tracker level.
Juriscraper should be consistent, and I think it's reasonable for it to always do the titlecase thing.
We should also not bother updating case changes into elastic, so if it's possible to do that, we should (maybe it's really hard, in which case, I'm guessing it's not worth it).
I can work also on adding the period to the judge names in all the sources like in the case query page.
Cool.
Also, the first case name, "Pendergrass et al v. Raffensperger", seems like a fix we can do in Juriscraper, ignoring the part of the name that's within the font tag.
Great.
About the differences in bankruptcy names, it seems that we could improve the name composing in Juriscraper, however it seems a bit complicated to determine which fields to use to get the right name.
I agree. I don't know the best answer here, and we're doing OK. I'd leave it be until somebody wise in bankruptcy helps.
Thank you all!
Great! I've created the 3 Juriscraper issues, related to this one.
https://github.com/freelawproject/juriscraper/issues/846 https://github.com/freelawproject/juriscraper/issues/845 https://github.com/freelawproject/juriscraper/issues/844
@grossir let me know if you have any questions.
I'll work on checking that the field tracker is working properly on `docket_number` changes using the `pghistory` table.
We continue to experience `ConflictErrors`, as you can see in COURTLISTENER-5V5.
I downloaded and triaged the events from the last two months (since the last time we checked them) to review if there could be any other data integrity issues that these errors can help us detect.
I processed the events to remove duplicates from the same instance: all_events_unique.txt
In that list we can see events like:
"ESRECAPDocument-17096272": [
"27b5d25a8f30454c9732e18abb001b2b",
[
"assigned_to_str"
]
]
Which indicates the parent document being updated, the Sentry event ID, and the fields being updated.
Most of the events are related to the following fields:
assigned_to_str
case_name
I didn't review them in detail, assuming they're related to the same issues described in this issue earlier and that we have created Juriscraper issues to fix them.
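For reference, this is roughly how the per-field tally can be computed from events shaped like the example above. It is a sketch that assumes the events are loaded as a mapping of document key to `[event_id, [fields]]`; the second entry in `sample` is made up for illustration and does not come from the real file.

```python
from collections import Counter


def count_updated_fields(events: dict) -> Counter:
    """Count how often each field name appears across all events."""
    counts = Counter()
    for _doc_key, (_event_id, fields) in events.items():
        counts.update(fields)
    return counts


sample = {
    # Taken from the example above.
    "ESRECAPDocument-17096272": [
        "27b5d25a8f30454c9732e18abb001b2b",
        ["assigned_to_str"],
    ],
    # Made-up entry, only to show an event touching several fields.
    "ESRECAPDocument-0": ["hypothetical-event-id", ["case_name", "assigned_to_str"]],
}

for field, total in count_updated_fields(sample).most_common():
    print(field, total)
```

Sorting by the number of fields per event instead is a quick way to surface the multi-field updates that deserve a closer look.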
There are other events that only update one field at a time. I reviewed some of them, and they seem ok since the data being added appears to be new.
However, there were some that caught my attention because they updated many fields at a time, so I reviewed them in detail:
https://www.courtlistener.com/docket/18615350/cred-inc-adversary-proceeding/
The current content in the docket belongs to this upload:
But previously, the same docket received this upload, which has a different `docket_number`, `date_filed`, and `case_name`:
https://www.courtlistener.com/docket/63219183/herbert-l-whitehead-iii/ The current content in the docket belongs to this upload:
The same docket has also received the following uploads, which seem to be from different cases.
https://www.courtlistener.com/docket/64946857/brittany-n-prichett/ The current content in the docket belongs to this upload:
The case has also received uploads from case query pages that contain different data:
https://www.courtlistener.com/docket/67479778/oconally-v-united-states-department-of-education/ Current content from:
But it has also received uploads from:
The difference in uploads is leading to field changes in the database, which triggered the `ConflictError` because multiple uploads were sent almost simultaneously.
Is it possible that the `pacer_case_id` is the same for these cases, which seem different, or are they related somehow?
My original thinking about the `ConflictError` was that, by addressing the root issues of these data changes, we would gradually see fewer events of this type. Thus, we could continue to investigate just the remaining issues that persist to uncover more data integrity problems or simply discard them if they're unrelated to the data. However, since we still have some unresolved issues, the volume of `ConflictErrors` has not decreased, making it hard to examine every event in detail and determine whether it's ok or something that should be fixed downstream.
So, my question is whether we should continue to trigger and review these errors, or should we shift the exception to the new retry policy, which adjusts the retries according to the volume of documents to update. If we do that, we'll likely see fewer errors of this type, but we might overlook some other data integrity issues that haven't been discovered yet.
These errors look like they're coming from the RECAP extension, not from Elastic, right?
Correct, the errors described here came from the extension, similar to the previous data integrity issues that originated in Juriscraper.
We're only taking advantage of the ES `ConflictError` to discover these issues since they are triggered by data changes.
I think I'd say we should let the errors keep rolling in, but that they shouldn't stand in the way of our re-index. Eventually, we should try to look upstream and see what's causing these, if we can.
Does that sound right to you?
Yeah, that sounds good. Most of the errors triggered here do not prevent the data from being up to date in Elasticsearch. This is because if a `ConflictError` occurs, it indicates that another `UBQ` request was triggered almost simultaneously and succeeded. Only in rare cases, where the data from two simultaneous requests differ, might we miss something.
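To illustrate why a conflict usually means the data is already up to date, here is a toy model of version-checked writes. `ToyIndex` and this `ConflictError` class are stand-ins written for this comment; they are not the Elasticsearch client classes and only mimic the general idea of a version check.

```python
class ConflictError(Exception):
    """Toy stand-in for a version conflict (not the Elasticsearch class)."""


class ToyIndex:
    """Minimal in-memory index where every write must name the version it read."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def read(self, doc_id):
        return self.docs.get(doc_id, (0, {}))

    def write(self, doc_id, expected_version, source):
        current_version, _ = self.read(doc_id)
        if current_version != expected_version:
            # Someone else wrote after we read: reject this write.
            raise ConflictError(doc_id)
        self.docs[doc_id] = (current_version + 1, source)


index = ToyIndex()
# Two requests read the same document version almost simultaneously.
version_a, _ = index.read("docket-1")
version_b, _ = index.read("docket-1")

index.write("docket-1", version_a, {"case_name": "A v. B"})  # first one wins
try:
    index.write("docket-1", version_b, {"case_name": "A v. B"})
    conflict = False
except ConflictError:
    conflict = True

print(conflict, index.read("docket-1"))
```

In this model the losing request carried the same data as the winner, so nothing is lost. Only when the two payloads differ would the rejected write leave the index behind the database.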
These are the parsing differences described in #3143 related to `FreeOpinionReport`:
Opinions Report:
Docket Report:
For instance, in the case of 2:23-cv-08966-DSF-E, the `docket_number` in both the Opinions Report and the Docket Report is the same: `2:23-cv-08966-DSF-E`.
However, the docket number returned by the `FreeOpinionReport` is `2:23-cv-08966-DSF-E`, while the one returned by the `DocketReport` is `2:23-cv-08966`.
Another field that differs is `nature_of_suit`. For instance, the one from the `FreeOpinionReport` is `Copyright`, while the one returned by the `DocketReport` is `820 Copyright`. Not sure whether it is possible to standardize this field or not.
Yeah, let's normalize the docket number, that's an easy fix.
For nature of suit, how about: If there is a value in the field, don't update it. It never changes.
That'd take care of this source of issues, right?
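A minimal sketch of both suggestions, under two assumptions that are mine rather than anything decided in this thread: that docket numbers get normalized towards the shorter `DocketReport` form, and that "don't update" simply means keeping any non-empty stored value. Both function names are hypothetical.

```python
import re

# Assumed shape of the core docket number: office:year-type-sequence,
# e.g. 2:23-cv-08966. Anything after it (judge initials) is dropped.
CORE_DOCKET_RE = re.compile(r"^\s*(\d+:\d{2}-[A-Za-z]{1,4}-\d+)")


def core_docket_number(docket_number: str) -> str:
    """Strip trailing judge initials so both parsers yield the same string."""
    match = CORE_DOCKET_RE.match(docket_number)
    # Leave values we don't recognize untouched apart from whitespace.
    return match.group(1) if match else docket_number.strip()


def keep_existing_nature_of_suit(existing: str, incoming: str) -> str:
    """Only accept an incoming nature of suit when nothing is stored yet."""
    return existing if existing else incoming


print(core_docket_number("2:23-cv-08966-DSF-E"))
print(keep_existing_nature_of_suit("820 Copyright", "Copyright"))
```

With the second helper, whichever source uploads first decides the stored value, so `Copyright` and `820 Copyright` can no longer alternate and trigger updates.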
That'd take care of this source of issues, right?
Yeah, I've already created the related issues for these fixes so you can prioritize them accordingly.
@mlissner After we solved the ConnectionErrors on Elastic UBQ requests and some issues related to data integrity, we have gotten a few ConnectionTimeout and ConflictErrors.
Before deciding to increase the request timeout and adding `ConnectionTimeout` and `ConflictErrors` to the new retry policy, which spreads the retries according to the volume of documents to update, I reviewed all the errors for each issue. This was to confirm that the requests triggering these errors did not involve data integrity issues similar to those we had detected previously, where updates from different sources were triggered when the data varied.
COURTLISTENER-5V4 COURTLISTENER-5V5
These are my findings, where the events seem to be data integrity issues.
Pendergrass v. Raffensperger (N.D. Ga. 2021)
The issue here is that the case query report parses the case name as:
Pendergrass et al v. Raffensperger
While the Docket report as:
Pendergrass v. Raffensperger - Restricted Filer Robert Allensworth, see Order 244
61688278_case_query_gand.txt 61688278_docket_gand.txt
In this case, which version name would be correct?
The Bankruptcy Link - Adversary Proceeding (Bankr. S.D.N.Y. 2008)
The parsed `case_name` from the case query is:
Securities Investor Protection Corporation v. Bernard L. Madoff Investment Securities, LLC. et al
While the docket report parses it as:
The Bankruptcy Link - Adversary Proceeding
4326756_case_query_nysb.txt 4326756_docket_nysb.txt
I'm not sure if it's possible in this case to obtain a standardized name from both sources, so that uploads from these two sources don't trigger unnecessary update tasks.
The Litigation Practice Group P.C. (Bankr. C.D. Cal. 2023)
The problem here is the `assigned_to_str` field:
From the docket report:
"assigned_to_str": "Scott C. Clarkson",
From the case query page:
"assigned_to_str": "Scott C Clarkson"
In both sources, the value is `Scott C Clarkson`, but the parser in the docket report adds a `.` after `C`.
In this case, what would be the correct value?
67051161_docket_cacb.txt 67051161_case_query_cacb.txt
The Litigation Practice Group P.C. - Adversary Proceeding (Bankr. C.D. Cal. 2023)
This case is a combination of the previous errors: both the `assigned_to_str` and the `case_name` changed.
From the docket report:
From case query:
67487044_case_query_cacb.txt 67487044_docket_cacb.txt
The New York Times Company v. Microsoft Corporation (S.D.N.Y. 2023)
I'm not sure if this one is okay. The docket was updated from:
assigned_to_str: ""
assigned_to_str: "Judge Unassigned"
I'm not sure if `Judge Unassigned` is important in the context of the docket or if it's possible to return `None` when this string is detected.
We also got an additional event for this docket, related to the `case_name`. As can be seen in the screenshots, the case name has changed:
Docket 1: The New York Times Company v. MICROSOFT CORPORATION
Docket 2: The New York Times Company v. Microsoft Corporation
RSS Feed: The New York Times Company v. Microsoft Corporation
The `case_name` is the same, but the letter case changed from `MICROSOFT CORPORATION` to `Microsoft Corporation`. The model field tracker detects this change as an update, which triggers an Elastic update. Should we ignore these letter case variations in the field tracker? Or is it correct to trigger the update?
68117049_docket_nysd_2.txt 68117049_docket_nysd_1.txt
Fieldwood Energy LLC and The Official Committee of Unsecured Creditors (Bankr. S.D. Tex. 2020)
This one is also related to the `case_name` parsing from the docket report and another source:
Case name from the docket report:
Fieldwood Energy LLC
Case name from the Claim register:
Fieldwood Energy LLC and The Official Committee of Unsecured Creditors
txsb_17411371_claim.txt txsb_17411371_docket.txt
Sklar Exploration Company, LLC (Bankr. D. Colo. 2020)
Also related to the case_name parsing.
From the RSS Feed:
Sklar Exploration Company, LLC and Sklarco, LLC
From the Docket Report:
Sklar Exploration Company, LLC
cob_17035263_docket.txt
Let me know what you think about these issues, so we can decide what to do in each case, including whether there's something we can address in Juriscraper or if we should ignore some of them in the field tracker.