Open mlissner opened 1 year ago
This is going to be a tough one that's going to have to rely on heuristics, but we can get some easy wins, and at the end of writing this message, I realized how we can find a really great sample of this kind of data. Read on to learn more. :)
The way I began was by looking at a couple different cases. I think the general solution will be to merge docket entries as they're created, if:
That'd work for the example in my last message, above.
Here are some others that I pulled from Ashley v. U.S. Dept of Justice
For some reason .Order, with the dot in front, seems pretty common. I think they're using the dot to alphabetize "order" at the beginning of some internal list they have. So I think if we want to catch and merge these two entries, the above becomes:
Cool. Let's look at another:
Great. The approach still works. Next:
This shows four entries on one day. The first and third merge according to the rules above, as do the second and fourth. Great. Next.
So, notice that they call it ~Util. I think they're again using punctuation to alphabetize. Nice. Unfortunately, we need a new rule for this since the above won't work, and rescheduling things is very common.
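The sorting prefixes seen so far (.Order, ~Util) could be stripped before any comparison. This is a minimal sketch, not code from CourtListener or juriscraper: the function name is hypothetical, and it assumes a leading run of non-word characters never carries meaning in a short description.

```python
import re

# Assumption: leading punctuation such as "." or "~" is only a sorting aid.
_LEADING_PUNCTUATION = re.compile(r"^\W+")


def normalize_short_description(text: str) -> str:
    """Drop leading sort punctuation, collapse whitespace, and lowercase."""
    stripped = _LEADING_PUNCTUATION.sub("", text.strip())
    return re.sub(r"\s+", " ", stripped).lower()


print(normalize_short_description(".Order"))
print(normalize_short_description("~Util - Set/Reset Hearings"))
```

Normalizing both sides this way lets ".Order" and "Order" compare equal without adding a special case for each prefix.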
So...
AND
(juriscraper's RSS parser puts that into the entry, actually), then merge, if:
This approach works for the next two duplicates too:
Great. Next:
Dang! "Status Hearing" and "Status Conference" are synonyms, so:
AND
(juriscraper puts that into these entries, actually), then
Next:
Hm, the dates are different. I think this should be fixed via freelawproject/courtlistener#1282.
Here's another two duplicates:
Both should be fine with the approach so far. Next:
I don't know what happened in this one. It looks like we parsed the RSS twice and merged the contents in different order. There should only be one entry here, not three. Perhaps that's a separate bug.
That's it for the first case. Let's look at another and see if our heuristics hold up. Let's use U.S. v. Bankman-Fried. This case is a bit different because the data is mostly merged already anyway, but let's press on:
This one is interesting. It's a typo that the court fixed. I'm not sure we care, but we could try to do something about this, I guess, by comparing edit distance between numberless docket entries, and fixing them if they have very slight changes?
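One way to try the "very slight changes" idea is the standard library's difflib. Note that SequenceMatcher.ratio() is a similarity ratio rather than a true edit distance, so this is a stand-in for the technique. The function name and the 0.9 threshold are assumptions that would need tuning against real entries.

```python
from difflib import SequenceMatcher


def is_probable_typo_fix(old: str, new: str, threshold: float = 0.9) -> bool:
    """Return True when two descriptions differ, but only very slightly."""
    if old == new:
        return False  # identical text is not a correction
    ratio = SequenceMatcher(None, old.lower(), new.lower()).ratio()
    return ratio >= threshold


print(is_probable_typo_fix("Status Conferense", "Status Conference"))
print(is_probable_typo_fix("Status Conference", "Order on Motion to Exclude"))
```

A high threshold keeps this conservative, which matters because a false positive would merge two entries that are genuinely different.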
Anyway, here are four more that are properly merged:
I'm pretty sure this case has lots of nicely merged contents because somebody purchased the "Docket History Report," which has the short and long versions of entries together (just make sure to check "Display docket text"):
That report shows you something like this:
Not bad. In fact, I'm realizing we have a LOT of these kinds of reports that have been uploaded to RECAP, and we'd be silly not to use them as test cases. So I guess the above work I just did provides a lot of examples of how minute entries can get merged, and provides a prototype for doing it, but what we should do to fix this is download a couple hundred docket history reports we have saved, and use them to develop the algorithm properly.
That should be much more thorough than what I was doing above. GREAT.
One note on philosophy. As we're doing this, it's important not to have false positives, so conservatively merging is the way to go. We don't want to accidentally merge things we shouldn't, since that could prevent an alert from going out.
I think this issue is an interesting one because it could be the first time that ML is built into CourtListener. So, before we go down that road, there are some big questions that need answering. Among others:
Is ML going to be the right solution here? Can we do as well or better with an old-fashioned approach? Given the numerous questions below, I want to be convinced that ML is better before we invest so much in it.
If ML is the solution, which ML package should we choose as our standard for CourtListener? I think we've played with PyTorch, Apple's thing, scikit-learn, SciPy, etc. We should choose carefully and standardize on one. Bill and Kevin have been using some of these, so we should get their opinions and experience before making a selection.
Once we've selected a library we like, we need to figure out the best practice for building and shipping models:
Do we need a GPU for any of this? It's not crazy to have GPUs around (it adds cost), but if we do, we need to figure out how to make sure our workloads use them.
Maybe it makes more sense to do this stuff with an AWS lambda? I think you can get a GPU in AWS Lambda pretty easily. I don't know.
What about tests? How do we test that the model is working properly and is actually getting better over time?
I think ML has a ton of potential, but the questions above exemplify why I haven't jumped at using it in CourtListener. Before we use it, we have a lot of questions to answer.
Great, I'll review the questions above and check the provided examples, so we can determine whether ML is in fact a viable solution, weighing its upsides and downsides, or whether a traditional approach would be more appropriate.
I've been working on trying to meet the objective using a simple heuristic without using ML, so we can evaluate if it's viable to avoid the complexity of implementing ML.
These are the results so far:
I gathered history reports that contain both short and long descriptions (I noticed that most of the stored history reports don't contain long descriptions) so I can use them to evaluate the performance of the heuristic.
The heuristic used does the following:
True.
False.
Some explanations of the previous statements:
This report shows all the entries that couldn't be "merged". failed_entries.txt
From the history reports I only considered minute entries.
These are the global results from my tests: of 514 minute entries, 116 have conflicts that prevent them from being "merged". That represents a failure rate of 22.56%.
Total minute entries: 514
Total minute entries with conflict: 116
Global percentage of conflicts: 22.56809338521401 %
Here are the results broken down by docket:
Docket number: 1:21-cr-00433 Total minute entries: 40 Minute entries merge conflict: 9 Percentage of conflicts: 22.5%
Docket number: 1:18-cr-00538 Total minute entries: 109 Minute entries merge conflict: 14 Percentage of conflicts: 12.844036697247706 %
Docket number: 0:17-md-02795 Total minute entries: 47 Minute entries merge conflict: 10 Percentage of conflicts: 21.27659574468085 %
Docket number: 5:22-cv-00699 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 2:22-cr-20019 Total minute entries: 2 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 3:21-cv-15310 Total minute entries: 13 Minute entries merge conflict: 2 Percentage of conflicts: 15.384615384615385 %
Docket number: 1:19-cr-00395 Total minute entries: 71 Minute entries merge conflict: 22 Percentage of conflicts: 30.985915492957748 %
Docket number: 3:13-cv-00808 Total minute entries: 82 Minute entries merge conflict: 26 Percentage of conflicts: 31.70731707317073 %
Docket number: 5:23-cv-00160 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 1:23-cv-00492 Total minute entries: 3 Minute entries merge conflict: 1 Percentage of conflicts: 33.333333333333336 %
Docket number: 1:22-cv-03203 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 0:18-cv-01776 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 1:21-cv-02923 Total minute entries: 22 Minute entries merge conflict: 1 Percentage of conflicts: 4.545454545454546 %
Docket number: 1:19-cv-00725 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 3:20-cv-07811 Total minute entries: 7 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 1:22-cv-01602 Total minute entries: 9 Minute entries merge conflict: 2 Percentage of conflicts: 22.22222222222222 %
Docket number: 1:23-mj-02068 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 1:23-cv-00287 Total minute entries: 6 Minute entries merge conflict: 5 Percentage of conflicts: 83.33333333333333 %
Docket number: 4:22-mj-00469 Total minute entries: 1 Minute entries merge conflict: 1 Percentage of conflicts: 100.0 %
Docket number: 1:21-cr-00399 Total minute entries: 49 Minute entries merge conflict: 9 Percentage of conflicts: 18.367346938775512 %
Docket number: 1:22-cr-00673 Total minute entries: 18 Minute entries merge conflict: 5 Percentage of conflicts: 27.77777777777778 %
Docket number: 5:22-cv-00752 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
Docket number: 1:20-cv-10821 Total minute entries: 20 Minute entries merge conflict: 9 Percentage of conflicts: 45.0 %
Docket number: 5:22-cv-05137 Total minute entries: 2 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %
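As a sanity check, the per-docket figures above can be re-aggregated to confirm the global totals. The numbers are copied from the breakdown and the variable names are only illustrative. Rounded to two decimals the rate comes out as 22.57%, which the thread truncates to 22.56%.

```python
# (docket number, total minute entries, minute entries with a merge conflict)
RESULTS = [
    ("1:21-cr-00433", 40, 9), ("1:18-cr-00538", 109, 14),
    ("0:17-md-02795", 47, 10), ("5:22-cv-00699", 3, 0),
    ("2:22-cr-20019", 2, 0), ("3:21-cv-15310", 13, 2),
    ("1:19-cr-00395", 71, 22), ("3:13-cv-00808", 82, 26),
    ("5:23-cv-00160", 1, 0), ("1:23-cv-00492", 3, 1),
    ("1:22-cv-03203", 3, 0), ("0:18-cv-01776", 1, 0),
    ("1:21-cv-02923", 22, 1), ("1:19-cv-00725", 1, 0),
    ("3:20-cv-07811", 7, 0), ("1:22-cv-01602", 9, 2),
    ("1:23-mj-02068", 1, 0), ("1:23-cv-00287", 6, 5),
    ("4:22-mj-00469", 1, 1), ("1:21-cr-00399", 49, 9),
    ("1:22-cr-00673", 18, 5), ("5:22-cv-00752", 3, 0),
    ("1:20-cv-10821", 20, 9), ("5:22-cv-05137", 2, 0),
]

total = sum(entries for _, entries, _ in RESULTS)
conflicts = sum(failed for _, _, failed in RESULTS)
rate = 100 * conflicts / total

print(f"{conflicts}/{total} = {rate:.2f}%")  # prints 116/514 = 22.57%
```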
Next, you can find some common cases I detected as conflicts:
The short description of this one, "Motion to Appoint Counsel", matched two long descriptions. "ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)" should be the only match, but it also matched another long description from the same date that contains the same words as the short description: "Oral Motion by Defendant to Appoint Counsel."
{
"docket_number": "1:21-cr-00433",
"court_id": "dcd",
"date_filed": "2021-08-13",
"description":"ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Motion to Appoint Counsel",
"id":2,
"matched_descriptions":[
"ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
"Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)"
],
"merge":false
}
This example is similar to the one above: the words of the short description "Motion to Appoint Counsel" are contained in both long descriptions.
{
"docket_number": "1:21-cr-00433",
"court_id": "dcd",
"date_filed": "2021-08-13",
"description":"Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Order on Motion to Appoint Counsel",
"id":4,
"matched_descriptions":[
"ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
"Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)"
],
"merge":false
}
As with the previous one, the short description of this one has two matches. This one is more complicated since the short description is only one word, "Order", so it's contained in both long descriptions.
{
"docket_number": "1:21-cr-00433",
"court_id": "dcd",
"date_filed": "2021-08-19",
"description":"MINUTE ORDER (paperless), as to GARY JAMES HARMON, ISSUING the following SCHEDULING ORDER: (1) by August 24, 2021, the government shall file its motion for pretrial detention; (2) defendant shall file his opposition by August 30, 2021; (3) the government shall file any reply by September 1, 2021; and (4) the parties shall appear via videoconference at 9:30am on September 9, 2021, for a hearing on the government's motion for pretrial detention. Signed by Chief Judge Beryl A. Howell on August 19, 2021. (lcbah2)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Order",
"id":7,
"matched_descriptions":[
"Minute Entry for proceedings held before Chief Judge Beryl A. Howell: Status Conference as to GARY JAMES HARMON held via videoconference on 8/19/2021; the Defendant agreed to participate via videoconference after consultation with counsel. A Bond Hearing is scheduled for 9/9/2021, at 9:30 AM before Chief Judge Beryl A. Howell; a briefing scheduling order will be issued by the Court. A further Status Hearing is scheduled for 9/23/2021, at 9:00 AM before Chief Judge Beryl A. Howell. The Defendant agreed to exclude time under the Speedy Trial Act until the next status hearing of 9/23/2021. The Court found that time under the Speedy Trial Act shall be excluded from 8/19/2021 through 9/23/2021, in the interests of justice and those interests outweigh the interests of the public and the defendant in a speedy trial in order to give the parties time to discuss a protective order, give the government time for production of discovery, and the Defendant and his counsel time to review and discuss the discovery. Bond Status of Defendant: Defendant committed. Present via videoconference: Defense Attorney: Sabrina P. Shroff; US Attorneys: Christopher B. Brown and Catherine Pelker. Court Reporter: Elizabeth Saint-Loth. (ztg)",
"MINUTE ORDER (paperless), as to GARY JAMES HARMON, ISSUING the following SCHEDULING ORDER: (1) by August 24, 2021, the government shall file its motion for pretrial detention; (2) defendant shall file his opposition by August 30, 2021; (3) the government shall file any reply by September 1, 2021; and (4) the parties shall appear via videoconference at 9:30am on September 9, 2021, for a hearing on the government's motion for pretrial detention. Signed by Chief Judge Beryl A. Howell on August 19, 2021. (lcbah2)"
],
"merge":false
}
In this case the problem is that the target short description is "Scheduling Order". Since the heuristic only requires 50% of the words, it is enough for one of them to appear in the long description. Here that word is "Order", which is in both of them:
"ORDER granting... " and "SCHEDULING ORDER..."
I was wondering whether conflicts like these could be solved if we consider the order of the words, but that could also affect other matches where the words of the short description are not always in order within the long description.
{
"docket_number": "1:18-cr-00538",
"court_id": "nyed",
"date_filed": "2022-04-28",
"description":"SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022 at 10:00 AM in Courtroom 6F North before Chief Judge Margo K. Brodie..Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Scheduling Order",
"id":9,
"matched_descriptions":[
"ORDER granting [203] Consent MOTION for Extension of Time to File Post-Trial Motions. Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)",
"SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022 at 10:00 AM in Courtroom 6F North before Chief Judge Margo K. Brodie..Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)"
],
"merge":false
}
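The conflict above can be reproduced with a small sketch of a 50% token-overlap check. The tokenization, the function names and the exact threshold handling are assumptions rather than the heuristic's actual code, and the long descriptions are shortened.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_fraction(short_description: str, long_description: str) -> float:
    """Fraction of short-description tokens found in the long description."""
    short_tokens = tokens(short_description)
    if not short_tokens:
        return 0.0
    long_tokens = set(tokens(long_description))
    hits = sum(1 for token in short_tokens if token in long_tokens)
    return hits / len(short_tokens)


def candidate_matches(short_description, long_descriptions, threshold=0.5):
    """Long descriptions containing at least `threshold` of the short tokens."""
    return [
        description
        for description in long_descriptions
        if match_fraction(short_description, description) >= threshold
    ]


same_day = [
    "ORDER granting [203] Consent MOTION for Extension of Time to File Post-Trial Motions.",
    "SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022 at 10:00 AM.",
]
matches = candidate_matches("Scheduling Order", same_day)
merge = len(matches) == 1  # merge only when the match is unambiguous
print(len(matches), merge)
```

Both long descriptions clear the threshold because "Order" alone is half of the short description, so the entry is left unmerged.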
This is also a clear conflict since both matched long descriptions contain the short description "Status Conference".
{
"docket_number": "1:18-cr-00538",
"court_id": "nyed",
"date_filed": "2022-02-25",
"description":"MINUTE ENTRY: Status conference by telephone was held before Chief Judge Margo K. Brodie on February 25, 2022. AUSAs Drew Rolle, Alixandra Smith, and Dylan Stern, and Brent Wible and Jennifer Ambuehl of the Department of Justice, appeared on behalf of the government. Retained counsel Marc Agnifilo, Teny Geragos, Jacob Kaplan, and Zach Intrater appeared on behalf of Defendant Ng Chong Hwa. Henry Mazurek and Ilana Haramati appeared on behalf of Mr. Lessiner. The jury trial will continue on March 1, 2022, at 9:30 AM in Courtroom 4F North before Chief Judge Margo K. Brodie. The Court scheduled a status conference for February 28, 2022, at 10:00 AM. The parties will notify the Court if a status conference is needed by February 27, 2022. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. (Court Reporter Anthony Frisilone.) (Valentin, Winnethka)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Status Conference",
"id":39,
"matched_descriptions":[
"SCHEDULING ORDER as to Ng Chong Hwa. The Court scheduled a status conference for February 25, 2022, at 3:00 PM. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. Ordered by Chief Judge Margo K. Brodie on 2/25/2022. (Valentin, Winnethka)",
"MINUTE ENTRY: Status conference by telephone was held before Chief Judge Margo K. Brodie on February 25, 2022. AUSAs Drew Rolle, Alixandra Smith, and Dylan Stern, and Brent Wible and Jennifer Ambuehl of the Department of Justice, appeared on behalf of the government. Retained counsel Marc Agnifilo, Teny Geragos, Jacob Kaplan, and Zach Intrater appeared on behalf of Defendant Ng Chong Hwa. Henry Mazurek and Ilana Haramati appeared on behalf of Mr. Lessiner. The jury trial will continue on March 1, 2022, at 9:30 AM in Courtroom 4F North before Chief Judge Margo K. Brodie. The Court scheduled a status conference for February 28, 2022, at 10:00 AM. The parties will notify the Court if a status conference is needed by February 27, 2022. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. (Court Reporter Anthony Frisilone.) (Valentin, Winnethka)"
],
"merge":false
}
There are more similar examples where one of the docket entries from the same date can't be properly matched because more than one long description matched. I think these problems might also be present if we use an ML approach, don't you think?
The match failure rate is still far away from the desired 10%. However, most of the conflicts are related to docket entries that are matched by two or more long descriptions. I think in production we could add an additional check: if a docket entry was already merged with its long description, we'll know that long description shouldn't match any other short description. But this will depend on the order in which long descriptions come in, or on whether they come in together.
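The "already merged" check could look roughly like the following. It is a sketch under assumptions: the helper names are hypothetical, long descriptions are compared as plain strings, and the second description is shortened with an ellipsis for readability.

```python
def unclaimed(candidates, claimed):
    """Candidate long descriptions not yet merged with another entry."""
    return [candidate for candidate in candidates if candidate not in claimed]


def try_merge(candidates, claimed):
    """Merge only when exactly one unclaimed candidate remains."""
    remaining = unclaimed(candidates, claimed)
    if len(remaining) != 1:
        return None  # still ambiguous, or nothing left: do not merge
    claimed.add(remaining[0])
    return remaining[0]


oral_motion = "ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)"
minute_entry = "Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON ..."

# Suppose "Motion to Appoint Counsel" was already merged with the oral motion.
claimed = {oral_motion}

# "Order on Motion to Appoint Counsel" matched both, but only one is still free.
result = try_merge([oral_motion, minute_entry], claimed)
print(result == minute_entry)
```

The result depends on which entry is processed first, which is the ordering caveat mentioned above.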
Let me know what you think.
Well, this is a tricky problem to be sure. I looked at a bunch of them and tried to come up with some thoughts.
A few observations:
It seems like a lot of cases might be fixed if you compare the percentages matched. For example, imagine if two entries match a short description on the same day, and one matches 100% of tokens while the other matches 50%. In this case, I think the better match is probably the one to merge.
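There are two kinds of errors when merging:
Merges that are wrong in meaning. For example, if you merge a "scheduling order" with a non-scheduling order, that's bad.
Merges that are just technically wrong. For example, if you have two long descriptions that are orders, and you merge a short description of "Order" with the wrong one. It doesn't really matter. You merged "Order" with an order. That's OK.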
Some can't be merged correctly even by a human.
Some words are more important than others, and there could be opportunities for synonyms. For example, Minute Entry == Order. "Motion" is an important word when matching.
I agree that word order could be helpful with a lot of these, but perhaps not all.
In many cases, it's very clear to me, as a human, which should get merged because I can understand the words and know which have a high likelihood of changing the meaning of the short or long description. This hints that ML might work well.
I think your current method of looking for 50% seems crude since some words have more value than others.
Sometimes it's helpful to have a rule that takes care of the easy cases and another to deal with the rest.
For your stats can you clarify what the 22.56% refers to? Is that how many will fail globally or how many of the duplicates will fail?
Thank you for your comments and suggestions, some additional comments and ideas:
- It seems like a lot of cases might be fixed if you compare the percentages matched. For example, imagine if two entries match a short description on the same day, and one matches 100% of tokens while the other matches 50%. In this case, I think the better match is probably the one to merge.
Yeah, I was thinking something like this. However, for it to work, we would need to have the two or more long descriptions available to merge at the same time. This way, we can compare which one has the highest percentage. This might not work if, for example, we have two short descriptions for a date, and then a long description comes, for instance, via recap.email, which only contains one entry. In that case, we'll only have one long description to compare with the two possible short descriptions to merge in. I think this problem would be less common via the recap extension since they usually have many docket entries that could include all the entries available for a date (but it's still possible).
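When all long descriptions for a date are available together, picking the highest percentage could be sketched as below. The function names, the 0.5 floor and the decision to return nothing on a tie are assumptions chosen to stay conservative. The long descriptions are shortened.

```python
import re


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap(short_description, long_description):
    """Fraction of short-description tokens present in the long description."""
    short_tokens = _tokens(short_description)
    if not short_tokens:
        return 0.0
    long_tokens = set(_tokens(long_description))
    return sum(t in long_tokens for t in short_tokens) / len(short_tokens)


def best_match(short_description, long_descriptions, min_score=0.5):
    """Return the single best-scoring long description, or None."""
    scored = sorted(
        ((overlap(short_description, d), d) for d in long_descriptions),
        reverse=True,
    )
    if not scored or scored[0][0] < min_score:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie: no clear winner, so do not merge
    return scored[0][1]


same_day = [
    "ORDER granting [203] Consent MOTION for Extension of Time.",
    "SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022.",
]
print(best_match("Scheduling Order", same_day) == same_day[1])
print(best_match("Order", same_day))
```

With only one long description available, as in the recap.email case, this reduces to the plain threshold check and cannot tell which short description is the better fit.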
There are two kinds of errors when merging:
First: Merges that are wrong in meaning. For example, if you merge a "scheduling order" with a non-scheduling order, that's bad.
Got it. I think using the percentage approach when having all the long description entries available to merge simultaneously might help solve this kind of wrong merge. Alternatively, it could also work here to rate words according to their importance. For example, in this case, "non-scheduling" could have a higher weight than "order," so the entry that contains "non-scheduling" would be merged.
Second: Merges that are just technically wrong. For example, if you have two long descriptions that are orders, and you merge a short description of "Order" with the wrong one. It doesn't really matter. You merged "Order" with an order. That's OK.
Yeah, this is correct. If there are two "Order" sections that are short, it doesn't matter which one is merged first. In the end, they will belong on the same date.
Some words are more important than others, and there could be opportunities for synonyms. For example, Minute Entry == Order. "Motion" is an important word when matching.
In order to check which words are more important than others, I think we can extract the most common words from minute entries and set a weight to these words. What do you think?
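One possible way to derive such weights is inverse document frequency, where words that appear in most minute entries count for less. This is a swapped-in technique offered as a sketch, not something the thread settled on. The function names, the smoothing formula and the three-entry corpus are all illustrative.

```python
import math
import re
from collections import Counter


def token_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def idf_weights(descriptions):
    """Weight each token by how rare it is across the sample."""
    n = len(descriptions)
    document_frequency = Counter(t for d in descriptions for t in token_set(d))
    return {
        token: math.log((1 + n) / (1 + count)) + 1.0
        for token, count in document_frequency.items()
    }


def weighted_overlap(short_description, long_description, weights, default=1.0):
    """Share of the short description's total weight found in the long one."""
    short_tokens = token_set(short_description)
    if not short_tokens:
        return 0.0
    long_tokens = token_set(long_description)
    total = sum(weights.get(t, default) for t in short_tokens)
    found = sum(weights.get(t, default) for t in short_tokens if t in long_tokens)
    return found / total


corpus = [
    "ORDER granting motion for extension of time",
    "SCHEDULING ORDER as to the defendant",
    "MINUTE ORDER paperless",
]
weights = idf_weights(corpus)
print(weighted_overlap("Scheduling Order", corpus[0], weights) < 0.5)
print(weighted_overlap("Scheduling Order", corpus[1], weights))
```

Because "order" appears in every sample entry it carries little weight, so matching on "order" alone no longer reaches 50%.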
In many cases, it's very clear to me, as a human, which should get merged because I can understand the words and know which have a high likelihood of changing the meaning of the short or long description. This hints that ML might work well.
Does this mean we should do a test using ML so we could compare how well it performs and compare with heuristics? Or first, continue working on the heuristic?
Sometimes it's helpful to have a rule that takes care of the easy cases and another to deal with the rest.
Sure, I'll keep this in mind so we can use a rule for the simple ones and a different approach for edge cases.
For your stats can you clarify what the 22.56% refers to? Is that how many will fail globally or how many of the duplicates will fail?
Yeah, this percentage refers to how many minute entries will fail globally (according to my sample) to be merged, including duplicates and non-duplicate minute entries.
- Finally, I was thinking that in order to get a better sample, could we find more docket history reports that we have stored? I only used 25 dockets because it was difficult to find them manually since most of them don't contain long descriptions. So I could write a command that looks for docket history reports with long descriptions and returns the paths to download them so that we could have a bigger sample (maybe 200 docket reports)?
I was able to bend JHawk's ear a bit about this issue in the #pacer channel today. He had some very good ideas, but the general concept is that we have another signal we can use to sort this out, the pacer_sequence_number. We get and store these for all RSS entries and we get them for any item with a document link on the docket sheet.
So what we can do is:
I think that should work?
A few other thoughts from Slack:
The one issue that JHawk pointed out for this is that the pacer_sequence_number is created when the item is Entered into PACER, not when it is Filed by the attorney. Our docket sheets are usually ordered by date_filed, not date_entered, so there could be some difference here.
Luckily, for numberless entries like these, which are created by the court, the filed and entered should be the same.
I could write a command that looks for docket history reports with long descriptions and returns the paths to download them
If this is still useful, want to just give me some code and I can run it?
Does this mean we should do a test using ML?
Sounds like we won't need it!
Docket History reports are always complete, so when we get one of those, we could use it to clean up the docket and merge things properly. If we do that, we can be sure that Docket History reports never create duplicates, which is at least one thing fixed. I think this would be tricky to get right, so we probably shouldn't do it, but it's a thought I had.
It's weird that this fits the 80/20 rule almost exactly. We get 78% easily and 22% is going to be a pain. Weird.
@mlissner thanks, I'll be reviewing this new approach and doing some testing. I'm just afraid that for appellate RSS feeds we don't have the pacer_sequence_number, so we might still need to use some heuristics.
Yeah, having a bigger sample of history reports will be helpful. Here is some code to get only DocketHistory reports with long descriptions.
from juriscraper.pacer import DocketHistoryReport
from cl.recap.models import PacerHtmlFiles, UPLOAD_TYPE

# Newest uploads first.
history_reports = PacerHtmlFiles.objects.filter(
    upload_type=UPLOAD_TYPE.DOCKET_HISTORY_REPORT
).order_by("-date_created")

long_des_hr = []
for hr in history_reports.iterator():
    text = hr.filepath.read().decode()
    report = DocketHistoryReport("default")
    report._parse_text(text)
    data = report.data
    docket_entries = data["docket_entries"]
    # Only inspect the first few entries of each report.
    for i, de in enumerate(docket_entries):
        if de["short_description"] and de["description"]:
            long_des_hr.append(hr)
            break
        if i > 2:
            break
    # Stop once the sample is big enough.
    if len(long_des_hr) >= 200:
        break

print("Docket history reports with long descriptions:")
for report in long_des_hr:
    print(f"ID: {report.pk} URL: {report.filepath}")
It works by looking for DOCKET_HISTORY_REPORT PacerHtmlFiles files and checking whether their docket entries contain both short and long descriptions. We only check the first few entries of each file, in case the first one is not representative.
The script will end once we have 200 DocketHistory reports with long descriptions, so it might take a while.
I'm running the script now. Good point about appellate, but we can worry about that next. Let's get district taken care of and then deal with corner cases in a second pass.
Here are the results, collapsed to keep our conversation smaller:
I dug into the new proposed approach of using the pacer_sequence_number to sort short descriptions for a proper merging of long descriptions. However, as you mentioned, numberless entries do not have a pacer_sequence_number, and since our problem is specifically related to such entries, this approach cannot be used.
As you also mentioned, we could use dates and times to sort the short descriptions and then merge them. However, I discovered a couple of issues while checking this:
- It appears that there is a bug when parsing and merging entries that have the same metadata but different descriptions.
e.g: https://www.courtlistener.com/docket/60109244/united-states-v-slaeker/
If you look at this docket you'll find two groups of entries where each group should be only one entry.
The problem is that the descriptions of these entries are the same. But they are sorted differently on each entry. When Juriscraper parses the feed, entries with the same metadata and different descriptions are joined by appending an "AND", but the order of these entries sometimes varies depending on when the feed was parsed, e.g.: dcd-08-13-2021-17-21-03.txt dcd-08-14-2021-14-43-31.txt
If you open the RSS feed dcd-08-13-2021-17-21-03.txt and search for Fri, 13 Aug 2021 14:43:06 GMT, you'll see the following group of entries:
1:21-mj-00545-1 USA v. SLAEKER:
[Speedy Trial - Excludable Start] [~Util - Set/Reset Hearings] [Order on Motion to Exclude] [Initial Appearance]
Description after being merged by juriscraper:
Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings AND Order on Motion to Exclude AND Initial Appearance
Then, if you open the next file, which is the feed for the same court many hours later (also search for Fri, 13 Aug 2021 14:43:06 GMT), you'll see the same group of entries but in a different order:
1:21-mj-00545-1 USA v. SLAEKER
[~Util - Set/Reset Hearings] [Speedy Trial - Excludable Start] [Initial Appearance] [Order on Motion to Exclude]
As you can see, the order changes, so entries are merged by Juriscraper in a different order:
~Util - Set/Reset Hearings AND Speedy Trial - Excludable Start AND Initial Appearance AND Order on Motion to Exclude.
Since these are numberless entries, the only way to detect the same entry when merging them in CL is the description. But as you can see, the description changes with the order, so the previous entry is not found and a new one is added.
So, this problem is causing duplicate entries and, for the same reason, it could affect the approach of ordering short descriptions on the same day so that we can merge their long descriptions properly.
I think this problem should be solved first. We could do something like this in CL: if there is an existing docket entry on the same day and the short description contains "AND," split the descriptions into parts, and then compare part by part instead of the whole short description. If all the parts match, we could say it is the same entry.
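To make the part-by-part comparison concrete, here is a minimal sketch. The function names split_parts and is_same_entry are hypothetical (they are not existing CL helpers), and it assumes Juriscraper always joins parts with the literal separator " AND ":

```python
def split_parts(short_description: str) -> list[str]:
    """Split a merged short description into its sorted, stripped parts.

    Assumes parts were joined with the literal separator " AND ".
    """
    parts = (part.strip() for part in short_description.split(" AND "))
    return sorted(part for part in parts if part)


def is_same_entry(existing: str, incoming: str) -> bool:
    """Treat two short descriptions as the same entry if their parts match,
    regardless of the order in which the parts were merged."""
    return split_parts(existing) == split_parts(incoming)


first = (
    "Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings "
    "AND Order on Motion to Exclude AND Initial Appearance"
)
second = (
    "~Util - Set/Reset Hearings AND Speedy Trial - Excludable Start "
    "AND Initial Appearance AND Order on Motion to Exclude"
)
print(is_same_entry(first, second))
```

Sorted lists are compared rather than sets so that a description containing the same part twice is not treated as equal to one containing it once.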
- On the other hand, here is an example of the difference between the date filed and the date entered that you mentioned:
If you look at the file:
And search for 1:21-mj-00545-1, you'll see 11 entries: two groups of 4 entries with the same date, like the one described before, and 3 other entries with different times and descriptions.
1:21-mj-00545-1 USA v. SLAEKER
Appoint Counsel Thu, 12 Aug 2021 18:02:47 GMT
Exclude Thu, 12 Aug 2021 18:05:58 GMT
Order on Motion to Appoint Counsel AND Speedy Trial - Excludable Start AND Order on Motion to Exclude AND Initial Appearance Thu, 12 Aug 2021 22:47:41 GMT
Exclude Fri, 13 Aug 2021 14:34:39 GMT
Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings AND Order on Motion to Exclude AND Initial Appearance Fri, 13 Aug 2021 14:43:06 GMT
And if you look at the docket or docket history report, the date filed for some of these entries is 08/09/2021 and 08/12/2021:
But it seems that the order of these entries as they appear in the feed (the date_filed we use) matches the order of the docket or docket history report. I just want to be sure that's true, could you help me to confirm it? For example, the first entry's short description is Appoint Counsel and in the docket history report it says Motion to Appoint Counsel, and the second one says Exclude and in the docket history report it says Motion to Exclude. Are they the same?
If we can confirm that the order of the entries as they appear in the feed is the same as the order of entries in the docket sheet, I think we'll be able to use the date_filed and time_filed or the recap_sequence_number to order short descriptions before merging the long descriptions. (I checked other examples and this seems to be true, but it is a bit tricky.)
- An additional point:
Before adding time_filed to docket entries and parsing the timezone from RSS feeds, the recap_sequence_number had the format 2021-08-12T18:05:58+00:00.001. After adding time_filed, the recap_sequence_number is computed based only on the date_filed, using the same format we use when adding entries from dockets: 2021-08-11.001.
However, I noticed a possible issue when adding multiple entries for a docket from an RSS feed on the same day. The recap_sequence_number is the same for all the entries added on that day, because different entries for the same docket are reported independently. Therefore, when computing the recap_sequence_number, the counter remains the same.
A couple of possible solutions for this might be:
- To ensure that the recap_sequence_number can be properly computed when added into CL, we need to group docket entries when parsing the RSS feed in Juriscraper. This means that if there is more than one entry for the same docket, they should be added within the same docket, so that the docket_entries key will have multiple entries instead of one. However, this approach may require an additional tweak. If more entries for the same day are added afterward, they will start from recap_sequence_number counter 1. To address this, we may need to query the last entry on the database for that day so that the next recap_sequence_number counter can continue from the previous one.
- Alternatively, we could go back to using both the date and time to compute the recap_sequence_number. This may be the easiest solution, as using the date and time will result in a different recap_sequence_number even when the counter is the same (1).
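As a rough illustration of the first option, the counter could continue from the highest one already stored for that day. This is only a sketch: next_recap_sequence_number is a hypothetical name, and the existing list stands in for whatever query would return the sequence numbers already saved for the docket.

```python
from datetime import date


def next_recap_sequence_number(date_filed: date, existing: list[str]) -> str:
    """Return the next sequence number for date_filed, e.g. 2021-08-11.003.

    `existing` is a stand-in for the sequence numbers already stored for
    this docket (hypothetical; in CL this would come from a database query).
    """
    prefix = date_filed.isoformat()
    counters = []
    for number in existing:
        day, _, counter = number.rpartition(".")
        # Only count entries from the same day; other formats are ignored.
        if day == prefix and counter.isdigit():
            counters.append(int(counter))
    next_counter = max(counters, default=0) + 1
    return f"{prefix}.{next_counter:03d}"


stored = ["2021-08-10.005", "2021-08-11.001", "2021-08-11.002"]
print(next_recap_sequence_number(date(2021, 8, 11), stored))
```

With no entries stored for the day, the counter starts at 001, which matches the current behavior for the first entry.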
So, in brief:
- The recap_sequence_number for docket entries added by RSS feed is the same for entries added on the same day. We can fix this issue using one of the solutions suggested above.
- We can't use the pacer_sequence_number, since minute entries don't get this field. Instead, we can use the date_filed and time_filed (for entries where available) or the recap_sequence_number (for entries that don't have time_filed) to order short descriptions. This is because the date_filed (and time_filed) for entries added via an RSS feed is the date the entry was published in the feed. The order in which entries appear in the feed seems to be the same as the order in which they appear in the docket sheet, but we need to confirm this is true. There are some tricky examples, such as the one described above, where the order seems a bit confusing.
Let me know what you think.
Man, Alberto, you're in the guts of things now! Good research. A few replies, since there's a lot to think about here:
The problem is that the descriptions of these entries are the same. But they are sorted differently on each entry.
This is annoying. It must be their database ordering things in arbitrary order when the dates are identical. There's a simple solution to this. Let's update Juriscraper to order these items alphabetically regardless of the order they appear in the feed, if they have the same dates.
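Roughly, the idea would look like the sketch below. It is not Juriscraper's actual code: merge_feed_items is a hypothetical name, and it assumes each feed item can be reduced to a (docket number, publication date, short description) tuple.

```python
from collections import defaultdict


def merge_feed_items(items):
    """Group items by (docket_number, pub_date) and join their descriptions.

    Descriptions are sorted alphabetically before joining, so the merged
    string no longer depends on the order the feed happened to use.
    """
    groups = defaultdict(list)
    for docket_number, pub_date, description in items:
        groups[(docket_number, pub_date)].append(description)
    return {key: " AND ".join(sorted(parts)) for key, parts in groups.items()}


key = ("1:21-mj-00545-1", "Fri, 13 Aug 2021 14:43:06 GMT")
first_feed = [
    (*key, "Speedy Trial - Excludable Start"),
    (*key, "~Util - Set/Reset Hearings"),
    (*key, "Order on Motion to Exclude"),
    (*key, "Initial Appearance"),
]
second_feed = [
    (*key, "~Util - Set/Reset Hearings"),
    (*key, "Speedy Trial - Excludable Start"),
    (*key, "Initial Appearance"),
    (*key, "Order on Motion to Exclude"),
]
print(merge_feed_items(first_feed) == merge_feed_items(second_feed))
```

Both feeds now produce the same merged description, so the existing lookup by description in CL should find the earlier entry instead of adding a new one.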
You suggested that we:
if there is an existing docket entry on the same day and the short description contains "AND," split the descriptions into parts, and then compare part by part instead of the whole short description. If all the parts match, we could say it is the same entry.
But I think that won't be necessary if we handle it in Juriscraper.
You asked:
But [it] seems that the order of these entries as they appear in the feed (the date_filed we use) matches the order of the docket or docket history report. I just want to be sure that's true, could you help me to confirm it?
This is safe to assume, but with two caveats:
Does recap_sequence_number need tweaks?
You observed:
After adding time_filed, the recap_sequence_number is computed based only on the date_filed, using the same format we use when adding entries from dockets: 2021-08-11.001.
I'm assuming we did this because we don't get the time value when we get docket reports, so including the time in some sequence numbers but not all of them was problematic? If so, I don't think we can continue adding the time to this field.
Of course, if adding the time back to this field doesn't cause trouble, it does seem like a simple solution though!
Your other, more complicated approach of
the docket_entries key [having] multiple entries instead of one.
And:
[querying] the last entry on the database for that day so that the next recap_sequence_number counter can continue from the previous one.
Makes sense to me, even though it seems like a pain! :)
Thanks for raising these tricky issues. I'm glad to see you wrestling with this beast.
Great, thanks for your answers!
Yeah, ordering descriptions in Juriscraper will be simpler. I'll work on it.
About recap_sequence_number: yes, we removed the time from it to standardize the sequence numbers, but I don't remember whether there was a problem related to it. I'll confirm.
This is a problem in CourtListener that we've never tried very hard to solve, but maybe we should.
In many cases, we get the short description of a minute entry and the long description. Since they have no identifier, we have no way of merging them properly:
And we wind up with two docket entries, as above. It's not great and, worse, it means we send two alerts for the same thing.
I think we might be able to merge these, at least in some cases.