Double posts when we get short minute entries

This is a problem in CourtListener that we've never tried very hard to solve, but maybe we should.

In many cases, we get the short description of a minute entry and the long description. Since they have no identifier, we have no way of merging them properly:

And we wind up with two docket entries, as above. It's not great and, worse, it means we send two alerts for the same thing.

I think we might be able to merge these, at least in some cases.

This is going to be a tough one that's going to have to rely on heuristics, but we can get some easy wins, and at the end of writing this message, I realized how we can find a really great sample of this kind of data. Read on to learn more. :)

The way I began was by looking at a couple different cases. I think the general solution will be to merge docket entries as they're created, if:

1. They lack entry numbers.
2. The shorter description is fully contained in the longer one.
3. The date is the same.

That'd work for the example in my last message, above.

Here are some others that I pulled from Ashley v. U.S. Dept of Justice:

For some reason, .Order, with the dot in front, seems pretty common. I think they're using the dot to alphabetize "order" at the beginning of some internal list they have. So I think if we want to catch and merge these two entries, the above becomes:

Pre-process short descriptions by stripping any periods on the left, then merge, if:
1. They lack entry numbers.
2. The shorter description is fully contained in the longer one.
3. The date is the same.
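
A minimal sketch of these three rules, for illustration only. The `Entry` class is a hypothetical stand-in rather than the CourtListener model, and comparing case-insensitively is an assumption the rules above don't state.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Entry:
    # Hypothetical stand-in for a docket entry; not the real model.
    description: str
    date_filed: date
    entry_number: Optional[int] = None


def can_merge(a: Entry, b: Entry) -> bool:
    """Return True when two entries satisfy the three merge rules."""
    # Rule 1: both entries must lack entry numbers.
    if a.entry_number is not None or b.entry_number is not None:
        return False
    # Rule 3: the dates must be the same.
    if a.date_filed != b.date_filed:
        return False
    # Rule 2: the shorter description, with leading periods stripped,
    # must be fully contained in the longer one (case-insensitive here,
    # which is an assumption).
    short, long_ = sorted((a.description, b.description), key=len)
    short = short.lstrip(".").strip().lower()
    return bool(short) and short in long_.lower()
```

With made-up descriptions, `.Order` and `MINUTE ORDER granting the motion.` on the same date would merge, while the same pair on different dates would not.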

Cool. Let's look at another:

Great. The approach still works. Next:

This shows four entries on one day. The first and third merge according to the rules above, as do the second and fourth. Great. Next.

So, notice that they call it ~Util. I think they're again using punctuation to alphabetize. Nice. Unfortunately, we need a new rule for this since the above won't work, and rescheduling things is very common.

So...

Pre-process short descriptions by stripping any periods on the left, then
Split on uppercase AND (juriscraper's RSS parser puts that into the entry, actually), then merge, if:
1. They lack entry numbers.
2. The date is the same.
3. Any of the split parts or the shorter description is fully contained in the longer one.

This approach works for the next two duplicates too:

Great. Next:

Dang! "Status Hearing" and "Status Conference" are synonyms, so:

Pre-process short descriptions by stripping any periods on the left, then
Split on uppercase AND (juriscraper puts that into these entries, actually), then
Normalize terms like "status conference" and "status hearing", then merge, if:
1. They lack entry numbers.
2. The date is the same.
3. Any of the split parts or the shorter description is fully contained in the longer one.
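
Here is a rough sketch of the pre-processing and of rule 3 only (rules 1 and 2 still apply on top of it). The function names are hypothetical, the synonym table holds just the one pair mentioned above, and lowercasing before comparison is an assumption.

```python
import re

# Assumed synonym table; only this pair comes from the discussion above.
SYNONYMS = {"status hearing": "status conference"}


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and map synonyms to one term."""
    text = " ".join(text.lower().split())
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return text


def candidate_parts(short_description: str) -> list[str]:
    """Return the whole short description plus its parts split on 'AND'."""
    stripped = short_description.lstrip(".").strip()
    # The pattern is case-sensitive, so only an uppercase AND splits.
    parts = [p.strip() for p in re.split(r"\s+AND\s+", stripped)]
    candidates = [stripped] + parts
    # dict.fromkeys drops duplicates while keeping order.
    return list(dict.fromkeys(normalize(p) for p in candidates if p))


def description_matches(short_description: str, long_description: str) -> bool:
    """Rule 3: any candidate part is fully contained in the long description."""
    long_norm = normalize(long_description)
    return any(part in long_norm for part in candidate_parts(short_description))
```

For example, a made-up short description of `~Util - Set/Reset Hearings AND Status Conference` would match a long description that begins `Status Hearing set for ...`, because the second split part normalizes to the same term.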

Hm, the dates are different. I think this should be fixed via freelawproject/courtlistener#1282.

Here's another two duplicates:

Both should be fine with the approach so far. Next:

I don't know what happened in this one. It looks like we parsed the RSS twice and merged the contents in different order. There should only be one entry here, not three. Perhaps that's a separate bug.

That's it for the first case. Let's look at another and see if our heuristics hold up. Let's use U.S. v. Bankman-Fried. This case is a bit different because the data is mostly merged already anyway, but let's press on:

This one is interesting. It's a typo that the court fixed. I'm not sure we care, but we could try to do something about this, I guess, by comparing edit distance between numberless docket entries, and fixing them if they have very slight changes?
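
As a sketch of that idea, the standard library's `difflib.SequenceMatcher` gives a similarity ratio that can stand in for a true edit distance. The function name and the 0.95 threshold are assumptions that would need tuning against real data.

```python
from difflib import SequenceMatcher


def is_probable_typo_fix(old: str, new: str, threshold: float = 0.95) -> bool:
    """Guess whether `new` is a lightly corrected version of `old`."""
    if old == new:
        return False  # identical text is a duplicate, not a correction
    # ratio() is 2 * matches / total length, so it is a similarity score
    # between 0 and 1 rather than a Levenshtein distance.
    ratio = SequenceMatcher(None, old, new, autojunk=False).ratio()
    return ratio >= threshold
```

Keeping the threshold high fits the conservative approach: two numberless entries that differ by more than a character or two would be left alone.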

Anyway, here are four more that are properly merged:

I'm pretty sure this case has lots of nicely merged contents because somebody purchased the "Docket History Report," which has the short and long versions of entries together (just make sure to check "Display docket text"):

That report shows you something like this:

Not bad. In fact, I'm realizing we have a LOT of these kinds of reports that have been uploaded to RECAP, and we'd be silly not to use them as test cases. So I guess the above work I just did provides a lot of examples of how minute entries can get merged, and provides a prototype for doing it, but what we should do to fix this is download a couple hundred docket history reports we have saved, and use them to develop the algorithm properly.

That should be much more thorough than what I was doing above. GREAT.

One note on philosophy. As we're doing this, it's important not to have false positives, so conservatively merging is the way to go. We don't want to accidentally merge things we shouldn't, since that could prevent an alert from going out.

I think this issue is an interesting one because it could be the first time that ML is built into CourtListener. So, before we go down that road, there are some big questions that need answering. Among others:

Is ML going to be the right solution here? Can we do as well or better with an old-fashioned approach? Given the numerous questions below, I want to be convinced that ML is better before we invest so much in it.
If ML is the solution, which ML package should we choose as our standard for CourtListener? I think we've played with PyTorch, Apple's thing, scikit-learn, SciPy, etc. We should choose carefully and standardize on one. Bill and Kevin have been using some of these, so we should get their opinions and experience before making a selection.
Once we've selected a library we like, we need to figure out the best practice for building and shipping models:
- How do we document our work so we can rebuild a model later with more data or the latest algorithms?
- How do we ship a model? Do we put it in Git? Do we put it in S3? A one-time creation step that lives on our disk? How big is a model in terms of bytes?
- How do we load the model efficiently? If it's several MB and it lives on disk, presumably it needs to go into RAM before we can use it. It'd be nice not to load big files into RAM every time we start the Python shell.
Do we need a GPU for any of this? It's not crazy to have GPUs around (it adds cost), but if we do, we need to figure out how to make sure our workloads use them.
Maybe it makes more sense to do this stuff with an AWS Lambda? I think you can get a GPU in AWS Lambda pretty easily. I don't know.
What about tests? How do we test that the model is working properly and is actually getting better over time?

I think ML has a ton of potential, but the questions above exemplify why I haven't jumped at using it in CourtListener. Before we use it, we have a lot of questions to answer.

Great, I'll review the questions above and check the provided examples so we can determine whether ML is in fact a viable solution, weighing its upsides and downsides, or whether a traditional approach would be more appropriate.

I've been working on trying to meet the objective using a simple heuristic without using ML, so we can evaluate if it's viable to avoid the complexity of implementing ML.

These are the results so far:

I gathered docket history reports that include both short and long descriptions so I can use them to evaluate the heuristic's performance. (I noticed that most of the stored history reports don't contain long descriptions.)

The heuristic used does the following:

Get minute entries from the same date.
Clean the short description by removing characters like: ".", "<", "*", "/", "~", "(", ")"
Split the short description into a list of words.
Remove common words like: "of", "to", "on"
For each docket entry on the same date, do the following:
- Count the number of unique words from the short description that are contained in the long description.
- If the long description contains at least 50% of the short description's words, it's a match.
- If the short description contains an "AND", the required percentage is reduced by a further 50%, since it is really two different short descriptions joined together.
- After finding matches for every long description, an additional check confirms that each entry from that date was matched by only one long description. If so, the "merge" is considered True.
- If a docket entry was matched by more than one long description, the "merge" is considered False.
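
The steps above could be sketched roughly as follows. This is an approximation rather than the code behind the reported numbers: the function names are hypothetical, the special characters are replaced with spaces instead of deleted (an assumption, so that "Set/Reset" yields two words), and the "AND" case is read as halving the threshold to 25%.

```python
import re
from typing import Optional

# Characters to strip from short descriptions, replaced by spaces (assumption).
STRIP_TABLE = str.maketrans({c: " " for c in ".<*/~()"})
STOP_WORDS = {"of", "to", "on"}


def short_tokens(short_description: str) -> set[str]:
    """Clean and split a short description, dropping common words."""
    words = short_description.translate(STRIP_TABLE).lower().split()
    return {w for w in words if w not in STOP_WORDS}


def is_match(short_description: str, long_description: str) -> bool:
    """True if enough short-description words appear in the long one."""
    tokens = short_tokens(short_description)
    if not tokens:
        return False
    long_words = set(re.findall(r"[a-z0-9']+", long_description.lower()))
    # Assumed reading: an "AND" halves the 50% threshold to 25%.
    threshold = 0.25 if " AND " in short_description else 0.5
    return len(tokens & long_words) / len(tokens) >= threshold


def find_merges(shorts: list[str], longs: list[str]) -> dict[str, Optional[str]]:
    """Map each same-day short description to its single match, else None."""
    merges: dict[str, Optional[str]] = {}
    for short in shorts:
        matched = [long_ for long_ in longs if is_match(short, long_)]
        # Merge only when exactly one long description matched.
        merges[short] = matched[0] if len(matched) == 1 else None
    return merges
```

Returning `None` for anything other than exactly one match is what keeps the sketch conservative: an ambiguous entry is left unmerged rather than guessed at.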

Some explanations of the previous statements:

Removing characters like ".", "<", "*", "/", "~", "(", ")" lets us treat as equal the words that have one of these characters attached, like: .Order, <<<Motion, Set/Reset, (Count reporter), ~Util
The 50% threshold was needed because I found that short descriptions are not always fully contained in the long ones. Requiring full containment raises the global failure rate to 46%, and a lower threshold (1/3) raises it to 25.87%, versus the 22.56% reported below.
Words like "of", "to", and "on" are very common in both short and long descriptions, so they are removed beforehand to keep them from skewing the percentage of contained words. More work to detect other common words might be required here.
Entries that match more than one long description are counted as merge failures in order to avoid false positives: if we can't tell with certainty which entry a long description should be merged into, it's better not to merge.

This report shows all the entries that couldn't be "merged": failed_entries.txt

From the history reports I only considered minute entries.

These are the global results from my tests: of 514 minute entries, 116 have conflicts that prevent them from being "merged". That represents a failure rate of 22.56%.

Total minute entries: 514
Total minute entries with conflict: 116
Global percentage of conflicts: 22.56809338521401 %

Here are the results broken down by docket:

Docket number: 1:21-cr-00433 Total minute entries: 40 Minute entries merge conflict: 9 Percentage of conflicts: 22.5%

Docket number: 1:18-cr-00538 Total minute entries: 109 Minute entries merge conflict: 14 Percentage of conflicts: 12.844036697247706 %

Docket number: 0:17-md-02795 Total minute entries: 47 Minute entries merge conflict: 10 Percentage of conflicts: 21.27659574468085 %

Docket number: 5:22-cv-00699 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 2:22-cr-20019 Total minute entries: 2 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 3:21-cv-15310 Total minute entries: 13 Minute entries merge conflict: 2 Percentage of conflicts: 15.384615384615385 %

Docket number: 1:19-cr-00395 Total minute entries: 71 Minute entries merge conflict: 22 Percentage of conflicts: 30.985915492957748 %

Docket number: 3:13-cv-00808 Total minute entries: 82 Minute entries merge conflict: 26 Percentage of conflicts: 31.70731707317073 %

Docket number: 5:23-cv-00160 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 1:23-cv-00492 Total minute entries: 3 Minute entries merge conflict: 1 Percentage of conflicts: 33.333333333333336 %

Docket number: 1:22-cv-03203 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 0:18-cv-01776 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 1:21-cv-02923 Total minute entries: 22 Minute entries merge conflict: 1 Percentage of conflicts: 4.545454545454546 %

Docket number: 1:19-cv-00725 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 3:20-cv-07811 Total minute entries: 7 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 1:22-cv-01602 Total minute entries: 9 Minute entries merge conflict: 2 Percentage of conflicts: 22.22222222222222 %

Docket number: 1:23-mj-02068 Total minute entries: 1 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 1:23-cv-00287 Total minute entries: 6 Minute entries merge conflict: 5 Percentage of conflicts: 83.33333333333333 %

Docket number: 4:22-mj-00469 Total minute entries: 1 Minute entries merge conflict: 1 Percentage of conflicts: 100.0 %

Docket number: 1:21-cr-00399 Total minute entries: 49 Minute entries merge conflict: 9 Percentage of conflicts: 18.367346938775512 %

Docket number: 1:22-cr-00673 Total minute entries: 18 Minute entries merge conflict: 5 Percentage of conflicts: 27.77777777777778 %

Docket number: 5:22-cv-00752 Total minute entries: 3 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Docket number: 1:20-cv-10821 Total minute entries: 20 Minute entries merge conflict: 9 Percentage of conflicts: 45.0 %

Docket number: 5:22-cv-05137 Total minute entries: 2 Minute entries merge conflict: 0 Percentage of conflicts: 0.0 %

Here are some common cases that I detected as conflicts:

The short description of this one, "Motion to Appoint Counsel", matched two long descriptions. "ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)" should be the only match, but it also matched another long description from the same date that contains the same words as the short description: "Oral Motion by Defendant to Appoint Counsel."

{
 "docket_number": "1:21-cr-00433",
 "court_id": "dcd",
 "date_filed": "2021-08-13",
 "description":"ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
 "document_number":"None",
 "pacer_doc_id":"None",
 "short_description":"Motion to Appoint Counsel",
 "id":2,
 "matched_descriptions":[
    "ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
    "Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)"
 ],
 "merge":false
}

Similar to the example above: here the short description is "Order on Motion to Appoint Counsel", and the words "Motion to Appoint Counsel" are contained in both long descriptions.


{
"docket_number": "1:21-cr-00433",
"court_id": "dcd",
"date_filed": "2021-08-13",
"description":"Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)",
"document_number":"None",
"pacer_doc_id":"None",
"short_description":"Order on Motion to Appoint Counsel",
"id":4,
"matched_descriptions":[
  "ORAL MOTION by Defendant GARY JAMES HARMON to Appoint Counsel. (kk)",
  "Minute Entry for Initial Appearance and Arraignment as to GARY JAMES HARMON held by video before Magistrate Judge Robin M. Meriweather on 8/13/2021 : The defendant agrees to proceed by video for today's hearing. The defendant expressed interest in having his future hearings conducted in-person. Oral Motion by Defendant to Appoint Counsel. The defendant's assets have been frozen. The Court finds that the defendant is eligible for court-appointed counsel and appoints Assistant Federal Public Defender Sabrina Shroff to represent GARY JAMES HARMON. Plea of Not Guilty entered by GARY JAMES HARMON to Counts 1-8, 9 and 10. The Court advised the Government of its due process obligations under Rule 5(f). Status Hearing set before Chief Judge Beryl A. Howell on 8/19/2021 at 1:00 PM by telephonic/VTC. Oral Motion by USA for Temporary Detention (3-day hold request) of the defendant, heard and granted. The detention hearing will be scheduled after defense counsel notifies chambers of her available dates for the hearing. Bond Status of Defendant: Defendant remains committed. Court Reporter: FTR Gold - Ctrm. 7; FTR Time Frame: 2:27:28 - 3:12:54. Defense Attorney: Sabrina Shroff; DOJ Attorney: Alden Pelker standing in for Christopher Brown; Pretrial Officer: Christine Schuck. (kk)"
],
"merge":false
}


As with the previous one, this short description has two matches. This case is more complicated because the short description is a single word, "Order", which is contained in both long descriptions.

{ "docket_number": "1:21-cr-00433", "court_id": "dcd", "date_filed": "2021-08-19", "description":"MINUTE ORDER (paperless), as to GARY JAMES HARMON, ISSUING the following SCHEDULING ORDER: (1) by August 24, 2021, the government shall file its motion for pretrial detention; (2) defendant shall file his opposition by August 30, 2021; (3) the government shall file any reply by September 1, 2021; and (4) the parties shall appear via videoconference at 9:30am on September 9, 2021, for a hearing on the government's motion for pretrial detention. Signed by Chief Judge Beryl A. Howell on August 19, 2021. (lcbah2)", "document_number":"None", "pacer_doc_id":"None", "short_description":"Order", "id":7, "matched_descriptions":[ "Minute Entry for proceedings held before Chief Judge Beryl A. Howell: Status Conference as to GARY JAMES HARMON held via videoconference on 8/19/2021; the Defendant agreed to participate via videoconference after consultation with counsel. A Bond Hearing is scheduled for 9/9/2021, at 9:30 AM before Chief Judge Beryl A. Howell; a briefing scheduling order will be issued by the Court. A further Status Hearing is scheduled for 9/23/2021, at 9:00 AM before Chief Judge Beryl A. Howell. The Defendant agreed to exclude time under the Speedy Trial Act until the next status hearing of 9/23/2021. The Court found that time under the Speedy Trial Act shall be excluded from 8/19/2021 through 9/23/2021, in the interests of justice and those interests outweigh the interests of the public and the defendant in a speedy trial in order to give the parties time to discuss a protective order, give the government time for production of discovery, and the Defendant and his counsel time to review and discuss the discovery. Bond Status of Defendant: Defendant committed. Present via videoconference: Defense Attorney: Sabrina P. Shroff; US Attorneys: Christopher B. Brown and Catherine Pelker. Court Reporter: Elizabeth Saint-Loth. 
(ztg)", "MINUTE ORDER (paperless), as to GARY JAMES HARMON, ISSUING the following SCHEDULING ORDER: (1) by August 24, 2021, the government shall file its motion for pretrial detention; (2) defendant shall file his opposition by August 30, 2021; (3) the government shall file any reply by September 1, 2021; and (4) the parties shall appear via videoconference at 9:30am on September 9, 2021, for a hearing on the government's motion for pretrial detention. Signed by Chief Judge Beryl A. Howell on August 19, 2021. (lcbah2)" ], "merge":false }


In this case the problem is that the target short description is "Scheduling Order". With a 50% threshold, only one of its words needs to appear in the long description, and "Order" appears in both of them:
"ORDER granting... " and "SCHEDULING ORDER..."
I was wondering whether taking word order into account could resolve conflicts like these, but it could also break other matches where the words of the short description don't appear in the same order within the long description.

{
"docket_number": "1:18-cr-00538", "court_id": "nyed", "date_filed": "2022-04-28", "description":"SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022 at 10:00 AM in Courtroom 6F North before Chief Judge Margo K. Brodie..Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)", "document_number":"None", "pacer_doc_id":"None", "short_description":"Scheduling Order", "id":9, "matched_descriptions":[ "ORDER granting [203] Consent MOTION for Extension of Time to File Post-Trial Motions. Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)", "SCHEDULING ORDER as to Ng Chong Hwa. Sentencing set for 9/13/2022 at 10:00 AM in Courtroom 6F North before Chief Judge Margo K. Brodie..Ordered by Chief Judge Margo K. Brodie on 4/28/2022. (Valentin, Winnethka)" ], "merge":false }


This is also a clear conflict, since both matched long descriptions contain the short description "Status Conference".

{ "docket_number": "1:18-cr-00538", "court_id": "nyed", "date_filed": "2022-02-25", "description":"MINUTE ENTRY: Status conference by telephone was held before Chief Judge Margo K. Brodie on February 25, 2022. AUSAs Drew Rolle, Alixandra Smith, and Dylan Stern, and Brent Wible and Jennifer Ambuehl of the Department of Justice, appeared on behalf of the government. Retained counsel Marc Agnifilo, Teny Geragos, Jacob Kaplan, and Zach Intrater appeared on behalf of Defendant Ng Chong Hwa. Henry Mazurek and Ilana Haramati appeared on behalf of Mr. Lessiner. The jury trial will continue on March 1, 2022, at 9:30 AM in Courtroom 4F North before Chief Judge Margo K. Brodie. The Court scheduled a status conference for February 28, 2022, at 10:00 AM. The parties will notify the Court if a status conference is needed by February 27, 2022. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. (Court Reporter Anthony Frisilone.) (Valentin, Winnethka)", "document_number":"None", "pacer_doc_id":"None", "short_description":"Status Conference", "id":39, "matched_descriptions":[ "SCHEDULING ORDER as to Ng Chong Hwa. The Court scheduled a status conference for February 25, 2022, at 3:00 PM. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. 
Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. Ordered by Chief Judge Margo K. Brodie on 2/25/2022. (Valentin, Winnethka)", "MINUTE ENTRY: Status conference by telephone was held before Chief Judge Margo K. Brodie on February 25, 2022. AUSAs Drew Rolle, Alixandra Smith, and Dylan Stern, and Brent Wible and Jennifer Ambuehl of the Department of Justice, appeared on behalf of the government. Retained counsel Marc Agnifilo, Teny Geragos, Jacob Kaplan, and Zach Intrater appeared on behalf of Defendant Ng Chong Hwa. Henry Mazurek and Ilana Haramati appeared on behalf of Mr. Lessiner. The jury trial will continue on March 1, 2022, at 9:30 AM in Courtroom 4F North before Chief Judge Margo K. Brodie. The Court scheduled a status conference for February 28, 2022, at 10:00 AM. The parties will notify the Court if a status conference is needed by February 27, 2022. The call-in information for the telephone conference is 1-888-684-8852 and the access code is 9801036. If any party has any difficulty accessing the telephone, please call chambers at (718) 613-2140. Persons granted remote access to proceedings are reminded of the general prohibition against photographing, recording, and rebroadcasting of court proceedings. Violation of these prohibitions may result in sanctions, including removal of court-issued media credentials, restricted entry to future hearings, denial of entry to future hearings, or any other sanctions deemed necessary by the Court. (Court Reporter Anthony Frisilone.) (Valentin, Winnethka)" ], "merge":false }



There are more similar examples where one of the docket entries from the same date can't be properly matched because more than one long description matched. I think these problems might also be present if we use an ML approach, don't you think?

The match failure rate is still far from the desired 10%. However, most of the conflicts involve docket entries that are matched by two or more long descriptions. I think that in production we could add a check: once a docket entry has been merged with its long description, we know that long description shouldn't match any other short description. But this will depend on the order in which the long descriptions come in, or on whether they come in together.

Let me know what you think.

Well, this is a tricky problem to be sure. I looked at a bunch of them and tried to come up with some thoughts.

A few observations:

It seems like a lot of cases might be fixed if you compare the percentages matched. For example, imagine if two entries match a short description on the same day, and one matches 100% of tokens while the other matches 50%. In this case, I think the better match is probably the one to merge.
There are two kinds of errors when merging:
- First: Merges that are wrong in meaning. For example, if you merge a "scheduling order" with a non-scheduling order, that's bad.
- Second: Merges that are just technically wrong. For example, if you have two long descriptions that are orders, and you merge a short description of "Order" with the wrong one. It doesn't really matter. You merged "Order" with an order. That's OK.
Some can't be merged correctly even by a human.
Some words are more important than others, and there could be opportunities for synonyms. For example, Minute Entry == Order. "Motion" is an important word when matching.
I agree that word order could be helpful with a lot of these, but perhaps not all.
In many cases, it's very clear to me, as a human, which should get merged because I can understand the words and know which have a high likelihood of changing the meaning of the short or long description. This hints that ML might work well.
I think your current method of looking for 50% seems crude since some words have more value than others.
Sometimes it's helpful to have a rule that takes care of the easy cases and another to deal with the rest.
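
To illustrate the first observation (comparing the percentages matched), here is a small sketch that picks the best-scoring long description and refuses to choose on a tie. The function names and the 0.5 floor are assumptions, not existing CourtListener code.

```python
import re
from typing import Optional


def overlap(short_description: str, long_description: str) -> float:
    """Fraction of short-description words found in the long description."""
    tokens = set(re.findall(r"[a-z0-9]+", short_description.lower()))
    long_words = set(re.findall(r"[a-z0-9]+", long_description.lower()))
    return len(tokens & long_words) / len(tokens) if tokens else 0.0


def best_match(
    short_description: str, candidates: list[str], min_score: float = 0.5
) -> Optional[str]:
    """Return the clearly best candidate, or None when there isn't one."""
    scored = sorted(
        ((overlap(short_description, c), c) for c in candidates), reverse=True
    )
    if not scored or scored[0][0] < min_score:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # tie: stay conservative and don't merge
    return scored[0][1]
```

On the "Scheduling Order" conflict above, the "SCHEDULING ORDER..." description scores 100% and the "ORDER granting..." description scores 50%, so the former wins. A bare "Order" against two orders ties and stays unmerged.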

For your stats can you clarify what the 22.56% refers to? Is that how many will fail globally or how many of the duplicates will fail?

Thank you for your comments and suggestions. Some additional comments and ideas:

It seems like a lot of cases might be fixed if you compare the percentages matched. For example, imagine if two entries match a short description on the same day, and one matches 100% of tokens while the other matches 50%. In this case, I think the better match is probably the one to merge.

Yeah, I was thinking something like this. However, for it to work, we would need to have the two or more long descriptions available to merge at the same time. This way, we can compare which one has the highest percentage. This might not work if, for example, we have two short descriptions for a date, and then a long description comes, for instance, via recap.email, which only contains one entry. In that case, we'll only have one long description to compare with the two possible short descriptions to merge in. I think this problem would be less common via the recap extension since they usually have many docket entries that could include all the entries available for a date (but it's still possible).

There are two kinds of errors when merging:

First: Merges that are wrong in meaning. For example, if you merge a "scheduling order" with a non-scheduling order, that's bad.

Got it. I think using the percentage approach when all the long description entries are available to merge simultaneously might help solve this kind of wrong merge. Alternatively, rating words according to their importance could also work here. For example, in this case, "scheduling" could have a higher weight than "order," so the entry that contains "scheduling" would be merged.

Second: Merges that are just technically wrong. For example, if you have two long descriptions that are orders, and you merge a short description of "Order" with the wrong one. It doesn't really matter. You merged "Order" with an order. That's OK.

Yeah, this is correct. If there are two "Order" sections that are short, it doesn't matter which one is merged first. In the end, they will belong on the same date.

Some words are more important than others, and there could be opportunities for synonyms. For example, Minute Entry == Order. "Motion" is an important word when matching.

In order to check which words are more important than others, I think we can extract the most common words from minute entries and set a weight to these words. What do you think?
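
One possible way to derive such weights, sketched under assumptions: this uses a smoothed inverse document frequency, so words that appear in many descriptions (like "order") get low weights and rarer ones (like "scheduling") get high weights. That is one reading of "importance", and all names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def word_weights(descriptions: list[str]) -> dict[str, float]:
    """Weight each word by how rare it is across the sample descriptions."""
    n = len(descriptions)
    doc_freq = Counter(w for d in descriptions for w in tokenize(d))
    # Smoothed IDF: a word found in every description gets weight 1.0.
    return {w: math.log((1 + n) / (1 + c)) + 1 for w, c in doc_freq.items()}


def weighted_overlap(
    short_description: str, long_description: str, weights: dict[str, float]
) -> float:
    """Weighted share of short-description words found in the long one."""
    unseen = max(weights.values(), default=1.0)  # treat unknown words as rare
    long_words = tokenize(long_description)
    total = hit = 0.0
    for word in sorted(tokenize(short_description)):
        weight = weights.get(word, unseen)
        total += weight
        if word in long_words:
            hit += weight
    return hit / total if total else 0.0
```

With weights like these, a long description containing only "order" no longer reaches 50% of a "Scheduling Order" short description, because "scheduling" carries most of the weight.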

In many cases, it's very clear to me, as a human, which should get merged because I can understand the words and know which have a high likelihood of changing the meaning of the short or long description. This hints that ML might work well.

Does this mean we should do a test using ML so we could compare how well it performs and compare with heuristics? Or first, continue working on the heuristic?

Sometimes it's helpful to have a rule that takes care of the easy cases and another to deal with the rest.

Sure, I'll keep this in mind so we can use a rule for the simple ones and a different approach for edge cases.

For your stats can you clarify what the 22.56% refers to? Is that how many will fail globally or how many of the duplicates will fail?

Yeah, this percentage refers to how many minute entries (according to my sample) will fail to be merged globally, including both duplicate and non-duplicate minute entries.

Finally, I was thinking that in order to get a better sample, could we find more of the docket history reports that we have stored? I only used 25 dockets because it was difficult to find them manually, since most of them don't contain long descriptions. I could write a command that looks for docket history reports with long descriptions and returns the paths to download them, so that we could have a bigger sample (maybe 200 docket reports).

A new approach

I was able to bend JHawk's ear a bit about this issue in the #pacer channel today. He had some very good ideas, but the general concept is that we have another signal we can use to sort this out, the pacer_sequence_number. We get and store these for all RSS entries and we get them for any item with a document link on the docket sheet.

So what we can do is:

Merge the simple cases on a given day. That gets us 78% of the solution.
Then, for duplicates on a given day, we order the short descriptions by pacer_sequence_number, and we detect the order of the long descriptions (we already have code for this). Then, we merge.

I think that should work?

A few other thoughts from Slack:

Unfortunately, we don't get pacer_sequence_numbers for numberless entries, but we do get them for the items before and after a numberless entry. These could be useful for bounds checking, but I think we don't need that if we're using dates anyway.
The one issue that JHawk pointed out for this is that the pacer_sequence_number is created when the item is Entered into PACER, not when it is Filed by the attorney. Our docket sheets are usually ordered by date_filed, not date_entered, so there could be some difference here.

Luckily, for numberless entries like these, which are created by the court, the filed and entered should be the same.

A few follow-ups

I could write a command that looks for docket history reports with long descriptions and returns the paths to download them

If this is still useful, want to just give me some code and I can run it?

Does this mean we should do a test using ML?

Sounds like we won't need it!

A couple other thoughts...

Docket History reports are always complete, so when we get one of those, we could use it to clean up the docket and merge things properly. If we do that, we can be sure that Docket History reports never create duplicates, which is at least one thing fixed. I think this would be tricky to get right, so we probably shouldn't do it, but it's a thought I had.

Weird

It's weird that this fits the 80/20 rule almost exactly. We get 78% easily and 22% is going to be a pain. Weird.

@mlissner thanks, I'll be reviewing this new approach and doing some testing. I'm just afraid that for appellate RSS feeds we don't have the pacer_sequence_number, so we might need to still use some heuristics.

Yeah, having a bigger sample of history reports will be helpful. Here is some code to get only DocketHistory reports with long descriptions.

from juriscraper.pacer import DocketHistoryReport
from cl.recap.models import PacerHtmlFiles, UPLOAD_TYPE

# Newest docket history report uploads first.
history_reports = PacerHtmlFiles.objects.filter(
    upload_type=UPLOAD_TYPE.DOCKET_HISTORY_REPORT
).order_by("-date_created")
long_des_hr = []
for hr in history_reports.iterator():
    text = hr.filepath.read().decode()
    report = DocketHistoryReport("default")
    report._parse_text(text)
    docket_entries = report.data["docket_entries"]
    # Only inspect the first few entries of each report.
    for i, de in enumerate(docket_entries):
        if de["short_description"] and de["description"]:
            long_des_hr.append(hr)
            break
        if i > 2:
            break

    # Stop once the sample holds 200 reports.
    if len(long_des_hr) >= 200:
        break

print("Docket history reports with long descriptions:")
for report in long_des_hr:
    print(f"ID: {report.pk} URL: {report.filepath}")

It works by looking for DOCKET_HISTORY_REPORT PacerHtmlFiles files and checking whether their docket entries contain both short and long descriptions. We only check the first few entries of each file, in case the first one is not representative.

The script will end once we have 200 DocketHistory reports with long descriptions, so it might take a while.

I'm running the script now. Good point about appellate, but we can worry about that next. Let's get district taken care of and then deal with corner cases in a second pass.

Here are the results, collapsed to keep our conversation smaller:


``` ID: 6031374 URL: recap-data/2023/03/23/24861038e96a4883b313780e45473831.html ID: 6030471 URL: recap-data/2023/03/23/22f440ac6f274b3c93b967bfd949d070.html ID: 6030466 URL: recap-data/2023/03/23/61b5fbe2b3754969921806f482279da0.html ID: 6030460 URL: recap-data/2023/03/23/bf29cd941e1a456887e2b7b749dc2468.html ID: 6029952 URL: recap-data/2023/03/23/45b93595b08947c3a56f9c4b579f181e.html ID: 6029943 URL: recap-data/2023/03/23/2ca935d2ccfc490abcaed60d330e50a1.html ID: 6029626 URL: recap-data/2023/03/23/debabbcf8e044c4082e6cc4e2b2bf6ec.html ID: 6029625 URL: recap-data/2023/03/23/eeee623c65254baab67963362fc325a6.html ID: 6029568 URL: recap-data/2023/03/23/4d3d1230a76b44b380a765277765843b.html ID: 6028901 URL: recap-data/2023/03/22/6671087d6e5e43728f1e3d1461451047.html ID: 6028899 URL: recap-data/2023/03/22/0dd4306e5c2344eebdb43070c6da6e36.html ID: 6028779 URL: recap-data/2023/03/22/e49152f498a34cb989fc5cab89297974.html ID: 6027843 URL: recap-data/2023/03/22/a627919961044b9fab8481776ddd056a.html ID: 6027790 URL: recap-data/2023/03/22/f87682f0aaa54c0495832231cedf6b06.html ID: 6026180 URL: recap-data/2023/03/22/b04f8570753b498bb570190ff1ebb55e.html ID: 6025381 URL: recap-data/2023/03/22/00d873d4638644f19915141568c5b44a.html ID: 6025340 URL: recap-data/2023/03/22/a5b65b06d8464908b8ae6ba6fd6df66f.html ID: 6025339 URL: recap-data/2023/03/22/a5cdaef8a5dc4e60abe3d23c100a104d.html ID: 6025326 URL: recap-data/2023/03/22/6f76fb7668b0416d8e4b19942d874b57.html ID: 6025324 URL: recap-data/2023/03/22/931c6e8e3a1d47a2986d436b949efe5f.html ID: 6025320 URL: recap-data/2023/03/22/89b5fdba77584d3dad1c8ff028e52ff9.html ID: 6025303 URL: recap-data/2023/03/22/a9e6984125d1494f9c9d29814111fe31.html ID: 6025298 URL: recap-data/2023/03/22/e5bc20ceb960464b9c653626e19f8fe8.html ID: 6025233 URL: recap-data/2023/03/22/d3949d9199d84fd58f9f36379e337b1a.html ID: 6025222 URL: recap-data/2023/03/22/0a0a844ffade48ad9e29c1fe607d2e22.html ID: 6025143 URL: 
recap-data/2023/03/22/1d6949f2b5684f20b6a75646aae9050a.html ID: 6025101 URL: recap-data/2023/03/22/69c270e5e57948708c0c85f5e5fb2cd5.html ID: 6024532 URL: recap-data/2023/03/21/a7f823ef2c6f42f8b7817866ab090e0e.html ID: 6023611 URL: recap-data/2023/03/21/2ccc1d937404427386a4b13053af0f3e.html ID: 6023593 URL: recap-data/2023/03/21/00c186dba2f645a2b05206ceeb724f26.html ID: 6022645 URL: recap-data/2023/03/21/ce33ae862116448bbb9533b36fa05c95.html ID: 6022640 URL: recap-data/2023/03/21/e2a72ce0a1cf42b2b7cf399f4799cfc9.html ID: 6022420 URL: recap-data/2023/03/21/df072a4a8cae4e958759e14835c09f8e.html ID: 6021628 URL: recap-data/2023/03/21/c86e0afe429c4d3abfe9706843f6a6a0.html ID: 6021372 URL: recap-data/2023/03/21/a76e82af058f4277b7bd58aa14c3e50d.html ID: 6020809 URL: recap-data/2023/03/21/04afc3724f8b4d5a8ed7a22943511511.html ID: 6020398 URL: recap-data/2023/03/21/6b6642d07cf8477e843842046030b8bd.html ID: 6020360 URL: recap-data/2023/03/21/0f771cd8bee34fd49ca367db98202700.html ID: 6020251 URL: recap-data/2023/03/21/b7090ea05cbf421fb997bd399bbfd04b.html ID: 6020247 URL: recap-data/2023/03/21/68c95d7c01fd4cff9f3d3c51ed20a019.html ID: 6019262 URL: recap-data/2023/03/20/8843deaf8deb414fb683c39f9dbd5fae.html ID: 6019076 URL: recap-data/2023/03/20/4dcc2bb988ae4a9b8c11869ae245d33e.html ID: 6018981 URL: recap-data/2023/03/20/6b36f94d212243dca68581ced7755967.html ID: 6018957 URL: recap-data/2023/03/20/28a2794ae45e4ce19d344d7f8810ce9a.html ID: 6018659 URL: recap-data/2023/03/20/b17ea0a6e5fa4e3abd5e9e5c55ac1aa2.html ID: 6018621 URL: recap-data/2023/03/20/1e5d6063b6964500a3a73f5a2bb994c2.html ID: 6018572 URL: recap-data/2023/03/20/263748ccb8b14e91bc35aa21580c31ca.html ID: 6018551 URL: recap-data/2023/03/20/0f9a32d390ba440e96eeafb77529ca63.html ID: 6018064 URL: recap-data/2023/03/20/13a334e0308e41a2a8792b4e0878c613.html ID: 6017879 URL: recap-data/2023/03/20/1d958a74a02c422091056abbe902c2c6.html ID: 6017159 URL: recap-data/2023/03/20/de32d42869e04181808bbe5b4110edff.html ID: 6016577 
URL: recap-data/2023/03/20/08595147d3dc4a8fa2b006f1d30704a4.html ID: 6016118 URL: recap-data/2023/03/20/b04d089f43934e258a866fa167be9442.html ID: 6014494 URL: recap-data/2023/03/19/7dec7132391a4cb19c7f174fcfea06e0.html ID: 6014490 URL: recap-data/2023/03/19/1fecf3444bb64983a3cd4f4f13ef773f.html ID: 6014489 URL: recap-data/2023/03/19/2e2bfe3cd6664182bbc111d275b4ba0e.html ID: 6014467 URL: recap-data/2023/03/19/4dd189b1e4254c57aa6ef5c7ddbe90ba.html ID: 6014461 URL: recap-data/2023/03/19/b6e5006052c447de9b4b8c3fb4ea74de.html ID: 6014408 URL: recap-data/2023/03/19/085f68749d04409096dd90f0f9e39170.html ID: 6013622 URL: recap-data/2023/03/18/50536968edb34acd909a8700fd58e45c.html ID: 6013310 URL: recap-data/2023/03/18/1791a2ee87dd4f36a283d2a3be285134.html ID: 6013307 URL: recap-data/2023/03/18/401eab6844934b30a9ac55e21fa8d1c0.html ID: 6013303 URL: recap-data/2023/03/18/c7e328332f7c40359b21c0b7ced0a119.html ID: 6013256 URL: recap-data/2023/03/18/02bbb6f9358c4b9fb9901398adcb19b7.html ID: 6013252 URL: recap-data/2023/03/18/9ddcc137dc4c4ea79627be8309f8f822.html ID: 6013134 URL: recap-data/2023/03/18/89c4b792ff024a58a8497a19c483008b.html ID: 6013122 URL: recap-data/2023/03/18/dd63be919be9437c867316aa88928c45.html ID: 6013113 URL: recap-data/2023/03/18/7b7d2fa49bae41959dd1805f8bd111d6.html ID: 6013110 URL: recap-data/2023/03/18/ae4efe9b3e5349378f4489ebe7ab8c72.html ID: 6013105 URL: recap-data/2023/03/18/bf04c929f6134b538caffe7b8b3f46c7.html ID: 6013104 URL: recap-data/2023/03/18/e91d669d6c354b01a1f707cb90c24ae1.html ID: 6013103 URL: recap-data/2023/03/18/3e43583be0ce4c81a8b32698143d571b.html ID: 6013101 URL: recap-data/2023/03/18/7a21e2f7adf244a185d5280e12b72764.html ID: 6013100 URL: recap-data/2023/03/18/c5f65a9bd9014d8196189335b1408641.html ID: 6013096 URL: recap-data/2023/03/18/ccc34149c283493095c9508a1d09f1fd.html ID: 6013094 URL: recap-data/2023/03/18/5efcacfb42834f92b3d63b8257e97f71.html ID: 6013056 URL: recap-data/2023/03/18/0e37d4981c5e481282bb27ffad4fa4c8.html ID: 
6013053 URL: recap-data/2023/03/18/02fcbaf9502a4e7cbc26459cf2b0471b.html ID: 6013052 URL: recap-data/2023/03/18/d93106a898464f9c9ddcae0635bdddd0.html ID: 6013049 URL: recap-data/2023/03/18/e9d1ab842bae4a5eba0314262148d84e.html ID: 6013044 URL: recap-data/2023/03/18/a259a9db4c614fa7b82b6a9efc915b90.html ID: 6013039 URL: recap-data/2023/03/18/c089a5cafbe34f04bff1c9c1d35d9d7a.html ID: 6013004 URL: recap-data/2023/03/18/169b2b41f975451fb0933aa866779cc3.html ID: 6012996 URL: recap-data/2023/03/18/2eecbd660d3448959e2b1b477e418a0d.html ID: 6012994 URL: recap-data/2023/03/18/386bee582c43472abd31dc3bb616c96d.html ID: 6012993 URL: recap-data/2023/03/18/e29fb9620ce64cdda6f515aa34db8ca4.html ID: 6012982 URL: recap-data/2023/03/18/90fff2036c5f43fc816cd7bcd26a1794.html ID: 6012979 URL: recap-data/2023/03/18/15f3c96b1e6043b69d6e7db94c4add00.html ID: 6012917 URL: recap-data/2023/03/18/590a0dd58e8249e2975a695b07526832.html ID: 6012911 URL: recap-data/2023/03/18/ebe3ff4b3667493b901fb5f181cd7364.html ID: 6012910 URL: recap-data/2023/03/18/a456e3deb32644cdb689d6f1fac9158c.html ID: 6012903 URL: recap-data/2023/03/18/63a98f2c77ae4a918d00bc942db7b06c.html ID: 6012876 URL: recap-data/2023/03/18/d25383ccc31143399000ca10f0c7ca57.html ID: 6012874 URL: recap-data/2023/03/18/03dcc7c6c5f54078bf0c8045d7f627eb.html ID: 6012392 URL: recap-data/2023/03/17/e7b362e19af8479089243b53866a05c9.html ID: 6011544 URL: recap-data/2023/03/17/49dcd29d40b24e61919c3d7208910611.html ID: 6011005 URL: recap-data/2023/03/17/00329f046ac14d4eb14442e927878dc6.html ID: 6009743 URL: recap-data/2023/03/17/7c566d773df74ad481286c6371cb7eff.html ID: 6009546 URL: recap-data/2023/03/17/1ac5b05dc8d44aa19a3b85dd44358f62.html ID: 6009468 URL: recap-data/2023/03/17/5aa8de7225e64b17ace6ef0642aa440b.html ID: 6009467 URL: recap-data/2023/03/17/7099fc1a853f492384367b20eac72982.html ID: 6007820 URL: recap-data/2023/03/16/a9d48ee00e7c4804b1a138c53c603023.html ID: 6007813 URL: recap-data/2023/03/16/7b6222371df54459b2e912112f78896b.html 
ID: 6006679 URL: recap-data/2023/03/16/c5aae6e7e34849cb91cbcf9ed28b7f7b.html ID: 6006615 URL: recap-data/2023/03/16/f670e60a27244cc093b94ef7e9d00847.html ID: 6006464 URL: recap-data/2023/03/16/297f32aa774143eaa7a2a1769c69127c.html ID: 6006393 URL: recap-data/2023/03/16/6f84113efb1343c98223efa9e5f5cb66.html ID: 6006377 URL: recap-data/2023/03/16/f241032e69664b4faf3583cdae836a8e.html ID: 6006358 URL: recap-data/2023/03/16/d3084b12c8e44a9d8aca318a3771e09f.html ID: 6006069 URL: recap-data/2023/03/16/ee86d26150924eecbcab777c5247b3ba.html ID: 6006020 URL: recap-data/2023/03/16/ead1ea70536348fd9e0ad68a0ec6c1b2.html ID: 6005238 URL: recap-data/2023/03/16/cd3b05c9ddf846adafa1cd1be8aed8a1.html ID: 6004836 URL: recap-data/2023/03/16/9dcae064747944d484e9ebff7bde59bd.html ID: 6004835 URL: recap-data/2023/03/16/2a2909e106bb4cd0846df726b4e46eeb.html ID: 6004808 URL: recap-data/2023/03/16/6f04022348344e41b09034fe0d7ba7f4.html ID: 6004664 URL: recap-data/2023/03/16/ae33fd4e5909436e84c03ec4cb4f91f3.html ID: 6004323 URL: recap-data/2023/03/15/a6849fd5d0b74e2d9a5f6b4146ee588b.html ID: 6004306 URL: recap-data/2023/03/15/e8d37cf70dd04f9eb21f5f44aaf0a2c2.html ID: 6003381 URL: recap-data/2023/03/15/b50e3f5d6d0e41ccb04b91e6fa78b0fa.html ID: 6003086 URL: recap-data/2023/03/15/d4eda110f4ef44bc9c89b866ea43423f.html ID: 6002937 URL: recap-data/2023/03/15/1446a6c2abaf47ce9fdf283025ca2aca.html ID: 6002421 URL: recap-data/2023/03/15/b2b975f3d0ee415aba38653a3dc17997.html ID: 6002409 URL: recap-data/2023/03/15/386b18cfe7614cd49acf158acc103124.html ID: 6002388 URL: recap-data/2023/03/15/e9676f5b110b40d2bb6e1a8530c586ab.html ID: 6002282 URL: recap-data/2023/03/15/32e4eed3a4c24f1d9a672a480d675493.html ID: 6001702 URL: recap-data/2023/03/15/1db9bee4ad974768b8c0182661b1445c.html ID: 6000732 URL: recap-data/2023/03/15/ece23c7530744607baf43eb92dc5fc01.html ID: 6000573 URL: recap-data/2023/03/15/83bef315bf7e47db9659dfba87c31617.html ID: 6000318 URL: 
recap-data/2023/03/15/e7f5e57157184fe597330daf60b7f196.html ID: 5999692 URL: recap-data/2023/03/14/c406ccbd16ef47fc9c1ba0f895c7ef85.html ID: 5998765 URL: recap-data/2023/03/14/ad97723512e649aeaabb1f5f1bb5ce1b.html ID: 5998442 URL: recap-data/2023/03/14/b4ad1a32c412434ea52aa92157851f4c.html ID: 5998415 URL: recap-data/2023/03/14/56d737a563e34bfd93ec8ab820dea922.html ID: 5997727 URL: recap-data/2023/03/14/41ff9ace7fc44740bde0892495a1d936.html ID: 5997712 URL: recap-data/2023/03/14/c4871a2303c14b258e80d26f0f77e41d.html ID: 5997707 URL: recap-data/2023/03/14/24e9623b36824c63b05b41265582b391.html ID: 5997668 URL: recap-data/2023/03/14/cb235715e40d4f7f83fc0c69d13f0eac.html ID: 5997316 URL: recap-data/2023/03/14/b7b4e829903846deb6ced28ee51f211b.html ID: 5997310 URL: recap-data/2023/03/14/475eafb5f09d41178434238726b57152.html ID: 5997302 URL: recap-data/2023/03/14/432b3a9a6f6742f685ab684fd51261ec.html ID: 5997238 URL: recap-data/2023/03/14/04629a52d2894d559607c2b01b1831af.html ID: 5996900 URL: recap-data/2023/03/14/828c5dba251947e684470d609323b8bb.html ID: 5995616 URL: recap-data/2023/03/14/6cff17fc511643098c444ba5d34558a7.html ID: 5995429 URL: recap-data/2023/03/14/9e29d410ac9941c5a675416daff84dba.html ID: 5995424 URL: recap-data/2023/03/14/fe22f89d791f47f38237d085263e0a62.html ID: 5995402 URL: recap-data/2023/03/14/d0bb584c23fa434fbe36877c40a66150.html ID: 5995332 URL: recap-data/2023/03/14/ae9510423a1b4e74a8e90542d2347f2f.html ID: 5995327 URL: recap-data/2023/03/14/8bbd29b00eff4e4f9fbb667d33180c62.html ID: 5995324 URL: recap-data/2023/03/14/1bb88046bb194a3abc492e8952ea8b89.html ID: 5995322 URL: recap-data/2023/03/14/433339b4abd9428d8f675f5125764a2a.html ID: 5995300 URL: recap-data/2023/03/14/cb3fc40576814c898d4abeccebd0fae5.html ID: 5995284 URL: recap-data/2023/03/14/feea88106012415b81dd26345afcae13.html ID: 5995271 URL: recap-data/2023/03/14/ea37f8d210d24da5ba58b50aeb9bc1df.html ID: 5995268 URL: recap-data/2023/03/14/c67ca412a6ff470e9c9564a5a8041f43.html ID: 5995265 
URL: recap-data/2023/03/14/9b53f199a03a417c8aae4f62c5a77f6a.html ID: 5995218 URL: recap-data/2023/03/14/e48a24932b034398875d00e5df755c54.html ID: 5995208 URL: recap-data/2023/03/14/de8b229afde6498eb00f69a320303cdf.html ID: 5995207 URL: recap-data/2023/03/14/20ac8c5f4f2348c2b1b1430f5e726930.html ID: 5995203 URL: recap-data/2023/03/14/a104910be16f40f59510d8c29797837c.html ID: 5995202 URL: recap-data/2023/03/14/175b1c6206d84d8b81afaa123655ea7b.html ID: 5995193 URL: recap-data/2023/03/14/0208ed7407df4a8bbc6467270291cfcd.html ID: 5995187 URL: recap-data/2023/03/14/d3dcf19385db4a4caed542b303f7efeb.html ID: 5995146 URL: recap-data/2023/03/14/8d8815f8244c42ad88c79c72701cc459.html ID: 5995145 URL: recap-data/2023/03/14/1d1f9b484fca4a6187ba9dff598c2902.html ID: 5995144 URL: recap-data/2023/03/14/cfd3f19b1b9e4b67b7f055639b0b1a2b.html ID: 5995141 URL: recap-data/2023/03/14/db403092f0884a52b1a07d7d947bf26f.html ID: 5995083 URL: recap-data/2023/03/14/0c435ba314c646d7af060b2d712aa9cf.html ID: 5995082 URL: recap-data/2023/03/14/ce4b0783e6de4d39bdfeceed78d8d3b9.html ID: 5995081 URL: recap-data/2023/03/14/4db92da5ab374538879c7c1da88e706b.html ID: 5995080 URL: recap-data/2023/03/14/14956911925a44d0879293f2de199f2e.html ID: 5995079 URL: recap-data/2023/03/14/bdcbfa380d88452da4e820b36acab1dc.html ID: 5995078 URL: recap-data/2023/03/14/7daf27576ab84a0b9d9e4e24b4a12213.html ID: 5995077 URL: recap-data/2023/03/14/976b8e4d511843f1989bbe2b26d2f956.html ID: 5995076 URL: recap-data/2023/03/14/749480cea9ed417ca8f020eefdbb508b.html ID: 5995075 URL: recap-data/2023/03/14/0d6d6eae22304924bba7c53cf43f775e.html ID: 5995074 URL: recap-data/2023/03/14/c0866d5872304be582ec30a9883e8334.html ID: 5995073 URL: recap-data/2023/03/14/77dfc01ae0614ca0ba5c2ec1d5a0e03f.html ID: 5995072 URL: recap-data/2023/03/14/f2674d3edd2b4e3baa5b0fdae163a875.html ID: 5995071 URL: recap-data/2023/03/14/81a7fcbf6bfd4ab6bd3651d13cfdecd0.html ID: 5995070 URL: recap-data/2023/03/14/51ea0f695858491196ab80b602057675.html ID: 
5995056 URL: recap-data/2023/03/14/691a1f540d184e0490ab9e26c6b47e94.html ID: 5995054 URL: recap-data/2023/03/14/0ece719ad3e541109596cd1faf8902fc.html ID: 5995053 URL: recap-data/2023/03/14/c7a43bb403904dc49d6cd45a2ef8d808.html ID: 5995037 URL: recap-data/2023/03/14/d969b76342a54176a06030d0251f4d77.html ID: 5995036 URL: recap-data/2023/03/14/3aa9f21b29804b34b339a2b31c6530c8.html ID: 5995023 URL: recap-data/2023/03/14/e0bb33f288944fa0b10b35b9935d1764.html ID: 5995019 URL: recap-data/2023/03/14/770304123863466eb70c71a0ff75f1a3.html ID: 5995018 URL: recap-data/2023/03/14/69a8076e744b4fdd8d4a0407fed022eb.html ID: 5994996 URL: recap-data/2023/03/14/2ea929a37bb04708a7acb0d8fc69ca2e.html ID: 5994992 URL: recap-data/2023/03/14/840130c8adba449d94db05c1417aefc0.html ID: 5994953 URL: recap-data/2023/03/14/dc8243c7669b4927b567b9eece1f5f27.html ID: 5994914 URL: recap-data/2023/03/14/dfa65534640d482895a830aabfb9087b.html ID: 5994913 URL: recap-data/2023/03/14/a2eeb390d7054f73a9acabb1853b0025.html ID: 5994912 URL: recap-data/2023/03/14/8db8949186564508a9c6c4ea346b90ef.html ID: 5994911 URL: recap-data/2023/03/14/165fa1f1d06d452b969b4e033b42d483.html ID: 5994910 URL: recap-data/2023/03/14/1f6945917be34762afce49769bf1b89c.html ID: 5994909 URL: recap-data/2023/03/14/ac31d7b6f1ad4196a3f1774ef873694d.html ID: 5994908 URL: recap-data/2023/03/14/993720d8e5664201909627c43825cc52.html ID: 5994907 URL: recap-data/2023/03/14/85caee046e7a40198a6a2b9c4e88ef6b.html ID: 5994905 URL: recap-data/2023/03/14/2409ff88a3ea41ad991f9a3f8c40eff2.html ```

I dug into the new proposed approach of using the pacer_sequence_number to sort short descriptions for a proper merging of long descriptions. However, as you mentioned, numberless entries do not have a pacer_sequence_number, and since our problem is specifically related to such entries, this approach cannot be used.

As you also mentioned, we could use dates and times to sort the short descriptions and then merge them. However, I discovered a couple of issues while checking this:

- It appears that there is a bug when parsing and merging entries that have the same metadata but different descriptions.

e.g: https://www.courtlistener.com/docket/60109244/united-states-v-slaeker/

If you look at this docket you'll find two groups of entries where each group should be only one entry. Screenshot 2023-03-27 at 22 46 52

The problem is that the descriptions of these entries are the same. But they are sorted differently on each entry. When Juriscraper parses the feed, entries with the same metadata but different descriptions are joined by appending an "AND", but the order of these entries sometimes varies depending on when the feed was parsed, e.g.: dcd-08-13-2021-17-21-03.txt dcd-08-14-2021-14-43-31.txt

If you open the RSS feed dcd-08-13-2021-17-21-03.txt and search for Fri, 13 Aug 2021 14:43:06 GMT

You'll see the following group of entries: 1:21-mj-00545-1 USA v. SLAEKER:

[Speedy Trial - Excludable Start] [~Util - Set/Reset Hearings] [Order on Motion to Exclude] [Initial Appearance]

Description after being merged by juriscraper: Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings AND Order on Motion to Exclude AND Initial Appearance

Then, if you open the next file, which is the feed for the same court many hours later (again, search for Fri, 13 Aug 2021 14:43:06 GMT), you'll see the same group of entries but in a different order:

1:21-mj-00545-1 USA v. SLAEKER

[~Util - Set/Reset Hearings] [Speedy Trial - Excludable Start] [Initial Appearance] [Order on Motion to Exclude]

As you can see, the order changes, so entries are merged by Juriscraper in a different order: ~Util - Set/Reset Hearings AND Speedy Trial - Excludable Start AND Initial Appearance AND Order on Motion to Exclude.

Since these are numberless entries, the only way to detect the same entry when merging them in CL is the description. But as you can see, the description changes with the order, so the previous entry is not found and a new one is added.

So this problem is causing duplicate entries and, for the same reason, it could undermine the approach of ordering short descriptions on the same day so that we can merge their long descriptions properly.

I think this problem should be solved first. We could do something like this in CL: if there is an existing docket entry on the same day and the short description contains "AND," split the descriptions into parts, and then compare part by part instead of the whole short description. If all the parts match, we could say it is the same entry.
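The part-by-part comparison could look something like this. It is a sketch under the assumption that the parts are always joined with " AND "; `same_combined_description` is a hypothetical name, not an existing CL function.

```python
def same_combined_description(existing, incoming, sep=" AND "):
    """Return True if both descriptions hold the same parts, in any order."""
    # Sorting both part lists makes the comparison order-insensitive
    # while still respecting repeated parts.
    return sorted(existing.split(sep)) == sorted(incoming.split(sep))


first_crawl = (
    "Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings "
    "AND Order on Motion to Exclude AND Initial Appearance"
)
second_crawl = (
    "~Util - Set/Reset Hearings AND Speedy Trial - Excludable Start "
    "AND Initial Appearance AND Order on Motion to Exclude"
)
is_same = same_combined_description(first_crawl, second_crawl)
```

One caveat: a single description that itself contains " AND " would be split too, so this only holds if that never happens in practice.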

- On the other hand, here is an example of the difference between the date filed and the date entered that you mentioned:

If you look at the file:

dcd-08-13-2021-17-21-03.txt

And search for 1:21-mj-00545-1

You'll see 11 entries, two groups of 4 entries with the same date, like the one described before, and 3 other entries with different times and descriptions.

1:21-mj-00545-1 USA v. SLAEKER

Appoint Counsel Thu, 12 Aug 2021 18:02:47 GMT

Exclude Thu, 12 Aug 2021 18:05:58 GMT

Order on Motion to Appoint Counsel AND Speedy Trial - Excludable Start AND Order on Motion to Exclude AND Initial Appearance Thu, 12 Aug 2021 22:47:41 GMT

Exclude Fri, 13 Aug 2021 14:34:39 GMT

Speedy Trial - Excludable Start AND ~Util - Set/Reset Hearings AND Order on Motion to Exclude AND Initial Appearance Fri, 13 Aug 2021 14:43:06 GMT

And if you look at the docket or docket history report, the date filed for some of these entries is 08/09/2021 and 08/12/2021:

Screenshot 2023-03-28 at 10 02 24

But seems that the order of these entries as they appear in the feed (the date_filed we use) matches the order of the docket or docket history report. I just want to be sure that's true, could you help me to confirm it? For example, the first entry's short description is Appoint Counsel, while the docket history report says Motion to Appoint Counsel; the second one says Exclude, while the docket history report says Motion to Exclude. Are they the same?

If we can confirm that the order of the entries as they appear in the feed is the same as the order of entries in the docket sheet, I think we'll be able to use the date_filed and time_filed or the recap_sequence_number to order short descriptions before merging the long descriptions. (I checked other examples and this seems to be true; this one is just a bit tricky.)

- An additional point:

Before adding time_filed to docket entries and parsing the timezone from RSS feeds, the recap_sequence_number had the format 2021-08-12T18:05:58+00:00.001. After adding time_filed, the recap_sequence_number is computed based only on the date_filed, using the same format we use when adding entries from dockets: 2021-08-11.001.

However, I noticed a possible issue when adding multiple entries for a docket from an RSS feed on the same day. The recap_sequence_number is the same for all the entries added on that day, because different entries for the same docket are reported independently. Therefore, when computing the recap_sequence_number, the counter remains the same.

A couple of possible solutions for this might be:

To ensure that the recap_sequence_number can be properly computed when added into CL, we need to group docket entries when parsing the RSS feed in Juriscraper. This means that if there is more than one entry for the same docket, they should be added within the same docket, so that the docket_entries key will have multiple entries instead of one. However, this approach may require an additional tweak. If more entries for the same day are added afterward, they will start from recap_sequence_number counter 1. To address this, we may need to query the last entry on the database for that day so that the next recap_sequence_number counter can continue from the previous one.
Alternatively, we could go back to using both the date and time to compute the recap_sequence_number. This may be the easiest solution, as using the date and time will result in different recap_sequence_number even when the counter is the same (1).
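To make the two options concrete, here is a small sketch of both. It assumes the two formats quoted above, uses hypothetical helper names, and lets a plain list of existing sequence numbers stand in for the database query.

```python
from datetime import date, datetime, timezone


def sequence_number(filed, counter):
    """Build a sequence number such as '2021-08-11.001'.

    `filed` may be a date or a timezone-aware datetime; isoformat() yields
    '2021-08-11' or '2021-08-12T18:05:58+00:00' respectively.
    """
    return f"{filed.isoformat()}.{counter:03d}"


def next_counter(existing_numbers, filed):
    """Continue the counter after the highest one already used for `filed`."""
    prefix = f"{filed.isoformat()}."
    used = [
        int(number[len(prefix):])
        for number in existing_numbers
        if number.startswith(prefix)
    ]
    return max(used, default=0) + 1


# Option 1: date only, continuing from entries already stored for that day.
stored = ["2021-08-11.001", "2021-08-11.002", "2021-08-12.001"]
day = date(2021, 8, 11)
by_date = sequence_number(day, next_counter(stored, day))

# Option 2: date and time, so values differ even when the counter is 1.
moment = datetime(2021, 8, 12, 18, 5, 58, tzinfo=timezone.utc)
by_datetime = sequence_number(moment, 1)
```

Option 1 needs the lookup of existing entries, while option 2 gets unique values for free as long as the times differ.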

So, in brief:

There is a bug that leads to duplicated entries when merging docket entry descriptions from RSS feeds for the same date and time, if the same entries appear in a different order the next time the feed is crawled.
Currently, the recap_sequence_number for docket entries added by RSS feed is the same for entries added on the same day. We can fix this issue using one of the solutions suggested above.
To order short descriptions and then merge long descriptions, we can't use the pacer_sequence_number since minute entries don't get this field. Instead, we can use the date_filed and time_filed (for entries where available) or the recap_sequence_number (for entries that don't have time_filed) to order short descriptions. This is because the date_filed (and time_filed) for entries added via an RSS feed is the date the entry was published in the feed. The order in which entries appear in the feed seems to be the same as the order in which they appear in the docket sheet, but we need to confirm this is true. There are some tricky examples, such as the one described above, where the order seems a bit confusing.

Let me know what you think.

Man, Alberto, you're in the guts of things now! Good research. A few replies, since there's a lot to think about here:

1. The RSS identical date arbitrary ordering problem

The problem is that the descriptions of these entries are the same. But they are sorted differently on each entry.

This is annoying. It must be their database ordering things in arbitrary order when the dates are identical. There's a simple solution to this. Let's update Juriscraper to order these items alphabetically regardless of the order they appear in the feed, if they have the same dates.

You suggested that we:

if there is an existing docket entry on the same day and the short description contains "AND," split the descriptions into parts, and then compare part by part instead of the whole short description. If all the parts match, we could say it is the same entry.

But I think that won't be necessary if we handle it in Juriscraper.
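Handling it in Juriscraper could be as small as sorting before joining. This is a sketch, not Juriscraper's actual code: `join_descriptions` is a hypothetical name, and plain `sorted()` is assumed to be what "alphabetically" means here.

```python
def join_descriptions(descriptions, sep=" AND "):
    """Join same-timestamp descriptions in a stable, sorted order."""
    # sorted() compares by code point, so "~Util" lands after letters.
    return sep.join(sorted(descriptions))


first_crawl = [
    "Speedy Trial - Excludable Start",
    "~Util - Set/Reset Hearings",
    "Order on Motion to Exclude",
    "Initial Appearance",
]
second_crawl = [
    "~Util - Set/Reset Hearings",
    "Speedy Trial - Excludable Start",
    "Initial Appearance",
    "Order on Motion to Exclude",
]
joined = join_descriptions(first_crawl)
```

Both crawls then produce the same joined description, so the existing lookup by description would find the earlier entry.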

2. RSS ordering vs. Docket Sheet ordering question

You asked:

But [it] seems that the order of these entries as they appear in the feed (the date_filed we use) matches the order of the docket or docket history report. I just want to be sure that's true, could you help me to confirm it?

This is safe to assume, but with two caveats:

When there are multiple entries at the same moment, we've seen they don't get ordered consistently in RSS. I wouldn't be surprised if the docket sheet ordered them inconsistently for the same reason.
The docket sheet can be ordered by date filed or date entered. By default it's by date filed, but users can choose. We could detect the ordering in Juriscraper and only use the ones ordered by date_filed for merging this way, if we wanted to be clever about this.

3. Does `recap_sequence_number` need tweaks?

You observed:

After adding time_filed, the recap_sequence_number is computed based only on the date_filed, using the same format we use when adding entries from dockets: 2021-08-11.001.

I'm assuming we did this because we don't get the time value when we get docket reports, so including the time in some sequence numbers but not all of them was problematic? If so, I don't think we can continue adding the time to this field.

Of course, if adding the time back to this field doesn't cause trouble, it does seem like a simple solution though!

Your other, more complicated approach of

the docket_entries key [having] multiple entries instead of one.

And:

[querying] the last entry on the database for that day so that the next recap_sequence_number counter can continue from the previous one.

Makes sense to me, even though it seems like a pain! :)

Thanks for raising these tricky issues. I'm glad to see you wrestling with this beast.

Great, thanks for your answers!

Yeah, ordering descriptions in Juriscraper will be simpler. I'll work on it.

About recap_sequence_number: yes, we removed the time from it to standardize the sequence numbers, but I don't remember whether there was a specific problem related to it. I'll confirm.
