freelawproject / recap

This repository is for filing issues on any RECAP-related effort.
https://free.law/recap/

RECAP does not parse or preserve criminal charge information #230

Closed johnhawkinson closed 6 years ago

johnhawkinson commented 6 years ago

A significant part of the PACER docket metadata for criminal cases is the criminal charge information. It appears in the Parties section of the docket report. Although the RECAP client sends the entire docket report to the server, the juriscraper parser and the CourtListener server neither parse nor store nor display this information.

Note that this is not a regression -- the historical IA RECAP server did not handle this information either. I thought about this recently when looking more closely at the parser in https://github.com/freelawproject/courtlistener/issues/754#issuecomment-356160196

For instance, in the case in question in that comment, https://www.courtlistener.com/docket/6254625/parties/united-states-v-benzadon-boutin/, the defendant's section of the docket report has this information:

Assigned to: Chief Judge Dora Lizette Irizarry

Defendant (1)
Salomon Benzadon Boutin
TERMINATED: 01/03/2018
represented by Mia Eisner-Grynberg
Federal Defenders of New York
One Pierrepont Plaza
16th Floor
Brooklyn, NY 11201
718-330-1257
Fax: 718-855-0760
Email: mia_eisner-grynberg@fd.org
LEAD ATTORNEY
ATTORNEY TO BE NOTICED

Peter Kirchheimer
Federal Defenders of New York, Inc.
One Pierrepont Plaza, 16th Floor
Brooklyn, NY 11201
(718) 330-1200
Fax: (718) 855-0760
Email: Peter_Kirchheimer@fd.org
TERMINATED: 12/11/2017
LEAD ATTORNEY
ATTORNEY TO BE NOTICED
Designation: Public Defender or Community Defender Appointment

S. Isaac Wheeler
Federal Defenders of New York, Inc.
52 Duane Street, 10th Floor
New York, NY 10007
(212)417-8717
Fax: (212)571-0392
Email: isaac_wheeler@fd.org
ATTORNEY TO BE NOTICED
Designation: Public Defender or Community Defender Appointment

Pending Counts

Disposition
None

Highest Offense Level (Opening)
None

Terminated Counts

Disposition
Attempted Money Laundering- Title 18, United States Code, Sections 1956(a)(3)(B), 2 and 3551 et seg.)
(1)
Dismissed on deft's motion.
Theft of Public Property- Title 18, United States Code, Sections 641, 2 and 3551 et seq
(2)
Dismissed on deft's motion.

Highest Offense Level (Terminated)
Felony

Complaints

Disposition
18 USC 1956

Which is formatted as:

[image: screen shot 2018-01-09 at 18 59 12]

Nothing from Pending Counts to Complaints is handled by the RECAP system, and it all should be. Honestly, there's an argument that maybe the CL docket page should provide a link to the raw underlying HTML. Perhaps one day we'll be at a point where it is all parsed, but that day might be far off. Why not let users see it?

mlissner commented 6 years ago

> Honestly, there's an argument that maybe the CL docket page should provide a link to the raw underlying HTML. Perhaps one day we'll be at a point where it is all parsed, but that day might be far off. Why not let users see it?

That's an interesting idea. The problem is...which underlying HTML? We store every upload that we get...some may be incomplete, some may be old? Maybe we provide it all as a list for users to sort out themselves?

brianwc commented 6 years ago

Saving all the HTML is probably unworkable. We should just start parsing and storing this super interesting data. Criminal attorneys would probably LOVE to be able to search by these fields.

mlissner commented 6 years ago

Oh, we're storing all the HTML already. It's only a couple GB so far.

johnhawkinson commented 6 years ago

Yeah, like:

> Saving all the HTML is probably unworkable.

Why? In what way?

> We should just start parsing and storing this super interesting data. Criminal attorneys would probably LOVE to be able to search by these fields.

Well, yes, that's the point of this Issue ;)

johnhawkinson commented 6 years ago

> That's an interesting idea. The problem is...which underlying HTML? We store every upload that we get...some may be incomplete, some may be old? Maybe we provide it all as a list for users to sort out themselves?

I don't think this is a "problem." People who want structured data will of course look to the parsers. You could just throw up the HTML in an apache autoindex directory, just like how the IA gives PDF+xml+torrent access, e.g. http://ia800101.us.archive.org/23/items/gov.uscourts.dcd.190596/:

[image: screen shot 2018-01-10 at 12 53 56]

Sure, you can apply more work and do better, but best is the enemy of good enough.

mlissner commented 6 years ago

No apache index pages for our users (they're too ugly and it's easy to do better), but I think we could do something similar without too much hassle. I just worry that seeing 50 versions of almost the same thing will confuse people.

Another important point: We aren't redacting the receipt info that's at the bottom of every page. Nuking that before posting it is a must or else you could figure out which people are pulling which data.

johnhawkinson commented 6 years ago

> No apache index pages for our users (they're too ugly and it's easy to do better)

Fair enough. But I would caution you that sometimes what we think is "better" actually… isn't. E.g. in https://github.com/freelawproject/recap/issues/195 there are half a dozen examples of things that the CL "pretty" docket page fails at that the IA docket page (not an autoindex) gets right. But several of them apply to the IA autoindex page, too.

"Good design is harder than it looks."

mlissner commented 6 years ago

Here's how I think this data will come together once scraped. This will hang off of a party in the docket JSON information:

                'criminal_data': {
                    'pending_counts': 'None',
                    'highest_offense_level_opening': 'None',
                    'highest_offense_level_terminated': 'Felony',
                    'counts': [{
                        'name': 'Attempted money laundering',
                        'disposition': '',
                        'status': 'pending',
                    }, {
                        'name': 'Theft of public property',
                        'disposition': "Dismissed on deft's motion",
                        'status': 'terminated',
                    }],
                    'complaints': [{
                        'name': '18 USC 1956',
                        'disposition': '',
                    }],
                },
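
As a rough illustration of how the counts section quoted at the top of this issue could be turned into roughly that shape, here is a minimal sketch. It works on the flattened plain text rather than on PACER's HTML, so it is not how juriscraper does it; `parse_criminal_section` and the heading table are hypothetical names, the `pending_counts` key is left out, and each complaint line is assumed to be a name with no disposition. It also assumes every numbered count is followed by a disposition line, which plain text cannot guarantee.

```python
import re

# Headings as they appear in the docket text, mapped to a parser state.
SECTION_HEADINGS = {
    "Pending Counts": "pending",
    "Terminated Counts": "terminated",
    "Complaints": "complaints",
    "Highest Offense Level (Opening)": "highest_offense_level_opening",
    "Highest Offense Level (Terminated)": "highest_offense_level_terminated",
}
COUNT_NUMBER = re.compile(r"^\((\d+)\)$")  # e.g. "(1)"


def parse_criminal_section(text):
    """Parse the flattened counts section of a docket into a dict."""
    data = {
        "highest_offense_level_opening": "",
        "highest_offense_level_terminated": "",
        "counts": [],
        "complaints": [],
    }
    section = None
    expect_disposition = False
    for line in (raw.strip() for raw in text.splitlines()):
        if not line or line == "Disposition":
            continue  # blank line or column header
        if line in SECTION_HEADINGS:
            section = SECTION_HEADINGS[line]
            expect_disposition = False
            continue
        if section is None:
            continue  # nothing to do before the first heading
        if section.startswith("highest_offense_level"):
            data[section] = line
        elif line == "None":
            continue  # empty counts or complaints section
        elif section == "complaints":
            data["complaints"].append({"name": line, "disposition": ""})
        elif COUNT_NUMBER.match(line) and data["counts"]:
            expect_disposition = True  # the next line describes the outcome
        elif expect_disposition:
            data["counts"][-1]["disposition"] = line
            expect_disposition = False
        else:
            data["counts"].append(
                {"name": line, "disposition": "", "status": section}
            )
    return data


SAMPLE = """
Pending Counts
Disposition
None
Highest Offense Level (Opening)
None
Terminated Counts
Disposition
Attempted Money Laundering- Title 18, United States Code, Sections 1956(a)(3)(B), 2 and 3551 et seg.)
(1)
Dismissed on deft's motion.
Theft of Public Property- Title 18, United States Code, Sections 641, 2 and 3551 et seq
(2)
Dismissed on deft's motion.
Highest Offense Level (Terminated)
Felony
Complaints
Disposition
18 USC 1956
"""

result = parse_criminal_section(SAMPLE)
```

A real parser would read the table cells from the HTML instead, which removes the ambiguity between a count name and a disposition.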

And on the CourtListener side, I think I'll hang this data as a table off of the party table.
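
To make "a table off of the party table" concrete, here is a small sketch using SQLite from the Python standard library. The table and column names (`party`, `criminal_count`, `party_id`) are made up for illustration and are not CourtListener's actual schema; the only point is the one-to-many link from a party to its counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE party (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Hypothetical child table: one row per count, pointing at its party.
    CREATE TABLE criminal_count (
        id          INTEGER PRIMARY KEY,
        party_id    INTEGER NOT NULL REFERENCES party(id) ON DELETE CASCADE,
        name        TEXT NOT NULL,
        disposition TEXT NOT NULL DEFAULT '',
        status      TEXT NOT NULL CHECK (status IN ('pending', 'terminated'))
    );
    """
)

party_id = conn.execute(
    "INSERT INTO party (name) VALUES (?)", ("Salomon Benzadon Boutin",)
).lastrowid
conn.executemany(
    "INSERT INTO criminal_count (party_id, name, disposition, status) "
    "VALUES (?, ?, ?, ?)",
    [
        (party_id, "Attempted Money Laundering", "Dismissed on deft's motion.", "terminated"),
        (party_id, "Theft of Public Property", "Dismissed on deft's motion.", "terminated"),
    ],
)

# All terminated counts, joined back to the party they belong to.
rows = conn.execute(
    """
    SELECT p.name, c.name, c.status
    FROM criminal_count AS c
    JOIN party AS p ON p.id = c.party_id
    WHERE c.status = 'terminated'
    ORDER BY c.id
    """
).fetchall()
```

Complaints could hang off the party the same way in a second child table, with the two highest-offense-level values stored as plain columns.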

In the front end, I'm a little confused by PACER. It shows something like:

Pending Counts
Highest Offense Level (Opening)
Terminated Counts
Highest Offense Level (terminated)
Complaints

Seems like a better layout would be:

Counts

| Description | Status | Disposition |
| --- | --- | --- |
| 18:1326 Illegal Reentry Following Deportation | Pending | BOP 37 months followed by three yrs supervised release. Special assessment $100.00 |
| Theft of Public Property- Title 18, United States Code, Sections 641, 2 and 3551 et seq | Terminated | Dismissed on deft's motion. |

Highest Offense Level

Opening: Felony
Terminated: None

Complaints: 1235: Reentry of deported alien

That seems a LOT better to me than splitting up the Counts and the offense level information. But maybe PACER did some user testing on this....
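
A sketch of how the merged Counts table above could be built from the `counts` list in the JSON. The function name `merged_count_rows` is hypothetical, and putting pending counts ahead of terminated ones is an assumption about the desired ordering, not something decided in this thread.

```python
# Lower number sorts first; unknown statuses go last.
STATUS_ORDER = {"pending": 0, "terminated": 1}


def merged_count_rows(counts):
    """Return (description, status, disposition) rows, pending counts first."""
    ordered = sorted(
        counts, key=lambda c: STATUS_ORDER.get(c["status"], len(STATUS_ORDER))
    )
    return [(c["name"], c["status"].capitalize(), c["disposition"]) for c in ordered]


counts = [
    {
        "name": "Theft of Public Property- Title 18, United States Code, "
                "Sections 641, 2 and 3551 et seq",
        "disposition": "Dismissed on deft's motion.",
        "status": "terminated",
    },
    {
        "name": "18:1326 Illegal Reentry Following Deportation",
        "disposition": "BOP 37 months followed by three yrs supervised release. "
                       "Special assessment $100.00",
        "status": "pending",
    },
]

rows = merged_count_rows(counts)
for description, status, disposition in rows:
    print(" | ".join((description, status, disposition)))
```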

johnhawkinson commented 6 years ago

Well, I would encourage you to not gratuitously change the format from CM/ECF's just because it makes more sense to you.

From the perspective of the litigants, the terminated counts don't matter anymore. The pending counts and the current highest offense level matter; the historical ones aren't relevant for prosecuting the case, for sentencing, etc. So it makes a certain amount of sense to have the relevant stuff first and the irrelevant stuff last.

mlissner commented 6 years ago

> Well, I would encourage you to not gratuitously change the format from CM/ECF's just because it makes more sense to you.

Agreed, which is why I was putting it out there how odd their design seemed.

mlissner commented 6 years ago

The full to do for this issue is:

For now I'm focused on 1-3 since search and historical data aren't a part of the current push. Hopefully the rest will come eventually.

mlissner commented 6 years ago

Recrawling is underway. Should be relatively fast. Once you filter down to dockets with parties and with 'cr' in their docket number, it's not that many. Only about 41k.

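from cl.search.models import Docket  # import path assumed; run in a CourtListener Django shell

# Criminal ('cr') dockets from RECAP that already have parties.
ds = Docket.objects.filter(parties__isnull=False, docket_number__icontains='cr', source__in=Docket.RECAP_SOURCES)
for d in ds:
    print("Doing %s" % d.pk)
    d.reprocess_recap_content()  # reprocess the RECAP content stored for this docket

mlissner commented 6 years ago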

I'm closing this bug. The only remaining piece is to make this searchable, and I'm creating a separate issue for that.