freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Update Case Query Docket Number Parsing #1101

Closed flooie closed 1 month ago

flooie commented 2 months ago

This is a follow-up issue from a conversation I had with @albertisfu.

We need to update juriscraper to take advantage of the new fields we are adding to the docket class. They are as follows (for now):

I've compiled an extensive list of docket number edge cases to test against, mostly for district courts plus a few for bankruptcy, and I hope I've found a good regex pattern to parse out the information.

Bankruptcy
02-00017-LMK
1:24-bk-10757

District
1:01-cv-00570-PCH
2:20-mc-00021-JES-M_M
2:20-mc-00021-JES-g_g
1:20-cv-00021-GBD-SLC
1:20-cr-00033-CJW-MAR
2:20-sw-00156-tmp
3:20-cr-00061-TMB-MMS
1:20-mj-00061-N
1:20-mj-00061-N-1
1:20-mj-00061-N-2
1:20-cr-00061-KD
1:20-cr-00060-CG-N
4:20-cv-00059-AW-MJF
3:20-cv-00059-MCR-GRJ
2:20-mj-00061-MHB
1:20-cv-00120-WJM-KMT
3:20-cv-00021-GFVT-EBA
8:20-cr-00006-DOC
2:16-CM-27244-CMR
2:16-PV-27244-CMR
2:16-AL-27244-CMR
2:16-a2-27244-CMR
3:21-~gr-00001
3:21-y-00001
1:21-2255-00001
1:21-MDL-00001
1:21-adc-00001
1:21-crcor-00001
2:24-gj-00075-JS-1
3:20-cr-00070-TKW-MAL-1
3:20-cr-00070-TKW-2

Bad District Examples
4:20-mj-00061-N/A <-- N/A here stands for "not assigned"
4:20-cv-00061-CKJ-PSOT <-- PSOT stands for Pro Se ... Tucson

I wrote this simple function to test it.

import re

def _parse_dn_components(self, potential_docket_numbers):
    # Named groups capture each component; the three trailing groups are optional.
    regex = r"(?P<federal_dn_office_code>\d):\d{2}-(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5}(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?(?:-(?P<federal_defendant_number>\d))?"
    match = re.match(regex, potential_docket_numbers)
    if match:
        return match.groupdict()
    # No match (for example, a docket number without the "office:" prefix).
    return None
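
To check the pattern against the edge cases listed above, here is a minimal sketch that runs it over a handful of them. Only the regex comes from the function above. The standalone `parse_dn_components` wrapper, the `DN_REGEX` name, and the choice of samples are assumptions made for illustration.

```python
import re

# Regex copied from the function above, split across lines for readability.
DN_REGEX = re.compile(
    r"(?P<federal_dn_office_code>\d):\d{2}-(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5}"
    r"(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_defendant_number>\d))?"
)


def parse_dn_components(docket_number):
    """Hypothetical standalone wrapper: return the named groups, or None."""
    match = DN_REGEX.match(docket_number)
    return match.groupdict() if match else None


# A few of the docket numbers listed in this issue.
SAMPLES = [
    "02-00017-LMK",             # bankruptcy, no "office:" prefix
    "1:24-bk-10757",
    "3:20-cr-00061-TMB-MMS",
    "1:20-mj-00061-N-1",
    "3:21-~gr-00001",
    "4:20-mj-00061-N/A",        # "bad" example
    "4:20-cv-00061-CKJ-PSOT",   # "bad" example
]

if __name__ == "__main__":
    for dn in SAMPLES:
        print(dn, "->", parse_dn_components(dn))
```

Two things stand out when running this. `02-00017-LMK` returns no match because the pattern requires the leading `office:` digit and colon. And because `re.match` only anchors at the start of the string and `/` is not in the initials character class, `4:20-mj-00061-N/A` yields `N` rather than `N/A` as the assigned initials.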

mlissner commented 2 months ago

Nice one. I'll put this onto Alberto's backlog.

albertisfu commented 2 months ago

Great! I have a couple of questions so far:

The current docket_number parsing method and the way it is stored in CL won't change, correct?

I mean, currently, if we have 1:01-cv-00570-PCH, the parsed docket_number is 1:01-cv-00570. So, no change here, right?

The only change is that in Juriscraper, we will now return the following fields:

{
'federal_dn_office_code': '3',
 'federal_dn_case_type': 'cr',
 'federal_dn_judge_initials_assigned': 'TMB',
 'federal_dn_judge_initials_referred': 'MMS',
 'federal_defendant_number': None
}

And these fields will be stored in the model.
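
To make the question concrete, here is a rough sketch of the behavior described above: the core docket number stays as it is today, and the new components are returned alongside it. The `split_docket_number` helper and the extra `docket_number` group are hypothetical. They are not how Juriscraper or CL actually derive the stored value, only an illustration that reuses the regex from the issue.

```python
import re

# Assumption: wrap the "office:year-type-sequence" prefix in its own named
# group so the core docket number can be returned separately. The suffix
# groups follow the regex proposed in this issue.
PATTERN = re.compile(
    r"(?P<docket_number>(?P<federal_dn_office_code>\d):\d{2}"
    r"-(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5})"
    r"(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_defendant_number>\d))?"
)


def split_docket_number(raw):
    """Hypothetical helper: return (core docket number, component dict).

    Falls back to (raw, {}) when the pattern does not match.
    """
    match = PATTERN.match(raw)
    if not match:
        return raw, {}
    components = match.groupdict()
    core = components.pop("docket_number")
    return core, components


if __name__ == "__main__":
    core, components = split_docket_number("1:01-cv-00570-PCH")
    print(core)        # the unchanged core docket number
    print(components)  # the new fields
```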

Regarding the bad examples:

mlissner commented 2 months ago

I mean, currently, if we have 1:01-cv-00570-PCH, the parsed docket_number is 1:01-cv-00570. So, no change here, right?

Right.

The only change is that in Juriscraper, we will now return the following fields...

Right.

does LMK correspond to federal_dn_judge_initials_assigned

I assume it's the assigned judge initials, but @flooie will know for sure.

[Should it be] N/A or None

None (or, rather, blank, "", right?)

In this case, should the returned federal_dn_judge_initials_referred be PSOT or None?

This is a good question. I'm inclined not to special-case this. Who knows what other junk some court might put in some day. I think it's better to capture it as the referred-to initials, and folks who work in that jurisdiction are probably used to this bit of confusion.

I'd want to avoid trying to identify all the possible weird ideas courts have ever or will ever come up with.

flooie commented 2 months ago

I think we should capture the two sets of initials and just store them. They are uncommon, and even though N/A is not a set of initials, having it will let us re-create the court's full docket number as they represent it.

Same for PSOT: they use the term, so we can just capture it. It's uncommon and shouldn't really cause us any issues.
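
As an illustration of why storing the values verbatim is useful, here is a minimal sketch of rebuilding the court's full docket number from the core number plus the stored components. The `rebuild_full_docket_number` helper is hypothetical. It assumes the fields use the names proposed earlier in this thread and hold the values exactly as the court writes them (including oddities like `N/A` or `PSOT`).

```python
# Field names as proposed in this issue, in the order they appear
# in the court's docket number suffix.
SUFFIX_KEYS = (
    "federal_dn_judge_initials_assigned",
    "federal_dn_judge_initials_referred",
    "federal_defendant_number",
)


def rebuild_full_docket_number(docket_number, components):
    """Hypothetical helper: append the stored suffix parts to the core number."""
    parts = [docket_number]
    for key in SUFFIX_KEYS:
        value = components.get(key)
        if value:  # skip both None and the empty string
            parts.append(str(value))
    return "-".join(parts)


if __name__ == "__main__":
    print(rebuild_full_docket_number(
        "4:20-cv-00061",
        {
            "federal_dn_judge_initials_assigned": "CKJ",
            "federal_dn_judge_initials_referred": "PSOT",
        },
    ))
```

Skipping both `None` and `""` means the helper works whichever empty representation ends up in the model.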

albertisfu commented 2 months ago

Thanks! I'm starting to work on this. I'll let you know if more questions arise.