Update Case Query Docket Number Parsing

flooie commented 3 months ago

This is a follow-up issue from a conversation I had with @albertisfu.

We need to update juriscraper to take advantage of the new fields we are adding to the docket class. They are as follows (for now):

federal_dn_office_code
federal_dn_case_type
federal_dn_judge_initials_assigned
federal_dn_judge_initials_referred
federal_defendant_number

I've compiled an extensive list of district docket number edge cases, plus a few for bankruptcy, to test against, and I hope I've found a good regex pattern to parse out the information.

Bankruptcy
02-00017-LMK
1:24-bk-10757

District
1:01-cv-00570-PCH
2:20-mc-00021-JES-M_M
2:20-mc-00021-JES-g_g
1:20-cv-00021-GBD-SLC
1:20-cr-00033-CJW-MAR
2:20-sw-00156-tmp
3:20-cr-00061-TMB-MMS
1:20-mj-00061-N
1:20-mj-00061-N-1
1:20-mj-00061-N-2
1:20-cr-00061-KD
1:20-cr-00060-CG-N
4:20-cv-00059-AW-MJF
3:20-cv-00059-MCR-GRJ
2:20-mj-00061-MHB
1:20-cv-00120-WJM-KMT
3:20-cv-00021-GFVT-EBA
8:20-cr-00006-DOC
2:16-CM-27244-CMR
2:16-PV-27244-CMR
2:16-AL-27244-CMR
2:16-a2-27244-CMR
3:21-~gr-00001
3:21-y-00001
1:21-2255-00001
1:21-MDL-00001
1:21-adc-00001
1:21-crcor-00001
2:24-gj-00075-JS-1
3:20-cr-00070-TKW-MAL-1
3:20-cr-00070-TKW-2

Bad District Examples
4:20-mj-00061-N/A <-- N/A here stands for not assigned
4:20-cv-00061-CKJ-PSOT <-- PSOT stands for Pro Se ... Tucson

I wrote this simple function to test it.

import re

def _parse_dn_components(self, potential_docket_numbers):
    # Each named group maps onto one of the new docket fields listed above.
    regex = r"(?P<federal_dn_office_code>\d):\d{2}-(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5}(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?(?:-(?P<federal_defendant_number>\d))?"
    # re.match anchors only at the start, so any trailing text is ignored.
    match = re.match(regex, potential_docket_numbers)
    if match:
        return match.groupdict()
    return None
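For anyone who wants to try the pattern, here is a minimal sketch that runs it over a handful of the docket numbers listed above. The standalone name parse_dn_components, the compiled DN_REGEX constant, and the choice of samples are only for illustration and are not part of juriscraper; the pattern itself is copied unchanged from the function above.

```python
import re

# Same pattern as in the function above, split across lines for readability.
DN_REGEX = re.compile(
    r"(?P<federal_dn_office_code>\d):\d{2}-"
    r"(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5}"
    r"(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_defendant_number>\d))?"
)


def parse_dn_components(docket_number):
    """Return the named groups as a dict, or None when nothing matches."""
    match = DN_REGEX.match(docket_number)
    return match.groupdict() if match else None


# A few samples taken from the lists in this issue.
samples = [
    "3:20-cr-00070-TKW-MAL-1",  # both sets of initials plus defendant number
    "1:20-mj-00061-N-1",        # one initial followed by a defendant number
    "2:20-mc-00021-JES-M_M",    # underscore inside the referred initials
    "3:21-~gr-00001",           # unusual case type, no initials
    "02-00017-LMK",             # bankruptcy short form
    "4:20-mj-00061-N/A",        # "bad" district example
]

for sample in samples:
    print(sample, "->", parse_dn_components(sample))
```

Two things stand out when running it: 02-00017-LMK does not match at all, and for 4:20-mj-00061-N/A the assigned initials come back as N rather than N/A, because "/" is outside the character class and re.match does not require the whole string to match.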

mlissner commented 3 months ago

Nice one. I'll put this onto Alberto's backlog.

albertisfu commented 3 months ago

Great! I have a couple of questions so far:

The current docket_number parsing method and the way it is stored in CL won't change, correct?

I mean, currently, if we have 1:01-cv-00570-PCH, the parsed docket_number is 1:01-cv-00570. So, no change here, right?

The only change is that in Juriscraper, we will now return the following fields:

{
    'federal_dn_office_code': '3',
    'federal_dn_case_type': 'cr',
    'federal_dn_judge_initials_assigned': 'TMB',
    'federal_dn_judge_initials_referred': 'MMS',
    'federal_defendant_number': None
}

And these fields will be stored in the model.

Regarding this bankruptcy example: 02-00017-LMK is currently not parsed by the regex. I can tweak the regex to support this format or create a separate version for bankruptcy. Just to confirm, does LMK correspond to federal_dn_judge_initials_assigned or to a different field?
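As a rough idea of what a separate bankruptcy version could look like, here is a sketch for the short form only. Everything in it is an assumption: the names BANKRUPTCY_SHORT_DN and parse_short_bankruptcy_dn are made up, the YY-NNNNN-XXX shape is inferred from the single example 02-00017-LMK, and mapping LMK to federal_dn_judge_initials_assigned is unconfirmed.

```python
import re

# Hypothetical pattern for the short bankruptcy form, e.g. "02-00017-LMK".
# Assumption: the trailing token holds the assigned judge's initials.
BANKRUPTCY_SHORT_DN = re.compile(
    r"\d{2}-\d{5}"
    r"(?:-(?P<federal_dn_judge_initials_assigned>[a-zA-Z_]{1,5}))?$"
)


def parse_short_bankruptcy_dn(docket_number):
    """Return the same five keys as the district parser, or None on no match."""
    match = BANKRUPTCY_SHORT_DN.match(docket_number)
    if not match:
        return None
    return {
        # The short form carries no office code or case type.
        "federal_dn_office_code": None,
        "federal_dn_case_type": None,
        "federal_dn_judge_initials_assigned": match.group(
            "federal_dn_judge_initials_assigned"
        ),
        "federal_dn_judge_initials_referred": None,
        "federal_defendant_number": None,
    }


print(parse_short_bankruptcy_dn("02-00017-LMK"))
print(parse_short_bankruptcy_dn("1:24-bk-10757"))  # long form, not handled here
```

The long bankruptcy form 1:24-bk-10757 already matches the district pattern, so a fallback like this would only need to run when the main regex returns nothing.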

Regarding the bad examples:

4:20-mj-00061-N/A <-- N/A here stands for not assigned. In this case, should the returned federal_dn_judge_initials_assigned be N/A or None?
4:20-cv-00061-CKJ-PSOT <-- PSOT stands for Pro Se ... Tucson. In this case, should the returned federal_dn_judge_initials_referred be PSOT or None?

mlissner commented 3 months ago

> I mean, currently, if we have 1:01-cv-00570-PCH, the parsed docket_number is 1:01-cv-00570. So, no change here, right?

Right.

> The only change is that in Juriscraper, we will now return the following fields...

Right.

> does LMK correspond to federal_dn_judge_initials_assigned

I assume it's the assigned judge initials, but @flooie will know for sure.

> [Should it be] N/A or None

None (or, rather, blank, "", right?)

> In this case, should the returned federal_dn_judge_initials_referred be PSOT or None?

This is a good question. I'm inclined not to special-case this. Who knows what other junk some court might put in some day. I think it's better to capture it as the referred-to initials, and folks who work in that jurisdiction are probably used to this bit of confusion.

I'd want to avoid trying to identify all the possible weird ideas courts have ever or will ever come up with.

flooie commented 3 months ago

I think we should capture the two sets of initials and just store them. They are uncommon, and even though N/A is not a set of initials, having it will let us re-create the court's full docket number as they represent it.

Same for PSOT: the court uses the term, so we can just capture it. It's uncommon and shouldn't really cause us any issues.
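If we do store these tokens verbatim, note that the original pattern keeps PSOT as-is but only captures N from N/A, since "/" is not in the initials character class. One possible tweak, sketched below as an assumption rather than a settled design, is to accept the literal N/A as an alternative for the assigned initials. The names DN_REGEX_VERBATIM and parse_dn_verbatim are hypothetical.

```python
import re

# Variant of the original pattern: the assigned-initials group also accepts
# the literal token "N/A". Everything else is unchanged.
DN_REGEX_VERBATIM = re.compile(
    r"(?P<federal_dn_office_code>\d):\d{2}-"
    r"(?P<federal_dn_case_type>[a-zA-Z0-9]{1,5}|~gr)-\d{5}"
    r"(?:-(?P<federal_dn_judge_initials_assigned>N/A|[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_dn_judge_initials_referred>[a-zA-Z_]{1,5}))?"
    r"(?:-(?P<federal_defendant_number>\d))?"
)


def parse_dn_verbatim(docket_number):
    """Return the named groups as a dict, or None when nothing matches."""
    match = DN_REGEX_VERBATIM.match(docket_number)
    return match.groupdict() if match else None


for sample in (
    "4:20-mj-00061-N/A",
    "4:20-cv-00061-CKJ-PSOT",
    "1:20-mj-00061-N-1",
):
    print(sample, "->", parse_dn_verbatim(sample))
```

Listing N/A as an explicit alternative, instead of adding "/" to the character class, keeps the pattern from accepting arbitrary slashes in other initials. Whether that is preferable is a judgment call.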

albertisfu commented 3 months ago

Thanks! I'm starting to work on this. I'll let you know if more questions arise.

freelawproject / juriscraper

Update Case Query Docket Number Parsing #1101