freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Look into free PACER docs already in the DB. #709

Closed mlissner closed 7 years ago

mlissner commented 7 years ago

As discussed in https://github.com/freelawproject/courtlistener/issues/657#issuecomment-313547723, when our downloader of free PACER documents finished, it had roughly 109k items that it couldn't download because they were already in our DB:

PACERFreeDocumentRow.objects.filter(error_msg__startswith='Found the item').count()
108530

These need to be reviewed. If they're valid, we can:

  1. Remove the PACERFreeDocumentRow objects.
  2. Update the item in the DB to say that it was free.
  3. Tweak the code to update items in the DB in the future.

mlissner commented 7 years ago

This turned out to be quite easy. I did a quick test of the items above:

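from collections import Counter

# Assumes a Django shell where RECAPDocument, PACERFreeDocumentRow and
# map_pacer_to_cl_id are already imported.
rows = PACERFreeDocumentRow.objects.filter(error_msg__startswith='Found the item')

c = Counter()
for row in rows:
    rd = RECAPDocument.objects.filter(
        pacer_doc_id=row.pacer_doc_id,
        docket_entry__docket__pacer_case_id=row.pacer_case_id,
        docket_entry__docket__court_id=map_pacer_to_cl_id(row.court_id),
    )
    # Tally how many existing documents each row matches.
    count = rd.count()
    if count == 1:
        c['one'] += 1
    elif count > 1:
        c['more'] += 1
    else:
        c['zero'] += 1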

Which created the following stats:

Counter({'one': 103850, 'zero': 4680})

Right now I'm processing the 103k with:

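for row in rows:
    rd = RECAPDocument.objects.filter(
        pacer_doc_id=row.pacer_doc_id,
        docket_entry__docket__pacer_case_id=row.pacer_case_id,
        docket_entry__docket__court_id=map_pacer_to_cl_id(row.court_id),
    )
    # Only act on unambiguous matches.
    if rd.count() == 1:
        # Fetch the instance once. Each rd[0] on an unevaluated queryset runs
        # a new query and returns a fresh object, so setting the flag on one
        # rd[0] and saving another would not persist the change.
        doc = rd[0]
        doc.is_free_on_pacer = True
        doc.save()
        row.delete()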

When that's done, I'll dig into the remaining 4680 items.

mlissner commented 7 years ago

Another 4620 items are now resolved in the same manner as above. These didn't match at first because their pacer_doc_id values differ slightly: the fourth digit of a pacer_doc_id can be either a zero or a one, depending on some boolean in the system. For these items, that digit in the download differed from the one already in the DB. The way around this is to match on everything after the fourth digit, and on either the pacer_case_id or the docket number:

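from django.db.models import Q

for row in rows:
    rd = RECAPDocument.objects.filter(
        # Accept a match on either the case ID or the docket number.
        Q(docket_entry__docket__pacer_case_id=row.pacer_case_id)
        | Q(docket_entry__docket__docket_number=row.docket_number),
        docket_entry__docket__court_id=map_pacer_to_cl_id(row.court_id),
        # Skip the first four characters so the fourth digit is ignored.
        pacer_doc_id__endswith=row.pacer_doc_id[4:],
    )
    if rd.count() == 1:
        # Fetch the instance once; repeated rd[0] calls return new objects.
        doc = rd[0]
        doc.is_free_on_pacer = True
        doc.save()
        row.delete()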

Now there are 60 items remaining with this issue...getting there.
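
For reference, the fourth-digit rule can also be expressed outside the ORM. This is only a sketch: `normalize_pacer_doc_id` and `same_document` are hypothetical names, and it assumes the IDs are digit strings in which the fourth digit is the only position that varies. Unlike the `endswith` query above, it also compares the first three digits.

```python
def normalize_pacer_doc_id(pacer_doc_id):
    """Return the ID with its fourth digit forced to "0".

    Hypothetical helper: assumes the fourth digit is the only position
    that can differ between two copies of the same document.
    """
    if len(pacer_doc_id) < 4:
        # Too short to have a fourth digit; leave it untouched.
        return pacer_doc_id
    return pacer_doc_id[:3] + "0" + pacer_doc_id[4:]


def same_document(id_a, id_b):
    """Compare two pacer_doc_id values while ignoring the fourth digit."""
    return normalize_pacer_doc_id(id_a) == normalize_pacer_doc_id(id_b)


# The second ID is a made-up variant of the first with the fourth digit flipped.
print(same_document("03118399633", "03108399633"))  # True
print(same_document("03118399633", "03118400294"))  # False
```

Normalizing both sides to the same digit means the comparison does not depend on which variant happened to be stored first.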

mlissner commented 7 years ago

The remaining items here seem to be combined dockets, which we don't support yet. I'm going to delete these items, put a log of them here, and close this issue.

for row in rows:
    print("%s, %s, %s, %s, %s, %s" % (
        row.court_id, row.docket_number, row.case_name,
        row.pacer_case_id, row.pacer_doc_id, row.document_number,
    ))

scd, 3:13-cr-01020-JFA, United States v. Wright, 206077, 16316764100, 2
vawd, 7:10-cr-00054-SGW, United States v. Corbett, 78080, 19111442860, 5
wvsd, 2:12-cr-00119, United States v. Spinks, 90075, 20112417185, 3
cacd, 2:09-cr-00671-CAS, United States v. Burgos-Hernandez, 448882, 03118399633, 6
caed, 2:13-cr-00086-GEB, United States v. Wymer, 251538, 03316566834, 4
caed, 1:05-cr-00435-AWI, United States v. White, 142850, 0331477001, 4
cacd, 2:11-cr-00307-DMG, United States v. Trujillo, 498924, 031112090406, 7
ncwd, 3:06-cr-00151-FDW, United States v. Pileggi, 45634, 1351273066, 2
deb, 08-13141-KJC, Tribune Media Company, Reorganized Debtors, 115179, 042011846587, 10133
deb, 08-13141-KJC, Tribune Media Company, Reorganized Debtors, 115179, 042011840112, 10134
vawd, 7:10-cr-00054-SGW, United States v. Corbett, 78081, 19111442182, 5
wawd, 2:11-cr-00111-RSL, United States v. Pham, 174898, 19714136977, 10
tnmd, 3:12-cr-00137, United States v. Martin, 53599, 16912001529, 6
tnmd, 3:14-cr-00037, United States v. Zapien, 59033, 16912565797, 13
tnmd, 3:14-cr-00037, United States v. Zapien, 59034, 16912565686, 13
caeb, 13-29030, William Cheng and Janet Cheng, 528222, 032021621182, 783
caeb, 13-29030, William Cheng and Janet Cheng, 528222, 032021621188, 783
tnmd, 3:14-cr-00037, United States v. Zapien, 59034, 16912565651, 2
tnmd, 3:14-cr-00037, United States v. Zapien, 59035, 16912565819, 2
tnmd, 3:14-cr-00037, United States v. Zapien, 59034, 16912565663, 6
tnmd, 3:14-cr-00037, United States v. Zapien, 59035, 16912565831, 6
scd, 3:13-cr-01020-JFA, United States v. Wright, 206079, 16316763946, 2
scd, 3:13-cr-01020-JFA, United States v. Wright, 206076, 16316764152, 2
ncwd, 3:06-cr-00151-FDW, United States v. Pileggi, 45632, 1351690917, 2
ncwd, 3:05-cr-00400-FDW, United States v. Cummins, 46159, 1351263612, 2
ncwd, 3:07-cr-00119-FDW-DCK, United States v. Ligator, 49201, 1351383844, 4
ncwd, 3:07-cr-00119-FDW-DCK, United States v. Ligator, 49203, 1351383943, 4
azd, 2:12-cr-02066-SRB, United States v. Pineda-Bustos, 745657, 025110203528, 6
wawd, 2:14-cr-00059-RSL, United States v. Lundy, 199281, 19715680590, 8
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254557, 09715841089, 181
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254557, 09715840614, 180
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254557, 09715841178, 182
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254557, 09715844260, 184
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254557, 09715854253, 197
caed, 2:09-cr-00244-WBS, United States v. Alvarez Ramirez, 192934, 03313327185, 5
caed, 2:09-cr-00244-WBS, United States v. Alvarez Ramirez, 192936, 03313327163, 5
mied, 2:10-cr-20403-NGE-MKM, United States v. Kilpatrick, 254555, 09716335079, 305
ohsd, 3:15-cr-00128-TMR, United States v. Gray, 187869, 14315707948, 13
ohsd, 3:15-cr-00128-TMR, United States v. Gray, 187870, 14315707902, 13
ohsd, 3:15-cr-00127-TMR, United States v. Traum, 187856, 14315707535, 14
ncwd, 3:06-cr-00151-FDW-DCK, United States v. Pileggi, 45637, 1351272929, 2
ncwd, 3:06-cr-00151-FDW-DCK, United States v. Pileggi, 45628, 1351272604, 2
ncwd, 3:06-cr-00151-FDW-DCK, United States v. Pileggi, 45637, 1351690562, 5
ncwd, 3:06-cr-00151-FDW-DCK, United States v. Pileggi, 45637, 1351272567, 6
wvsd, 2:12-cr-00119, United States v. Spinks, 90076, 20112417171, 3
nyed, 1:11-cr-00623-DLI, United States v. Hasbajrami, 321830, 123110061572, 85
nyed, 1:11-cr-00623-DLI, United States v. Hasbajrami, 321830, 123111683201, 165
cacd, 2:08-cr-01011-VBF, United States v. Sarmiento, 424184, 03116507665, 5
cacd, 2:09-cr-00671-CAS, United States v. Burgos-Hernandez, 448884, 03118400294, 6
caed, 2:11-cr-00190-MCE, United States v. Ramirez, 223062, 03315844695, 6
caed, 2:14-cr-00276-JAM, United States v. Silva-Soto, 273495, 03317708909, 6
caed, 2:15-cr-00115-TLN, United States v. Khamkeuanekeo, 282130, 03318178476, 5
caed, 2:15-cr-00115-TLN, United States v. Khamkeuanekeo, 282130, 03318178491, 10
caed, 2:15-cr-00115-TLN, United States v. Khamkeuanekeo, 282130, 03318178496, 12
caed, 2:15-cr-00115-TLN, United States v. Khamkeuanekeo, 282133, 03318178304, 5
caed, 2:15-cr-00115-TLN, United States v. Khamkeuanekeo, 282135, 03318178214, 5
caed, 2:16-cr-00025-TLN, United States v. Velazquez, 290957, 03318660159, 7
vawd, 7:10-cr-00054-SGW, United States v. Corbett, 78081, 19111442560, 5
tnmd, 3:14-cr-00037, United States v. Zapien, 59032, 16912565740, 13
dcd, 1:13-cr-00253-RWR, United States v. CLASS, 161854, 04514688406, 76
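
Several docket numbers in this log appear under more than one pacer_case_id (for example, tnmd 3:14-cr-00037 under 59032 through 59035), which may be consistent with the combined-docket explanation. A rough sketch for grouping the log by docket follows. The function names are hypothetical, and it assumes every line has the six comma-separated fields printed above, with any extra commas belonging to the case name.

```python
from collections import defaultdict


def parse_log_line(line):
    """Split one log line into its six fields."""
    # The case name may itself contain ", ", so peel two fields off the
    # left and three off the right; whatever remains is the case name.
    court_id, docket_number, rest = line.strip().split(", ", 2)
    case_name, pacer_case_id, pacer_doc_id, document_number = rest.rsplit(", ", 3)
    return {
        "court_id": court_id,
        "docket_number": docket_number,
        "case_name": case_name,
        "pacer_case_id": pacer_case_id,
        "pacer_doc_id": pacer_doc_id,
        "document_number": document_number,
    }


def case_ids_by_docket(lines):
    """Map (court_id, docket_number) to the set of pacer_case_id values seen."""
    grouped = defaultdict(set)
    for line in lines:
        item = parse_log_line(line)
        grouped[(item["court_id"], item["docket_number"])].add(item["pacer_case_id"])
    return dict(grouped)


sample = [
    "tnmd, 3:14-cr-00037, United States v. Zapien, 59033, 16912565797, 13",
    "tnmd, 3:14-cr-00037, United States v. Zapien, 59034, 16912565686, 13",
    "deb, 08-13141-KJC, Tribune Media Company, Reorganized Debtors, 115179, 042011846587, 10133",
]
for key, case_ids in sorted(case_ids_by_docket(sample).items()):
    print(key, sorted(case_ids))
```

Dockets whose set holds more than one case ID are the ones worth a closer look.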

mlissner commented 7 years ago

Found another mess of these. These errors were being caused because we do not normalize docket numbers completely or consistently. For example, these two docket numbers are different:

In [101]: d.docket_number
Out[101]: u'6:14-ap-00176'

In [102]: row0.docket_number
Out[102]: u'6:14-ap-00176-KSJ'

Whether they actually represent the same docket, I couldn't tell you, but in our DB they are stored as different values. Where this gets messy is that we're inconsistent in the way we assign docket entries to dockets. Do we normalize the docket number or not? I think we probably shouldn't, but I think we generally have in the past.
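
One way to compare such values without rewriting what is stored is to strip the suffix only at comparison time. The sketch below is an illustration, not existing CourtListener code: `docket_number_core` and `docket_numbers_may_match` are hypothetical names, and it assumes that a trailing run of two-to-four-letter uppercase groups (such as `-KSJ` or `-NGE-MKM`) is a judge-initials suffix that can be ignored when matching.

```python
import re

# Assumption: a trailing run of "-XX" to "-XXXX" uppercase groups is a
# judge-initials suffix, e.g. "-KSJ" or "-NGE-MKM".
_JUDGE_SUFFIX = re.compile(r"(?:-[A-Z]{2,4})+$")


def docket_number_core(docket_number):
    """Return the docket number without any trailing initials suffix."""
    return _JUDGE_SUFFIX.sub("", docket_number.strip())


def docket_numbers_may_match(a, b):
    """True when two docket numbers agree once their suffixes are dropped."""
    return docket_number_core(a) == docket_number_core(b)


print(docket_number_core("6:14-ap-00176-KSJ"))  # 6:14-ap-00176
print(docket_numbers_may_match("6:14-ap-00176", "6:14-ap-00176-KSJ"))  # True
```

A match here only says the two strings share a core. Whether they are really the same docket still needs the court and, ideally, the pacer_case_id to agree.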

That's a battle for another day. The point for now is that we already had these documents in our system (as matched on their pacer_doc_id), so I just marked them as free and called it GOOD ENOUGH. It was only a few hundred of them, and I don't have a good way to resolve the docket confusion question at the moment.

Done here.