freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Plug gaps in disclosures from 2012 to 2019 #2032

Open mlissner opened 2 years ago

mlissner commented 2 years ago

Bill did a great analysis of this and produced the spreadsheet here:

https://docs.google.com/spreadsheets/d/1-QMHbpKMi0EoGVtFOpUQpd-R4m5-aAJOUgeG-EhMxW8/edit?usp=sharing

Lots of gaps. I'm forwarding this to the AO FDO to review. We'll see what they say and I'll keep track here.

mlissner commented 2 years ago

Just got a big data dump purporting to plug these holes. Bill is on it!

flooie commented 2 years ago

Some stats

~2,226 files (with some disclosures split over multiple files) representing ~2,220 disclosures. 67 files were identified for new judges. A first pass and review identified 16 files for 10 disclosures that have not yet been imported successfully.

We haven't attempted to import the remaining new judges yet.

flooie commented 2 years ago

Additionally, I'm still reviewing, but it seems like the AO incorrectly denied us disclosures in at least one case.

flooie commented 2 years ago

Note to self: this should include updating the coverage page before being closed.

mlissner commented 2 years ago

Is this done, @flooie ?

mlissner commented 1 year ago

I just re-ran this analysis to see how the FDO folks are doing. Not great, Bob!

Here's the code I ran:

from itertools import groupby

from cl.disclosures.models import REPORT_TYPES, FinancialDisclosure
from cl.people_db.models import Person


def find_missing(lst):
    # Years absent between the first and last entries of a sorted list
    return sorted(set(range(lst[0], lst[-1])) - set(lst))


fds = (
    FinancialDisclosure.objects
    .only('person_id', 'year')
    .filter(year__gte=2011)  # Prior to this we have loads of random stuff
    .exclude(report_type=REPORT_TYPES.NOMINATION)
    .order_by('person_id', 'year')  # groupby() needs rows sorted by person
)

# Print one pipe-delimited line per judge with gaps: name|pk|missing years
for key, group in groupby(fds, lambda x: x.person_id):
    p = Person.objects.get(pk=key)
    missing_years = find_missing([i.year for i in group])
    missing_str = ', '.join(str(i) for i in missing_years)
    if missing_years:
        print(f"{p.name_full}|{p.pk}|{missing_str}")

That gives some pretty good output that you can paste into a Google sheet and split into columns using Google's =SPLIT() formula.
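For anyone who wants to check the gap logic without a CourtListener database, here is a minimal sketch of the same approach on plain tuples. The `rows` data and the `gaps_by_person` helper are made up for illustration and are not part of the codebase; the rows are assumed to be pre-sorted by person and year, the way the `order_by()` above returns them.

```python
from itertools import groupby


def find_missing(years):
    """Return the years absent between the first and last entries of a sorted list."""
    return sorted(set(range(years[0], years[-1])) - set(years))


# Hypothetical (person_id, year) rows, sorted by person and then year.
rows = [
    (1, 2012), (1, 2013), (1, 2016),
    (2, 2014), (2, 2015),
    (3, 2011), (3, 2015),
]


def gaps_by_person(rows):
    """Map each person_id to its missing years, skipping people with no gaps."""
    gaps = {}
    for person_id, group in groupby(rows, key=lambda r: r[0]):
        missing = find_missing([year for _, year in group])
        if missing:
            gaps[person_id] = missing
    return gaps


# Same pipe-delimited shape as the script above, minus the name lookup.
for person_id, missing in gaps_by_person(rows).items():
    print(f"{person_id}|{', '.join(str(y) for y in missing)}")
```

Note that this only finds gaps between a person's earliest and latest disclosure on file; it can't tell if years are missing before the first or after the last one. In Google Sheets, `=SPLIT(A1, "|")` breaks each pasted line into columns.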

The result is that we're still missing about 120 disclosures, as listed in the second tab here:

https://docs.google.com/spreadsheets/d/1-QMHbpKMi0EoGVtFOpUQpd-R4m5-aAJOUgeG-EhMxW8/edit#gid=1335267140

I'll forward this to the FDO, and on the wheel turns.