Link directly to 990 PDFs

detroitledger / gnl_profile

API & data management system for the Detroit Ledger

https://www.detroitledger.org

Link directly to 990 PDFs #160

Open hampelm opened 7 years ago

hampelm commented 7 years ago

As much as we dislike 'em, the 990 PDFs aren't going away. Since they provide so much info and are behind a login wall in many places, it'd be helpful to link directly to them on organization pages.

To do:

[ ] Find a stable source of them: either clone the S3 bucket or find someone we trust who does (the ProPublica terms might not work for us, but I can reach out to them: https://projects.propublica.org/nonprofits/)
[ ] Identify a mapping of EIN => 990 by year
[ ] Import that into our database
[ ] Include an array of 990 links sorted by year in our org response. I envision something like:


{
  ...org details...
  '990s': [{ year: 2015, url: 'https://s3...' }, ...]
}
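A minimal sketch of how the `990s` array above might be assembled once the EIN => 990 mapping is imported. It assumes the mapping is available as rows of `(ein, year, url)`; that row shape and the function name `build_990_links` are hypothetical, and ascending year order is an assumption (reverse it if newest-first is preferred).

```python
from typing import Dict, Iterable, List, Tuple, Union

Row = Tuple[str, int, str]  # (ein, year, url): hypothetical row shape


def build_990_links(rows: Iterable[Row], ein: str) -> List[Dict[str, Union[int, str]]]:
    """Return the 990 links for one EIN, sorted by year (ascending)."""
    links = [
        {"year": year, "url": url}
        for row_ein, year, url in rows
        if row_ein == ein
    ]
    # Sort so the org response always lists filings in a predictable order.
    return sorted(links, key=lambda link: link["year"])
```

The result can be attached to the org response under the `990s` key.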

hampelm commented 7 years ago

The Internet Archive has 900+ ISOs of filings from the IRS, organized by date and type: https://archive.org/details/IRS990?sort=-publicdate

I opened up a sample. Each ISO contains the PDFs plus a tab-delimited manifest file listing the file path, EIN, org name, filing type, date, and other metadata.
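A rough sketch of reading such a manifest. The column order below is an assumption based only on the description above (file path, EIN, org name, filing type, date); it should be checked against a real manifest, and `read_manifest` is a hypothetical helper name.

```python
import csv
from typing import Dict, Iterable, Iterator, Sequence

# Hypothetical column order; verify against an actual manifest file.
ASSUMED_COLUMNS = ("path", "ein", "org_name", "filing_type", "date")


def read_manifest(
    lines: Iterable[str], columns: Sequence[str] = ASSUMED_COLUMNS
) -> Iterator[Dict[str, str]]:
    """Yield one dict per manifest line, keyed by the assumed column names."""
    # QUOTE_NONE keeps quote characters in org names from being interpreted.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        # Any trailing metadata columns beyond the assumed ones are ignored.
        yield dict(zip(columns, fields))
```

Passing an open file object works, since `csv.reader` accepts any iterable of lines.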

hampelm commented 7 years ago

I'm hopeful that someone already has these on S3. Otherwise, scripting this won't be that hard, and it'll cost us about $1/month to host. We'll just have to find a good way to index them. Preserving the existing structure (year+type/ein+year+type.pdf) seems the most straightforward, with the lookup stored in a single flat table.
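One possible reading of the year+type/ein+year+type.pdf layout, as a sketch. The underscore separator and the helper names `s3_key` and `lookup_row` are assumptions; the comment above does not say how the parts are joined.

```python
from typing import Tuple


def s3_key(ein: str, year: int, filing_type: str, sep: str = "_") -> str:
    """Build an object key shaped like year+type/ein+year+type.pdf."""
    folder = f"{year}{sep}{filing_type}"
    return f"{folder}/{ein}{sep}{year}{sep}{filing_type}.pdf"


def lookup_row(ein: str, year: int, filing_type: str) -> Tuple[str, int, str, str]:
    """One row for the flat lookup table: (ein, year, filing_type, key)."""
    return (ein, year, filing_type, s3_key(ein, year, filing_type))
```

For example, `s3_key("123456789", 2015, "990")` gives `2015_990/123456789_2015_990.pdf`.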

hampelm commented 7 years ago

Following the instructions here: https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/

Do an advanced search and ask for a CSV: https://archive.org/advancedsearch.php?q=collection:IRS990
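A small sketch of building the CSV request URL without hand-encoding it. It assumes the `fl[]`, `rows`, and `output` query parameters work as described in the linked bulk-download instructions; verify them there before use. The function name and the default row count are hypothetical.

```python
from urllib.parse import urlencode


def advanced_search_csv_url(collection: str, rows: int = 1000) -> str:
    """Build an archive.org advanced search URL that asks for CSV output."""
    params = [
        ("q", f"collection:{collection}"),
        ("fl[]", "identifier"),  # assumed: return only the item identifier
        ("rows", str(rows)),     # assumed: must be at least the number of items
        ("output", "csv"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


print(advanced_search_csv_url("IRS990"))
```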

AWS machine for processing

ssh ec2-user@54.236.32.95 -i ~/.ssh/matth.pem

EBS volume mounted at /data

hampelm commented 7 years ago

Here's the list of 990 uploads on the archive: https://gist.github.com/hampelm/c5e22d1ac19bea8fd57b44aee4f09962

Work-in-progress wget command to capture a single one:

wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -A "*.iso" https://archive.org/download/IRS990-2010-09

Probably want to add a column called "s3path" to the file to define where each ISO will be uploaded on S3, since the directory paths vary.
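A sketch of adding that column, assuming the list is a CSV with a header row. The rule for choosing each path is left to a caller-supplied function, since the thread does not define it; `add_s3path_column` and `path_for` are hypothetical names.

```python
import csv
import io
from typing import Callable, Dict


def add_s3path_column(csv_text: str, path_for: Callable[[Dict[str, str]], str]) -> str:
    """Return the CSV text with an extra "s3path" column appended to each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + ["s3path"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # path_for decides the upload location for this row.
        row["s3path"] = path_for(row)
        writer.writerow(row)
    return out.getvalue()
```

A caller might pass something like `lambda row: "isos/" + row["identifier"]`, where both the `isos/` prefix and the `identifier` column name are placeholders.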

Downloads are running pretty slowly (2-3 MB/s on EC2), so this first part will take a while. The next step will be to mount the ISOs with something like:

sudo mount -o loop whatever.iso /mnt/iso
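Once an ISO is mounted, the PDFs could be collected with something like the sketch below. It assumes the mount point from the command above and matches the extension case-insensitively, in case the ISO filesystem upper-cases file names; `find_pdfs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def find_pdfs(mount_point: str) -> List[Path]:
    """Return every PDF under the mount point, in sorted order."""
    root = Path(mount_point)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `find_pdfs("/mnt/iso")` would list the files to upload for that ISO.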