Open hampelm opened 7 years ago
The Internet Archive has 900+ ISOs of filings from the IRS, organized by date and type: https://archive.org/details/IRS990?sort=-publicdate
I opened up a sample. Each ISO contains the PDFs plus a tab-delimited manifest file with the file path, EIN, org name, filing type, date, and other metadata.
I'm hopeful that someone already has these on S3. Otherwise, scripting the process won't be that hard, and it'll cost us about $1/month to host. We'll just have to find a good way to index them; preserving the existing structure (year+type/ein+year+type.pdf) seems the most straightforward, with the lookup stored in a single flat table.
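Rough sketch of what the year+type/ein+year+type.pdf layout and the flat lookup table could look like. The `_` separator, the function names (`s3_key`, `lookup_rows`), and the input column order (EIN, year, type) are all assumptions for illustration; the real manifest columns would need checking first.

```shell
# Build an S3 key of the form year+type/ein+year+type.pdf.
# $1 = year, $2 = filing type, $3 = EIN. The "_" separator is an assumption.
s3_key() {
  printf '%s_%s/%s_%s_%s.pdf\n' "$1" "$2" "$3" "$1" "$2"
}

# Read tab-delimited rows (EIN, year, type) on stdin and emit lookup-table
# rows: EIN, year, type, S3 key. The column order is hypothetical.
lookup_rows() {
  awk -F '\t' -v OFS='\t' \
    '{ print $1, $2, $3, $2 "_" $3 "/" $1 "_" $2 "_" $3 ".pdf" }'
}
```

For example, `s3_key 2010 EO 123456789` prints `2010_EO/123456789_2010_EO.pdf`, and piping manifest-style rows through `lookup_rows` gives one flat TSV that can be loaded as the lookup table.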
Following the instructions here: https://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
Do an advanced search and ask for a CSV: https://archive.org/advancedsearch.php?q=collection:IRS990
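Something like this should build the CSV request and tidy the result into a plain item list for wget. The `fl[]`, `rows`, and `output=csv` parameters follow the bulk-download blog post; the function names and the row limit are made up here, and I'm assuming the CSV comes back with a header line and quoted values.

```shell
# Build the advanced-search URL that returns item identifiers as CSV.
# $1 = collection name, $2 = maximum number of rows to return.
search_url() {
  printf '%s?q=collection:%s&fl%%5B%%5D=identifier&rows=%s&output=csv\n' \
    "https://archive.org/advancedsearch.php" "$1" "$2"
}

# Drop the CSV header line and strip quotes, leaving one identifier per line.
clean_ids() {
  tail -n +2 | tr -d '"'
}

# Example (needs network access):
#   curl -sS "$(search_url IRS990 2000)" | clean_ids > itemlist.txt
```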
AWS machine for processing
ssh ec2-user@54.236.32.95 -i ~/.ssh/matth.pem
ebs mounted at /data
Here's the list of 990 uploads on the archive: https://gist.github.com/hampelm/c5e22d1ac19bea8fd57b44aee4f09962
Work-in-progress wget command to capture a single one:
wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -A "*.iso" https://archive.org/download/IRS990-2010-09
Probably want to add a column called "s3path" to the file to define where each one will be uploaded on s3, since the directory paths vary
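A possible starting point for the s3path column, assuming the list is a simple CSV with a header row and the item identifier in the first column (no quoted commas). The rule used here, stripping the `IRS990-` prefix from the identifier, is only a placeholder default; since the directory paths vary, the values would still need a pass by hand.

```shell
# Append an "s3path" column to a CSV read from stdin.
# Assumes: header in row 1, identifier in column 1, no quoted commas.
add_s3path() {
  awk -F ',' -v OFS=',' '
    NR == 1 { print $0, "s3path"; next }   # extend the header row
    {
      path = $1
      sub(/^IRS990-?/, "", path)           # placeholder rule: drop the prefix
      print $0, path "/"
    }
  '
}

# Example: add_s3path < uploads.csv > uploads-with-s3path.csv
```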
Downloads are running pretty slow (2-3 MB/s on EC2), so this first part will take a while; next step will be to mount the ISOs with something like
sudo mount -o loop whatever.iso /mnt/iso
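To do that per ISO, a wrapper along these lines might work: mount read-only, list the contents, unmount. The function name `process_iso` and the `DRY_RUN` switch are hypothetical; with `DRY_RUN=1` it only prints the commands, which is handy for checking paths before touching anything.

```shell
# Mount one ISO read-only, list its contents, and unmount it again.
# $1 = path to the .iso, $2 = mount point.
# Set DRY_RUN=1 to print the commands instead of running them.
process_iso() {
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  $run sudo mkdir -p "$2"
  $run sudo mount -o loop,ro "$1" "$2"
  $run ls "$2"
  $run sudo umount "$2"
}

# Example: DRY_RUN=1 process_iso whatever.iso /mnt/iso
```

In a real run the `ls` step is where the PDFs and manifest would be copied out before unmounting.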
As much as we dislike 'em, the 990 PDFs aren't going away. Since they provide so much info and are behind a login wall in many places, it'd be helpful to link directly to them on organization pages.
To do: