Data and scripts from the NYC budget.
The supporting schedules for each budget contain a treasure trove of data, down to the lowest level. [http://www.nyc.gov/html/omb/downloads/pdf/ss5_14.pdf]()
These PDFs are raw output from whatever budget system the city uses, and are very well structured making scraping much simpler.
All the data has already been downloaded and parsed, saved in this repository.
However, you will still need to run the csv2sqlite3.py
command below.
All data is downloaded from URLs of the scheme:
And so forth. Occasionally they are locked from text output. To unlock them, simply open in Chrome and save as a PDF under the same name.
You can then run:
$ pdftotext -layout data/raw/ss5_14.pdf &
$ pdftotext -layout data/raw/ss5_13.pdf &
And so forth for each PDF. This will save a text version in the same directory.
You will need xpdf in order to use
pdftotext
. If you're on mac, you should be able to install via brew:
brew install xpdf
python bin/parse.py data/raw/*.txt > data/processed/all.txt
You can use the bundled csv2sqlite3 to convert the CSV to sqlite.
python bin/csv2sqlite3.py data/processed/all.txt
Once you've created the local SQLite database, you can run queries on it.
$ sqlite3 data/processed/all.db
sqlite> select DESCRIPTION, SUM(value) from out where agency_name = 'DEPARTMENT OF EDUCATION' and key = '# POS' and budget_period = 'EXECUTIVE BUDGET FY15' and `inc/dec` is NULL GROUP BY DESCRIPTION;
description SUM(value)
---------------------------------------- --------------------
FULL TIME PEDAGOGICAL PRSONNEL 108959
FULL YEAR POSITIONS 10321
ULL TIME PEDAGOGICAL PRSONNEL 904
ULL YEAR POSITIONS 203
The Department of Education employs 75,000 teachers, although that count likely doesn't include many "full time pedagogical personnel".
select sum(value) from out where budget_period = 'EXECUTIVE BUDGET FY15' and `inc/dec` is null and key = 'AMOUNT';
sum(value)
----------------------------------------
74619472575
$74.6 billion is about where we should be!
sqlite> select agency_name, sum(value) from out where budget_period = 'EXECUTIVE BUDGET FY15' and `inc/dec` is null and key = 'AMOUNT' group by agency_name order by sum(value) DESC LIMIT 20;
agency_name sum(value)
---------------------------------------- --------------------
DEPARTMENT OF EDUCATION 20436893972
DEPARTMENT OF SOCIAL SERVICES 9620603993
MISCELLANEOUS 9490482227
PENSION CONTRIBUTIONS 8353527545
POLICE DEPARTMENT 4703668663
DEBT SERVICE 4257632151
ADMIN FOR CHILDREN'S SERVICES 2810722817
FIRE DEPARTMENT 1765803243
DEPARTMENT OF SANITATION 1483639811
DEPARTMENT OF HEALTH AND MENTAL HYGIENE 1338227487
DEPARTMENT OF CITYWIDE ADMIN SERVICE 1115030179
DEPARTMENT OF ENVIRONMENTAL PROTECT. 1092604893
DEPARTMENT OF CORRECTION 1060259177
DEPARTMENT OF HOMELESS SERVICES 951878443
CITY UNIVERSITY OF NEW YORK 914985409
DEPARTMENT OF TRANSPORTATION 800629441
HOUSING PRESERVATION AND DEVELOPMENT 512759197
DEPARTMENT OF INFO TECH & TELECOMM 452962565
DEPARTMENT OF PARKS AND RECREATION 355165362
DEPARTMENT OF YOUTH & COMMUNITY DEV 340150536
The Department of Education claims a $24 billion budget, although a significant portion of that could come from the state and federal government.
The Department of Social Services had a $9.3 billion budget in 2013, so $9.6 billion sounds about right.
The Police Department's budget has been in the realm of $3.6 billion, so $4.7 seems a bit excessive.
When querying, make sure to include budget_period = 'EXECUTIVE BUDGET FY15 and
inc/decIS NULL
in order to query the most recent budget for absolute
numbers, rather than increases/decreases.
You'll also want to limit by key = '# POS'
if you're looking up employment
counts, key = 'AMOUNT'
if you're looking up spending, and key = #CNTRCT
if you're looking up numbers of contracts.
sqlite> select distinct key from out;
# POS
AMOUNT
# CNTRCT
How much does NYC plan to spend on books next year?
sqlite> select sum(value) from out where DESCRIPTION LIKE '%BOOKS%' AND key='AMOUNT' and budget_period = 'EXECUTIVE BUDGET FY15' and `inc/dec` is NULL;
sum(value)
----------------------------------------
142469594
$142M on books! Who in the city loves to read so much?
sqlite> select agency_name, budget_code_name, description, value from out where DESCRIPTION LIKE '%BOOKS%' AND key='AMOUNT' and budget_period = 'EXECUTIVE BUDGET FY15' and `inc/dec` is NULL order by value desc limit 10;
agency_name budget_code_name description value
------------------------- ----------------------------------- --------------- ---------------
DEPARTMENT OF EDUCATION NYSTL - ELEMENTARY / MIDDLE BOOKS-OTHER 62135782
DEPARTMENT OF EDUCATION NON-PUBLIC SCHOOL PAYMENTS BOOKS-OTHER 16247770
DEPARTMENT OF EDUCATION NYSTL - HIGH SCHOOL BOOKS-OTHER 10685076
DEPARTMENT OF EDUCATION NYSTL - ELEMENTARY / MIDDLE LIBRARY BOOKS 7758692
DEPARTMENT OF EDUCATION REIMBURSEABLE SUPPORT-GE INST ELE/M BOOKS-OTHER 6842846
DEPARTMENT OF EDUCATION GE INST & SCHOOL SUPERVISION ELE/MI BOOKS-OTHER 6355113
DEPARTMENT OF EDUCATION GE INSTRUCTION & SCHOOL SUPERVISION BOOKS-OTHER 4698615
DEPARTMENT OF EDUCATION GE Central Managed Sch Supp-HS BOOKS-OTHER 3083980
DEPARTMENT OF EDUCATION GE HOLDING CODE - ELEMENTARY / MIDD BOOKS-OTHER 2803235
DEPARTMENT OF EDUCATION NON-PUBLIC SCHOOL PAYMENTS LIBRARY BOOKS 2069760
Elementary and middle schoolers, apparently.
If you want to create your own API, you need to import the data to your postgres database. The command you need to run is:
pg_restore --verbose -C --no-owner -h <YOUR_IP_ADDRESS> -p 5432 --jobs=4 -U <YOUR_DB_USERNAME> -d <YOUR_DB_NAME> data/processed/alladopted.dump