detroitledger / gnl_profile

API & data management system for the Detroit Ledger
https://www.detroitledger.org
0 stars 2 forks source link

Importing bulk IRS data #138

Open hampelm opened 8 years ago

hampelm commented 8 years ago

This statement creates the unified 990s table:

create table form (
id bigserial primary key,
created_at timestamp default current_timestamp,
updated_at timestamp default current_timestamp,
source varchar(50),
org bigint,
ein varchar(50),
subseccd varchar(50),
pdf varchar(255),
filing_type varchar(50),
start_year int,
end_year int,
irs_year int,
filing_date varchar(25),
tax_period varchar(25),
contributions_and_grants bigint,
program_service_revenue bigint,
investment_income bigint,
other_revenue bigint,
total_revenue bigint,
grants_paid bigint,
benefits_paid bigint,
compensation bigint,
fundraising_fees bigint,
total_fundraising_expenses bigint,
other_expenses bigint,
total_expenses bigint,
revenue_less_expenses bigint,
total_assets bigint,
total_liabilities bigint,
net_assets bigint,
data jsonb
);

CREATE INDEX ein_idx ON combined (ein);
hampelm commented 8 years ago

To import:

A. Data downloads are at https://www.irs.gov/uac/soi-tax-stats-annual-extract-of-tax-exempt-organization-financial-data

B. Get desired file:

wget https://www.irs.gov/pub/irs-soi/12eofinextract990.zip

C. Extract file

unzip 12eofinextract990.zip

D. Generate schema for file:

csvsql -i postgresql py12_990.dat -d ' ' > py12_990.sql

This may give a segmentation fault for large files. It's probably running out of memory; try it on your personal computer if this happens.

E. Create a table using that statement:

psql postgres < py12_990.sql

F. Import the data into postgres:

psql postgres
postgres=#  copy py12_990 from '/tmp/py12_990.dat' delimiter ' ' csv header;
jessicamcinchak commented 8 years ago

Note that 2015 extracts are linked directly as .dat (not zipped). Adjust import by skipping step C.

Revise step D: csvsql -i postgresql 15eofinextract990pf.dat -d ' ' > py15_990pf.sql

Once imported, rename 2015 990 tables for consistency: postgres=# ALTER TABLE "15eofinextract990pf" RENAME TO "py15_990pf";

hampelm commented 8 years ago

Here are the statements I've come up with to standardize the 2012 data. I think I was overzealous in the schema and we can remove a lot of fields, but I'll think about that later.

To test these, I created a test table (matt, you should create one for yourself using the schema above).

You'll notice that these seem pretty redundant. Why select from a select statement? That's a quirk of postgres/SQL: if instead we did the much cleaner

insert into matt 
(ein, subseccd)
select subseccd as subseccd, ein as ein 
from 2012_data

the insert wouldn't save subseccd to the subseccd column of matt. Instead, it'd insert the data in the order of the columns in matt, no matter the column names. That's a recipe for hard-to-find bugs. This level of indirection means we can write the fields in any order.

Here goes -- you can use these as templates for doing the other tables:

2012 990s

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2012 990' as source,
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py12_990) as data
from py12_990 ) as stuff
limit 10

PY12 990 EZ

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'PY12 990 EZ' as source,
subseccd,
ein,
'990ez' as filing_type,
tax_prd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json(py12_990ez) as data
from py12_990ez ) as stuff
limit 10

2012 PF

insert into matt
(
source,
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source,
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2012 990PF' as source,
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py12_990pf) as data
from py12_990pf) as stuff
limit 10
bnchdrff commented 8 years ago

schemas for 2013, 2014, and 2015 990: https://gist.github.com/bnchdrff/14413f0bf8253729bd646373d91519a0

hampelm commented 8 years ago

2013 990

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2013 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_pd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py13_990) as data
from py13_990 ) as stuff
limit 10
hampelm commented 8 years ago

2013 EZ

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2013 EZ' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
tax_pd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py13_EZ") as data
from "py13_EZ" ) as stuff
limit 10
hampelm commented 8 years ago

2013 PF

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2013 PF' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990ez' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py13_990pf) as data
from py13_990pf) as stuff
limit 10 
jessicamcinchak commented 8 years ago

2014 990

insert into matt
(
source, 
ein, 
subseccd, 
filing_type, 
tax_period,
contributions_and_grants, 
program_service_revenue,
investment_income, 
total_revenue, 
compensation,
fundraising_fees, 
total_expenses, 
total_assets, 
total_liabilities,
net_assets, 
data
)
select
source, 
ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2014 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py14_990) as data
from py14_990 ) as stuff
limit 10 
jessicamcinchak commented 8 years ago

2014 990ez

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2014 990ez' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
taxprd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py14_EZ") as data
from "py14_EZ" ) as stuff
limit 10
jessicamcinchak commented 8 years ago

2014 990pf

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2014 990PF' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py14_990pf) as data
from py14_990pf) as stuff
limit 10 
jessicamcinchak commented 8 years ago

2015 990

insert into matt
(
source, 
ein, 
subseccd, 
filing_type, 
tax_period,
contributions_and_grants, 
program_service_revenue,
investment_income, 
total_revenue, 
compensation,
fundraising_fees, 
total_expenses, 
total_assets, 
total_liabilities,
net_assets, 
data
)
select
source, 
ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue, compensation,
fundraising_fees, total_expenses, total_assets, total_liabilities,
net_assets, data
from (select
'2015 990' as source, 
subseccd,
"EIN" as ein,
'990' as filing_type,
tax_prd as tax_period,
totcntrbgfts as contributions_and_grants,
totprgmrevnue as program_service_revenue,
invstmntinc as investment_income,
totrevenue as total_revenue,
othrsalwages + compnsatnandothr + compnsatncurrofcr as compensation,
profndraising as fundraising_fees,
totfuncexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetend as net_assets,
row_to_json(py15_990) as data
from py15_990 ) as stuff
limit 10 
jessicamcinchak commented 8 years ago

2015 990ez

insert into matt
(source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
total_expenses,
total_assets, total_liabilities,
net_assets, data)
select
source, ein, subseccd, filing_type, tax_period,
contributions_and_grants, program_service_revenue,
investment_income, total_revenue,
 total_expenses,
total_assets, total_liabilities,
net_assets, data
from (select
'2015 990 EZ' as source, 
subseccd,
"EIN" as ein,
'990ez' as filing_type,
a_tax_prd as tax_period,
totcntrbs as contributions_and_grants,
prgmservrev as program_service_revenue,
othrinvstinc as investment_income,
-- othrsalwages + compnsatncurrofcr as compensation,
-- profndraising as fundraising_fees,
totrevnue as total_revenue,
totexpns as total_expenses,
totassetsend as total_assets,
totliabend as total_liabilities,
totnetassetsend as net_assets,
row_to_json("py15_990ez") as data
from "py15_990ez" ) as stuff
limit 10
jessicamcinchak commented 8 years ago

2015 990pf

insert into matt
(
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
)
select
source, 
ein,
subseccd,
irs_year,
filing_type,
tax_period,
contributions_and_grants,
compensation,
investment_income,
total_revenue,
total_expenses,
total_assets,
total_liabilities,
grants_paid,
data
from (select
'2015 990 pf' as source, 
"SUBCD" as subseccd,
"EIN" as ein,
'990pf' as filing_type,
"TAX_PRD" as tax_period,
"TAX_YR" as irs_year,
"GRSCONTRGIFTS" as contributions_and_grants,
-- prgmservrev as program_service_revenue,
"NETINVSTINC" as investment_income,
"COMPOFFICERS" as compensation,
-- profndraising as fundraising_fees,
"TOTRCPTPERBKS" as total_revenue,
"TOTEXPNSPBKS" as total_expenses,
"TOTASSETSEND" as total_assets,
"TOTLIABEND" as total_liabilities,
"CONTRPDPBKS" as grants_paid,
-- totnetassetsend as net_assets,
row_to_json(py15_990pf) as data
from py15_990pf) as stuff
limit 10
hampelm commented 8 years ago

Note that the real inserts should go into the combined table or another official table. matt is just a scratchpad for my testing, so don't rely on it to have good data. Use the schema above to create your own table for testing imports.

JerryPoore commented 7 years ago

Hi Hampelm. I know nothing about how this works. I read about it in Friedman's book Thank you for Being Late. There is something here I know but how to mine it, I don't. I am an Enrolled Agent for the IRS and I help people resolve tax issues. I am looking for a data base that tells me who the IRS has sent Tax audits, tax notices, and tax liens and etc. The faster I know this the faster I can help people avoid overspending on high priced professionals. It's digitized somewhere, I need an expert to dig it out on a program like GitHub. Any comments? Thank you Jerry