Punderthings / fossfoundation

Directory of non-profit FOSS Foundations, with detailed metadata.
https://fossfoundation.info/
Apache License 2.0

Getting tax information automatically #9

Closed: atkissoncj closed this issue 8 months ago

atkissoncj commented 1 year ago

ProPublica has a tool, the Nonprofit Explorer, that tries to scrape info from scanned IRS 990 forms (the forms nonprofits have to submit each year). The Nonprofit Explorer has an API, so we could get the tax information from ProPublica automatically if a unique identifier is provided for each organization. See https://projects.propublica.org/nonprofits/api

ShaneCurcuru commented 1 year ago

Agreed; however, since tax data is year-based, we should probably keep it in a separate table that's easier to access directly (without having to crawl various MD files). We definitely need help building scrapers for data that we can reliably get, like ProPublica's listings.
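To make the "separate table" idea concrete, here is a minimal sketch of a year-based table stored as CSV, with one row per organization per tax year. The column names and the helper functions `write_tax_table` and `rows_for` are hypothetical; nothing in this thread defines the actual layout.

```python
import csv
import io

# Hypothetical column layout; the thread does not define one.
FIELDS = ["ein", "tax_year", "total_revenue", "total_expenses"]


def write_tax_table(records, out):
    """Write one row per (EIN, tax year), sorted so diffs stay stable."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in sorted(records, key=lambda r: (r["ein"], r["tax_year"])):
        # Missing values are written as empty cells rather than guessed.
        writer.writerow({name: rec.get(name, "") for name in FIELDS})


def rows_for(ein, table_text):
    """Read the table back and return only the rows for one EIN."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [row for row in reader if row["ein"] == ein]


if __name__ == "__main__":
    # Made-up placeholder figures, for illustration only.
    sample = [
        {"ein": "000000001", "tax_year": 2021, "total_revenue": 200, "total_expenses": 150},
        {"ein": "000000001", "tax_year": 2020, "total_revenue": 100, "total_expenses": 90},
    ]
    buf = io.StringIO()
    write_tax_table(sample, buf)
    print(buf.getvalue())
```

A flat file like this can be read directly by a spreadsheet or a script without crawling the per-organization MD files, which seems to be the point of keeping yearly data separate.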

For now, I've been doing that in another website, but if we have maintainer volunteers I'm happy to change how/where we store that data:

https://github.com/ShaneCurcuru/fossfunding/blob/main/_data/

ShaneCurcuru commented 1 year ago

Yes - code-wise this should be simple, once we define a mapping of what data we want to capture.

Practically, data from this API is likely to be 2-3 years behind the times, at a minimum, due to filing extensions and the many delays in the IRS system. It also only captures data back to about 2012, when the extracted data format was set up.

In any case, yes, a simple:

curl https://projects.propublica.org/nonprofits/download-filing?path=2002_08_EO%2F47-0825376_990_200204.pdf

does get a JSON structure where you can easily iterate through "filings_with_data": [...], and even the entries in filings_without_data include a pdf_url value you can use. Thanks for the reminder!
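As a rough illustration of iterating that JSON structure, the sketch below walks both filing lists and collects the PDF link for each entry. The keys `filings_with_data`, `filings_without_data`, and `pdf_url` come from the comment above; the per-organization endpoint in `API_URL`, the `tax_prd_yr` field name, and the helper names are assumptions to check against the ProPublica API documentation.

```python
import json
import urllib.request

# Assumed endpoint shape for the Nonprofit Explorer API; verify against the
# API docs linked earlier in this thread before relying on it.
API_URL = "https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json"


def fetch_organization(ein):
    """Download and parse the JSON document for one EIN (makes a network call)."""
    with urllib.request.urlopen(API_URL.format(ein=ein)) as resp:
        return json.load(resp)


def summarize_filings(doc):
    """Return one dict per filing, covering both lists in the response."""
    rows = []
    for key, has_data in (("filings_with_data", True), ("filings_without_data", False)):
        for filing in doc.get(key, []):
            rows.append({
                "tax_year": filing.get("tax_prd_yr"),  # assumed field name
                "pdf_url": filing.get("pdf_url"),
                "has_data": has_data,
            })
    return rows
```

Keeping the download (`fetch_organization`) separate from the parsing (`summarize_filings`) means the parsing can be tested against saved JSON files without hitting the API.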

ShaneCurcuru commented 1 year ago

On second thought: ProPublica's data comes with licensing terms and covers only a limited subset of the actual data on 990 forms. The IRS XML dumps have much more data, although they would require a little more coding work to extract into our own format (for easy use in research). What fields are most valuable for researchers? Do we (on the FOSS side) care about things like the size of the governing body, public support percentages, and what's reported as hours worked and compensation for directors/officers?

Many XML extraction tools require a big-data setup, but they're focused on trends across all nonprofits. Since we only want a limited set of orgs, but much richer data for each, it feels like we should find an IRS XML library and tweak it to export just our list of EINs to one or more spreadsheets or the like.
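For a short list of EINs, a dedicated library may not even be necessary. The sketch below uses only Python's standard library to pull a few fields from a single 990 XML return and skip any filer that is not on our list. The namespace and element paths (`TaxYr`, `VotingMembersGoverningBodyCnt`, `CYTotalRevenueAmt`) are assumptions about the IRS e-file schema, which varies between schema versions, and `extract_return` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace used by IRS e-file XML returns.
NS = {"irs": "http://www.irs.gov/efile"}

# Assumed element paths relative to the root <Return> element; real returns
# differ by schema version, so treat these as placeholders to verify.
FIELDS = {
    "tax_year": "irs:ReturnHeader/irs:TaxYr",
    "governing_body_size": "irs:ReturnData/irs:IRS990/irs:VotingMembersGoverningBodyCnt",
    "total_revenue": "irs:ReturnData/irs:IRS990/irs:CYTotalRevenueAmt",
}


def extract_return(xml_text, wanted_eins):
    """Return a flat dict for one return, or None if the filer is not wanted."""
    root = ET.fromstring(xml_text)
    ein = root.findtext("irs:ReturnHeader/irs:Filer/irs:EIN", namespaces=NS)
    if ein not in wanted_eins:
        return None
    row = {"ein": ein}
    for name, path in FIELDS.items():
        # Fields absent from this return come back as empty strings.
        row[name] = root.findtext(path, default="", namespaces=NS)
    return row
```

Each returned dict is one spreadsheet row, so the results can be written out with `csv.DictWriter`. Adding a field for researchers then only means adding one entry to `FIELDS`.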

ShaneCurcuru commented 8 months ago

Code to automatically download JSON of all available basic ProPublica data is already committed; we will be storing basic data from it soon, as suggested. Note that we're also tracking the Giving Tuesday 990 project, which aims to provide a much simpler dataset of all available IRS 990 forms in the future: https://990data.givingtuesday.org/#datasets