dat-ecosystem-archive / datproject-discussions

a repo for discussions and other non-code organizing stuff [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

U.S. federal government data collection use case #48

Open joehand opened 8 years ago

joehand commented 8 years ago

From @feomike on August 19, 2014 15:19

The intent of this issue is to offer a potential use case scenario for dat developers to consider for future development.

Federal government data collection (general)

The federal government collects all kinds of data. A typical data collection goes something like this: the government provides a data template (e.g. a fixed-field-length or CSV file, an XML spec, or a similar data specification), describes valid values and business rules for the data spec (e.g. field 1 can only be a number between 1 and 10), and then builds a portal to manage the users submitting data (data producers are given a login/password, must go to a site, enter metadata and upload a file, then wait for a response that the data is accepted). One can only imagine the arcane ways this approach has been implemented across the government.
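To make the "business rules" idea concrete, here is a minimal sketch of how a data producer might check a CSV file locally before submitting it. The column name `field_1`, the `RULES` table, and the `validate_rows` helper are all hypothetical and used only for illustration; the only rule taken from the description above is "a number between 1 and 10".

```python
import csv
import io

# Hypothetical rule set: column name -> predicate on the raw string value.
# The single rule mirrors the example above (a number between 1 and 10).
RULES = {
    "field_1": lambda v: v.isdecimal() and 1 <= int(v) <= 10,
}


def validate_rows(csv_text, rules=RULES):
    """Return (row_number, column, value) for every rule violation."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Row numbers count data rows, starting at 1 (the header is not counted).
    for row_number, row in enumerate(reader, start=1):
        for column, is_valid in rules.items():
            value = (row.get(column) or "").strip()
            if not is_valid(value):
                errors.append((row_number, column, value))
    return errors


sample = "field_1,field_2\n3,a\n11,b\nx,c\n"
for row_number, column, value in validate_rows(sample):
    print(f"row {row_number}: {column}={value!r} violates its rule")
```

A real specification would carry many more rules, including cross-field ones, but the same shape (a table of named checks applied row by row) would let the rules travel alongside the data instead of living only inside a submission portal.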

Generally speaking, dat might help solve problems that are costly for both the federal government and the data producers who must submit data. Where dat could prove useful in solving these issues is along these lines:

The Home Mortgage Disclosure Act (HMDA), passed in 1975, collects data from financial institutions (banks and non-depository banks that loan money for mortgages). This data is used to help identify unfair lending practices, to provide public access for sunlight, and to ensure financial institutions are making enough credit available to communities.

HMDA data is loan-level data (e.g. all loans meeting certain criteria are required to be collected from the financial institutions by the government). Loan-level data includes the originator, borrower information (including race and ethnicity), loan amount, location and others. The full current specification of the data to be submitted is found at http://www.ffiec.gov/hmda/fileformats.htm.

HMDA data is collected from January to March for the previous calendar year and amounts to about 18-20 million rows of data annually. It is collected from roughly 7200 different financial institutions. Institutions can range from large multi-billion-dollar corporations (e.g. Wells Fargo) to very small institutions (even limited liability corporations) that might not even hold deposits but have carved out a business lending for some specific need (e.g. mike's llc - not real).

The capacity to produce the file format and satisfy all of the subsequent business rules for the data (http://www.ffiec.gov/hmda/edits.htm) varies with the size of the submitting institution (e.g. Wells Fargo likely has very large IT capacity; mike's llc is perhaps an MS Excel user).

The government does a thorough post-submission analysis of the data with more robust edits/business rules to evaluate the quality of the submitted data. These evaluations can result in resubmission of data (e.g. mike's llc's 10,000 rows of loans were all made on April 1).
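As a rough illustration of the kind of post-submission quality check described above, the sketch below flags a submission when almost every row shares a single date. The function names, the `min_rows` and `max_share` thresholds, and the ISO date strings are assumptions made for this example, not actual HMDA edits.

```python
from collections import Counter


def dominant_value_share(values):
    """Return (most_common_value, share_of_rows) for a column of values."""
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


def flag_suspicious_dates(action_dates, min_rows=1000, max_share=0.9):
    """Flag a large submission in which one date dominates.

    min_rows and max_share are hypothetical thresholds chosen for illustration.
    """
    _, share = dominant_value_share(action_dates)
    return len(action_dates) >= min_rows and share > max_share


# 10,000 loans that all claim the same date would be flagged for review.
submission = ["2014-04-01"] * 10_000
print(flag_suspicious_dates(submission))
```

A check like this could in principle run on the producer's side before upload, which is where a tool that ships data and its validation together might save a resubmission cycle.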

Copied from original issue: maxogden/dat#153

joehand commented 8 years ago

From @dazzaji on March 9, 2015 4:35

I had the opportunity to work with this data very superficially as part of a Code for America project and remember noting the importance, complexity, potential risks and the significant value that could result from broader awareness about and more varied use of the data set. It seems a natural candidate to make available as a streaming service according to an open standard. Has there been any further discussion about or work on this use case? Thanks.