hmda further back? - Githubissues

floswald commented 6 years ago

hi @ajdamico ,

i just pulled hmda with lodown, works like a charm. i would need the data a bit further back, starting 2000 or even 1990. I tried to get something going myself but i'm not totally clear what i need to do. i know the urls of the LAR datasets. for 1981-1990 they look like

# https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/HMD_FACDSB82.txt
GREAT COUNTRY BK            010900201160060100090016020000000000000010001000000080000000000000000000000000000000000000000000  
GREAT COUNTRY BK            010900201160060900090016020000000000000000000000000000100010000000040000000000000000000000000000

while 1991 seems to look like

# https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/HMS.F1990.LARS.zip
19910000035301B121000051NA  NANA NA     331200260          1
19910000035301B121000101NA  NANA NA     581400320          2
19910000035301B121000051NA  NANA NA     551200380          3

From looking at your code I see that you create a sas like Reporter_Panel_2010.sas that tells you fieldwidths and you use that to read directly into the db. i was wondering how you created that? i look at the website of the 1991 file, for example, and i see that the layout is similar to what you get in system.file("extdata", "hmda/lar_str.csv", package = "lodown"). but i can't find anything regardign fieldwidths, because the "hmda/lar_str.csv" doesn't have that info, right? anyway, if you have time to look at that, great, otherwise can you give me a bit more guidance? thanks!

ajdamico commented 6 years ago

hi, i think it needs to be done by hand? the positions are in the layouts pdfs, for example https://catalog.archives.gov/id/2580727

i eventually intend to automate archive.gov querying using https://stackoverflow.com/questions/48463895/how-to-scrape-all-files-in-a-catalog-series-from-the-national-archives-archives

i also need to remove monetdblite from the function and instead do something like nppes http://asdfree.com/national-plan-and-provider-enumeration-system-nppes.html

all of this is a low priority for me unless you have funding. whatever you can contribute would be appreciated (sharing your layouts file would be a huge help since it's just a data entry task). thanks

floswald commented 6 years ago

hi again, I may have some manpower to put onto this job. I would need to get a bit more info on the data structure though, so I would be able to explain the task. I'd propose to just post a couple of snippets here for my own benefit and you can comment on them?

Current State

`get_catalog_hmda`

gets the references for the most recent available years from hmda:

> h[1:15,]
   year          type                                                                      full_url merge_table  dbfolder       db_tablename
1  2006      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2006HMDAlar%20-%20National.zip   hmda_2006 ./MonetDB      hmda_lar_2006
12 2006      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2006PMIClar%20-%20National.zip   pmic_2006 ./MonetDB      pmic_lar_2006
20 2006     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2006HMDAinstitutionrecords.zip   hmda_2006 ./MonetDB     hmda_inst_2006
31 2006     pmic_inst        https://www.ffiec.gov/pmicrawdata/OTHER/2006PMICinstitutionrecords.zip   pmic_2006 ./MonetDB     pmic_inst_2006
2  2007      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2007HMDAlar%20-%20National.zip   hmda_2007 ./MonetDB      hmda_lar_2007
13 2007      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2007PMIClar%20-%20National.zip   pmic_2007 ./MonetDB      pmic_lar_2007
21 2007     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAinstitutionrecords.zip   hmda_2007 ./MonetDB     hmda_inst_2007
32 2007     pmic_inst        https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICinstitutionrecords.zip   pmic_2007 ./MonetDB     pmic_inst_2007
39 2007 hmda_reporter             https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAReporterPanel.zip        <NA> ./MonetDB hmda_reporter_2007
49 2007 pmic_reporter             https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICReporterPanel.zip        <NA> ./MonetDB pmic_reporter_2007
56 2007      hmda_msa                 https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip        <NA> ./MonetDB      hmda_msa_2007
66 2007      pmic_msa                 https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip        <NA> ./MonetDB      pmic_msa_2007
3  2008      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2008HMDAlar%20-%20National.zip   hmda_2008 ./MonetDB      hmda_lar_2008
14 2008      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2008PMIClar%20-%20National.zip   pmic_2008 ./MonetDB      pmic_lar_2008
22 2008     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2008HMDAinstitutionrecords.zip   hmda_2008 ./MonetDB     hmda_inst_2008

some years have reporter panels, some don't
loan application register (LAR) and institution file (inst) are merged into a hmda_yyyy file for each year.
same for private mortgage insurance companies (PMIC)

`lodown_hmda`

uses catalog defined above to download data directly into Monet database
column names and data types are defined in files system.file("extdata", "hmda/ins_str.csv", package = "lodown") and system.file("extdata", "hmda/lar_str.csv", package = "lodown")
these are the same for all hmda datasets 2006-now:

> lar_str
                        field             type
1                    AsOfYear DOUBLE PRECISION
2                RespondentID     VARCHAR(255)
3                  AgencyCode     VARCHAR(255)
4                    LoanType DOUBLE PRECISION
5                PropertyType     VARCHAR(255)
6                 LoanPurpose DOUBLE PRECISION
7                   Occupancy DOUBLE PRECISION
8                  LoanAmount DOUBLE PRECISION
9                 Preapproval     VARCHAR(255)
10                 ActionType DOUBLE PRECISION
11                     MSA_MD     VARCHAR(255)
12                  StateCode     VARCHAR(255)
13                 CountyCode     VARCHAR(255)
14          CensusTractNumber     VARCHAR(255)
15         ApplicantEthnicity     VARCHAR(255)
16       CoApplicantEthnicity     VARCHAR(255)
17             ApplicantRace1     VARCHAR(255)
18             ApplicantRace2     VARCHAR(255)
...

I don't know what the office.lines part is doing.

data download and reading of data

some of the data are not structured into columns when downloaded. the main data are CSV.
we need to know the field length of each variable in order to read the data correctly
The data comes in SAS format, so we need a SAS script with this info
the script for ReporterPanel differs for different periods. there is a switch at line 175 to choose the right script:

if( catalog[ i , 'year' ] < 2010 ) sas_ri <- pr_str else sas_ri <- rp_sas

# read that temporary file directly into MonetDB,
# using only the sas importation script
    read_SAScii_monetdb (
        unzipped_files ,            # the url of the file to download
        sas_ri ,            # the 
        zipped = FALSE ,    # the ascii file is stored in a zipped file
        tl = TRUE ,     # convert all column names to lowercase
        tablename = catalog[ i , 'db_tablename' ] ,
        connection = db
        )

both institutionrecords as well as lar%20-%20National (the main LAR data) are structured with column delimiters in year 2006-now. (delimiters differ)

To Do: extend to years before 2006

we do not have the nicely column-separated data for LAR from fhfa.gov, i.e. we don't have CSV.
instead we have a pdf with column names and widths for each year.
we need to create importation scripts similar to the reporterpanel above.

Example: data for 2004

go to https://catalog.archives.gov
search for hmda lar 2004
click on Non-restricted Final Loan Application Register (LAR) Data, 2004
end up here: https://catalog.archives.gov/id/5716420
item 3: View/Download 2004 Final LAR Layout. is a pdf file.
translate that into SAS like

INPUT
Activity_Year 4
Respondent_ID $ 10
Agency_Code $ 1
Loan_Type $ 1
Loan_Purpose $ 1
...

where $ denotes string, and should be put if the Type in the pdf says AN (alphanumeric).

add this script as lar_2004.sas to system.file("extdata", "hmda/lar_str.csv", package = "lodown")

change section after line 213:

if ( grepl( "inst|lar" , catalog[ i , 'type' ] ) ){

if year==2004 use

read_SAScii_monetdb (
unzipped_files ,            # the url of the file to download
lar_2004.sas ,          # the sas script for 2004
zipped = FALSE ,    # the ascii file is stored in a zipped file
tl = TRUE ,     # convert all column names to lowercase
tablename = catalog[ i , 'db_tablename' ] ,
connection = db
)

it may well be that several years share the same layout (1999-2004, say), in which case you just use the same script for all of those.
The rest of the script is data cleaning and processing, hopefully there is not much change in variable names. If there is, this also needs to be done by year.

ajdamico commented 6 years ago

hi, like i said, no monetdblite. maybe review the nppes post? thanks

floswald commented 6 years ago

oh, so you want to store everything as csv now?

fine by me. but for hmda you still are going to need to bits about the SAS scripts above, etc right?

I didnt' understand initially, so it seems you want to rewrite this entire script, not only to add on the earlier stuff?

ajdamico commented 6 years ago

yep, needs to be re-written. yep, sas structures still needed for import. i think lodown functions should only get the csv files and the structure files to the user's computer, and from there the asdfree.com page gives an example of how to import and work with a subset of columns like it does with the nppes example? thanks

ajdamico commented 11 months ago

hi! apologies for the long delay. i've made a couple of big updates to asdfree.com that hopefully make the website a bit better, but i've decided to stop maintaining the lodown package so probably won't fix the bug you've reported. the new asdfree doesn't have hmda data yet, but it's on my to-do list. thanks

ajdamico / lodown

hmda further back? #131

Current State

`get_catalog_hmda`

`lodown_hmda`

data download and reading of data

To Do: extend to years before 2006

Example: data for 2004