ajdamico / lodown

locally download and prepare publicly-available microdata
GNU General Public License v3.0
97 stars 47 forks source link

hmda further back? #131

Closed floswald closed 11 months ago

floswald commented 6 years ago

hi @ajdamico ,

i just pulled hmda with lodown, works like a charm. i would need the data a bit further back, starting 2000 or even 1990. I tried to get something going myself but i'm not totally clear what i need to do. i know the urls of the LAR datasets. for 1981-1990 they look like

# https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/HMD_FACDSB82.txt
GREAT COUNTRY BK            010900201160060100090016020000000000000010001000000080000000000000000000000000000000000000000000  
GREAT COUNTRY BK            010900201160060900090016020000000000000000000000000000100010000000040000000000000000000000000000  

while 1991 seems to look like

# https://catalog.archives.gov/catalogmedia/lz/electronic-records/rg-082/hmda/HMS.F1990.LARS.zip
19910000035301B121000051NA  NANA NA     331200260          1
19910000035301B121000101NA  NANA NA     581400320          2
19910000035301B121000051NA  NANA NA     551200380          3

From looking at your code I see that you create a sas like Reporter_Panel_2010.sas that tells you fieldwidths and you use that to read directly into the db. i was wondering how you created that? i look at the website of the 1991 file, for example, and i see that the layout is similar to what you get in system.file("extdata", "hmda/lar_str.csv", package = "lodown"). but i can't find anything regardign fieldwidths, because the "hmda/lar_str.csv" doesn't have that info, right? anyway, if you have time to look at that, great, otherwise can you give me a bit more guidance? thanks!

ajdamico commented 6 years ago

hi, i think it needs to be done by hand? the positions are in the layouts pdfs, for example https://catalog.archives.gov/id/2580727

i eventually intend to automate archive.gov querying using https://stackoverflow.com/questions/48463895/how-to-scrape-all-files-in-a-catalog-series-from-the-national-archives-archives

i also need to remove monetdblite from the function and instead do something like nppes http://asdfree.com/national-plan-and-provider-enumeration-system-nppes.html

all of this is a low priority for me unless you have funding. whatever you can contribute would be appreciated (sharing your layouts file would be a huge help since it's just a data entry task). thanks

floswald commented 6 years ago

hi again, I may have some manpower to put onto this job. I would need to get a bit more info on the data structure though, so I would be able to explain the task. I'd propose to just post a couple of snippets here for my own benefit and you can comment on them?

Current State

get_catalog_hmda

gets the references for the most recent available years from hmda:

> h[1:15,]
   year          type                                                                      full_url merge_table  dbfolder       db_tablename
1  2006      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2006HMDAlar%20-%20National.zip   hmda_2006 ./MonetDB      hmda_lar_2006
12 2006      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2006PMIClar%20-%20National.zip   pmic_2006 ./MonetDB      pmic_lar_2006
20 2006     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2006HMDAinstitutionrecords.zip   hmda_2006 ./MonetDB     hmda_inst_2006
31 2006     pmic_inst        https://www.ffiec.gov/pmicrawdata/OTHER/2006PMICinstitutionrecords.zip   pmic_2006 ./MonetDB     pmic_inst_2006
2  2007      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2007HMDAlar%20-%20National.zip   hmda_2007 ./MonetDB      hmda_lar_2007
13 2007      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2007PMIClar%20-%20National.zip   pmic_2007 ./MonetDB      pmic_lar_2007
21 2007     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAinstitutionrecords.zip   hmda_2007 ./MonetDB     hmda_inst_2007
32 2007     pmic_inst        https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICinstitutionrecords.zip   pmic_2007 ./MonetDB     pmic_inst_2007
39 2007 hmda_reporter             https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAReporterPanel.zip        <NA> ./MonetDB hmda_reporter_2007
49 2007 pmic_reporter             https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICReporterPanel.zip        <NA> ./MonetDB pmic_reporter_2007
56 2007      hmda_msa                 https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip        <NA> ./MonetDB      hmda_msa_2007
66 2007      pmic_msa                 https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip        <NA> ./MonetDB      pmic_msa_2007
3  2008      hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2008HMDAlar%20-%20National.zip   hmda_2008 ./MonetDB      hmda_lar_2008
14 2008      pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2008PMIClar%20-%20National.zip   pmic_2008 ./MonetDB      pmic_lar_2008
22 2008     hmda_inst        https://www.ffiec.gov/hmdarawdata/OTHER/2008HMDAinstitutionrecords.zip   hmda_2008 ./MonetDB     hmda_inst_2008

lodown_hmda

> lar_str
                        field             type
1                    AsOfYear DOUBLE PRECISION
2                RespondentID     VARCHAR(255)
3                  AgencyCode     VARCHAR(255)
4                    LoanType DOUBLE PRECISION
5                PropertyType     VARCHAR(255)
6                 LoanPurpose DOUBLE PRECISION
7                   Occupancy DOUBLE PRECISION
8                  LoanAmount DOUBLE PRECISION
9                 Preapproval     VARCHAR(255)
10                 ActionType DOUBLE PRECISION
11                     MSA_MD     VARCHAR(255)
12                  StateCode     VARCHAR(255)
13                 CountyCode     VARCHAR(255)
14          CensusTractNumber     VARCHAR(255)
15         ApplicantEthnicity     VARCHAR(255)
16       CoApplicantEthnicity     VARCHAR(255)
17             ApplicantRace1     VARCHAR(255)
18             ApplicantRace2     VARCHAR(255)
...

data download and reading of data

if( catalog[ i , 'year' ] < 2010 ) sas_ri <- pr_str else sas_ri <- rp_sas

# read that temporary file directly into MonetDB,
# using only the sas importation script
    read_SAScii_monetdb (
        unzipped_files ,            # the url of the file to download
        sas_ri ,            # the 
        zipped = FALSE ,    # the ascii file is stored in a zipped file
        tl = TRUE ,     # convert all column names to lowercase
        tablename = catalog[ i , 'db_tablename' ] ,
        connection = db
        )

To Do: extend to years before 2006

Example: data for 2004

INPUT
Activity_Year 4
Respondent_ID $ 10
Agency_Code $ 1
Loan_Type $ 1
Loan_Purpose $ 1
...

where $ denotes string, and should be put if the Type in the pdf says AN (alphanumeric).

ajdamico commented 6 years ago

hi, like i said, no monetdblite. maybe review the nppes post? thanks

floswald commented 6 years ago

oh, so you want to store everything as csv now?

fine by me. but for hmda you still are going to need to bits about the SAS scripts above, etc right?

I didnt' understand initially, so it seems you want to rewrite this entire script, not only to add on the earlier stuff?

ajdamico commented 6 years ago

yep, needs to be re-written. yep, sas structures still needed for import. i think lodown functions should only get the csv files and the structure files to the user's computer, and from there the asdfree.com page gives an example of how to import and work with a subset of columns like it does with the nppes example? thanks

ajdamico commented 11 months ago

hi! apologies for the long delay. i've made a couple of big updates to asdfree.com that hopefully make the website a bit better, but i've decided to stop maintaining the lodown package so probably won't fix the bug you've reported. the new asdfree doesn't have hmda data yet, but it's on my to-do list. thanks