Closed floswald closed 11 months ago
hi, i think it needs to be done by hand? the positions are in the layouts pdfs, for example https://catalog.archives.gov/id/2580727
i eventually intend to automate archive.gov querying using https://stackoverflow.com/questions/48463895/how-to-scrape-all-files-in-a-catalog-series-from-the-national-archives-archives
i also need to remove monetdblite from the function and instead do something like nppes http://asdfree.com/national-plan-and-provider-enumeration-system-nppes.html
all of this is a low priority for me unless you have funding. whatever you can contribute would be appreciated (sharing your layouts file would be a huge help since it's just a data entry task). thanks
hi again, I may have some manpower to put onto this job. I would need to get a bit more info on the data structure though, so I would be able to explain the task. I'd propose to just post a couple of snippets here for my own benefit and you can comment on them?
get_catalog_hmda
gets the references for the most recent available years from hmda:
> h[1:15,]
year type full_url merge_table dbfolder db_tablename
1 2006 hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2006HMDAlar%20-%20National.zip hmda_2006 ./MonetDB hmda_lar_2006
12 2006 pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2006PMIClar%20-%20National.zip pmic_2006 ./MonetDB pmic_lar_2006
20 2006 hmda_inst https://www.ffiec.gov/hmdarawdata/OTHER/2006HMDAinstitutionrecords.zip hmda_2006 ./MonetDB hmda_inst_2006
31 2006 pmic_inst https://www.ffiec.gov/pmicrawdata/OTHER/2006PMICinstitutionrecords.zip pmic_2006 ./MonetDB pmic_inst_2006
2 2007 hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2007HMDAlar%20-%20National.zip hmda_2007 ./MonetDB hmda_lar_2007
13 2007 pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2007PMIClar%20-%20National.zip pmic_2007 ./MonetDB pmic_lar_2007
21 2007 hmda_inst https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAinstitutionrecords.zip hmda_2007 ./MonetDB hmda_inst_2007
32 2007 pmic_inst https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICinstitutionrecords.zip pmic_2007 ./MonetDB pmic_inst_2007
39 2007 hmda_reporter https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAReporterPanel.zip <NA> ./MonetDB hmda_reporter_2007
49 2007 pmic_reporter https://www.ffiec.gov/pmicrawdata/OTHER/2007PMICReporterPanel.zip <NA> ./MonetDB pmic_reporter_2007
56 2007 hmda_msa https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip <NA> ./MonetDB hmda_msa_2007
66 2007 pmic_msa https://www.ffiec.gov/hmdarawdata/OTHER/2007HMDAMSAOffice.zip <NA> ./MonetDB pmic_msa_2007
3 2008 hmda_lar https://www.ffiec.gov/hmdarawdata/LAR/National/2008HMDAlar%20-%20National.zip hmda_2008 ./MonetDB hmda_lar_2008
14 2008 pmic_lar https://www.ffiec.gov/pmicrawdata/LAR/National/2008PMIClar%20-%20National.zip pmic_2008 ./MonetDB pmic_lar_2008
22 2008 hmda_inst https://www.ffiec.gov/hmdarawdata/OTHER/2008HMDAinstitutionrecords.zip hmda_2008 ./MonetDB hmda_inst_2008
lodown_hmda
system.file("extdata", "hmda/ins_str.csv", package = "lodown")
and system.file("extdata", "hmda/lar_str.csv", package = "lodown")
> lar_str
field type
1 AsOfYear DOUBLE PRECISION
2 RespondentID VARCHAR(255)
3 AgencyCode VARCHAR(255)
4 LoanType DOUBLE PRECISION
5 PropertyType VARCHAR(255)
6 LoanPurpose DOUBLE PRECISION
7 Occupancy DOUBLE PRECISION
8 LoanAmount DOUBLE PRECISION
9 Preapproval VARCHAR(255)
10 ActionType DOUBLE PRECISION
11 MSA_MD VARCHAR(255)
12 StateCode VARCHAR(255)
13 CountyCode VARCHAR(255)
14 CensusTractNumber VARCHAR(255)
15 ApplicantEthnicity VARCHAR(255)
16 CoApplicantEthnicity VARCHAR(255)
17 ApplicantRace1 VARCHAR(255)
18 ApplicantRace2 VARCHAR(255)
...
office.lines
part is doing.ReporterPanel
differs for different periods. there is a switch at line 175 to choose the right script:if( catalog[ i , 'year' ] < 2010 ) sas_ri <- pr_str else sas_ri <- rp_sas
# read that temporary file directly into MonetDB,
# using only the sas importation script
read_SAScii_monetdb (
unzipped_files , # the url of the file to download
sas_ri , # the
zipped = FALSE , # the ascii file is stored in a zipped file
tl = TRUE , # convert all column names to lowercase
tablename = catalog[ i , 'db_tablename' ] ,
connection = db
)
institutionrecords
as well as lar%20-%20National
(the main LAR data) are structured with column delimiters in year 2006-now. (delimiters differ)hmda lar 2004
INPUT
Activity_Year 4
Respondent_ID $ 10
Agency_Code $ 1
Loan_Type $ 1
Loan_Purpose $ 1
...
where $
denotes string, and should be put if the Type
in the pdf says AN
(alphanumeric).
lar_2004.sas
to system.file("extdata", "hmda/lar_str.csv", package = "lodown")
if ( grepl( "inst|lar" , catalog[ i , 'type' ] ) ){
year==2004
use
read_SAScii_monetdb (
unzipped_files , # the url of the file to download
lar_2004.sas , # the sas script for 2004
zipped = FALSE , # the ascii file is stored in a zipped file
tl = TRUE , # convert all column names to lowercase
tablename = catalog[ i , 'db_tablename' ] ,
connection = db
)
hi, like i said, no monetdblite. maybe review the nppes post? thanks
oh, so you want to store everything as csv now?
fine by me. but for hmda you still are going to need to bits about the SAS scripts above, etc right?
I didnt' understand initially, so it seems you want to rewrite this entire script, not only to add on the earlier stuff?
yep, needs to be re-written. yep, sas structures still needed for import. i think lodown functions should only get the csv files and the structure files to the user's computer, and from there the asdfree.com page gives an example of how to import and work with a subset of columns like it does with the nppes example? thanks
hi! apologies for the long delay. i've made a couple of big updates to asdfree.com that hopefully make the website a bit better, but i've decided to stop maintaining the lodown package so probably won't fix the bug you've reported. the new asdfree doesn't have hmda data yet, but it's on my to-do list. thanks
hi @ajdamico ,
i just pulled hmda with lodown, works like a charm. i would need the data a bit further back, starting 2000 or even 1990. I tried to get something going myself but i'm not totally clear what i need to do. i know the urls of the LAR datasets. for 1981-1990 they look like
while 1991 seems to look like
From looking at your code I see that you create a sas like
Reporter_Panel_2010.sas
that tells you fieldwidths and you use that to read directly into the db. i was wondering how you created that? i look at the website of the 1991 file, for example, and i see that the layout is similar to what you get insystem.file("extdata", "hmda/lar_str.csv", package = "lodown")
. but i can't find anything regardign fieldwidths, because the"hmda/lar_str.csv"
doesn't have that info, right? anyway, if you have time to look at that, great, otherwise can you give me a bit more guidance? thanks!