Open korelasidata opened 3 months ago
2023 data ingestion notes:
ffiec
WITH NO CHANGES
-> takes any file string for data and data dictionary
CHANGES NEEDED
-> need to change the year in the flat file string
-> the ffiec definition file name does not contain the same month each year, so the entire file name string might need to be changed every time
RESULT AFTER CHANGES
-> when looking quickly, everything looks the same
-> Outputs from value_counts are exactly the same which seems weird
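One possible way to follow up on the identical value_counts output is to confirm that the 2022 and 2023 flat files are not byte-for-byte the same file (for example, if the old path is still being read). This is only a sketch using the standard library; the helper names are hypothetical and the 2023 file name in the usage comment is an assumption, not something confirmed in these notes.

```python
import hashlib


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True when both files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)


# Hypothetical usage (the 2023 file name is assumed):
# same_file_contents("data/CensusFlatFile2022.csv", "data/CensusFlatFile2023.csv")
```

If this returns True, the matching value_counts output would be explained by the input rather than by the data itself.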
hmda
WITH NO CHANGES
-> takes any data folder string
CHANGES NEEDED
-> need to change the following hard-coded file names:
->> 2022_public_lar_csv.csv
->> 2022_public_ts_csv.csv
->> 2022_public_panel_csv.csv
->>> adding in code to take year from analysis year provided so number of inputs does not need to change.
-> msamda file needs to be updated; for now the 2022 msamda file will be used (hard-coded)
-> hard-coded 'data' as the folder joined with the name of the msamda file when reading this file in, while waiting on a response to the question above.
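A minimal sketch of deriving the three HMDA file names from the analysis year, so the number of inputs stays the same. It assumes later years keep the same `<year>_public_*_csv.csv` naming pattern as the 2022 files listed above; the function name is hypothetical.

```python
def hmda_file_names(analysis_year):
    """Build the HMDA file names for a given analysis year.

    Assumes the naming pattern of the 2022 files carries over to other years.
    """
    year = str(analysis_year)
    return {
        "lar": f"{year}_public_lar_csv.csv",
        "ts": f"{year}_public_ts_csv.csv",
        "panel": f"{year}_public_panel_csv.csv",
    }


# Example: hmda_file_names("2023")["lar"] gives "2023_public_lar_csv.csv"
```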
fdic
WITH NO CHANGES (institution label adder)
-> takes any file name string
CHANGES NEEDED
-> remove "c" from function name? (potentially no, because it looks like this is something in the actual dataset)
-> added new file path name in function call
WITH NO CHANGES (institution)
-> takes any filename string
CHANGES NEEDED
-> change file path name
RESULT
-> the results look similar upon initial glance
WITH NO CHANGES (locations)
-> takes any file string for definitions and locations
CHANGES NEEDED
sba
WITH NO CHANGES
-> takes url
CHANGES NEEDED
RESULT
-> output looks the same upon initial look
TO DO
@cbharp Great progress so far. Here are the next steps.
[ ] Confirm that the MSA/MD HMDA files are not needed. If so, comment out the MSA/MD HMDA references in lines 956-963
[x] For FFIEC, HMDA, and SBA data sources, compare the 2022 and 2023 data dictionaries and note any changes. Add a comment that includes your note about each source (regardless of whether changes are needed)
[ ] Create dictionary of file names by year. See example below.
analysis_year = '2023'
file_names_dict = {
    '2022': {
        'ffiec': {'data': 'CensusFlatFile2022.csv',
                  'dictionary': 'FFIEC_Census_File_Definitions_26AUG22.xlsx'
                  },
        'hmda': {'lar': '2022_public_lar_csv.csv',
                 'panel': '2022_public_panel_csv.csv',
                 'ts': '2022_public_ts_csv.csv'
                 },
        'cra': {'data': ['cra2022_Discl_D11.dat', ...],  # list all CRA filenames
                'dictionary': ''},
        'sba': {'data': 'foia-7afy2020-present-asof-230930.csv',
                'dictionary': '7a_504_foia-data-dictionary.xlsx'},
        'fdic': {'locations': {'data': 'locations.csv',
                               'dictionary': 'locations_definitions.csv'},
                 'institutions': {'data': 'institutions.csv',
                                  'dictionary': 'institutions_definitions.csv'
                                  }
                 }
    },
    '2023': {
        # Similar structure as 2022
    }
}
For example, here's how we would reference the file names throughout the notebook:
# FFIEC file name
ffiec_data_file = file_names_dict[analysis_year]['ffiec']['data']
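Building on the lookup above, here is one way the nested dictionary could be turned into full paths under a per-year data folder. This is a sketch, not the notebook's actual code: `resolve_paths`, `data_root`, and the trimmed `example_file_names` dictionary are hypothetical, and the `data/<year>/` layout is the one proposed later in this thread.

```python
from pathlib import Path

# Trimmed example using a few file names from the dictionary above.
example_file_names = {
    "2022": {
        "ffiec": {"data": "CensusFlatFile2022.csv"},
        "cra": {"data": ["cra2022_Discl_D11.dat"], "dictionary": ""},
        "fdic": {"locations": {"data": "locations.csv"}},
    }
}


def resolve_paths(file_names, analysis_year, *keys, data_root="data"):
    """Walk the nested dict by `keys` and prefix the result with data_root/year.

    Returns a Path for a single file name, or a list of Paths for a list of names.
    Raises ValueError for empty strings or for keys that stop at a sub-dictionary.
    """
    entry = file_names[analysis_year]
    for key in keys:
        entry = entry[key]
    year_dir = Path(data_root) / analysis_year
    if isinstance(entry, str) and entry:
        return year_dir / entry
    if isinstance(entry, list):
        return [year_dir / name for name in entry]
    raise ValueError(f"No file name found for {analysis_year} at {keys}")


ffiec_path = resolve_paths(example_file_names, "2022", "ffiec", "data")
```

Failing loudly on an empty entry (such as the blank CRA dictionary) makes a missing file name show up at lookup time instead of at read time.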
@cbharp I had a chance to compare the 2022 and 2023 SBA dictionaries. The mapping of categorical levels seems to be unchanged. I did note the following changes:
subpgmdesc from the 2022 dictionary was renamed to Subprogram.
FixedOrVariableInterestInd was added in 2023.
SOLDSECMRTIND does not appear in the old dictionary, but it appears in your code for 2022. The 2023 dictionary has a field called SoldSecMrktInd.
You can skip the SBA data dictionary comparison task.
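Since the field name differs between years (SOLDSECMRTIND in the 2022 code, SoldSecMrktInd in the 2023 dictionary), one option is to match the column by a case-insensitive prefix and map it to a single name. The helper below is a hypothetical sketch; it only builds the rename mapping, and the pandas call in the comment assumes the SBA data is held in a DataFrame named `df`.

```python
def rename_by_prefix(columns, prefix, new_name):
    """Return {old_name: new_name} for the one column starting with `prefix`.

    Matching is case-insensitive. Raises ValueError unless exactly one column matches,
    so an unexpected schema change is not silently ignored.
    """
    wanted = prefix.lower()
    matches = [col for col in columns if col.lower().startswith(wanted)]
    if len(matches) != 1:
        raise ValueError(f"Expected one column with prefix {prefix!r}, found {matches}")
    return {matches[0]: new_name}


# Hypothetical usage with pandas:
# df = df.rename(columns=rename_by_prefix(df.columns, "soldsecmr", "SOLDSECMRTIND"))
```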
@cbharp The new data dictionaries for FDIC institutions and FDIC locations are identical to the old. You can skip the FDIC data dictionary comparison task.
@cbharp The URLs for 2022 and 2023 HMDA data specifications are identical (i.e., they both point to the same website). You can skip the HMDA data dictionary comparison task.
@cbharp The FFIEC data changed one field. Please fix the following in the data ingestion pipeline. You can skip the FFIEC data dictionary comparison task.
@cbharp I made a minor change to the 'cra' slot in the file_names_dict definition in my comment. Please refresh the page to see the update.
@cbharp The SBA approval date filter is hard-coded to 2022. It needs to be determined dynamically from the Analysis Year.
@cbharp The filters for FDIC locations and institutions are also hard-coded to 2022. They need to be determined dynamically from the Analysis Year as well.
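A rough sketch of deriving the filter bounds from the analysis year instead of hard-coding 2022. It assumes the filters are by calendar year (not fiscal year), which should be checked against the existing code; the function names and the `ApprovalDate` column in the usage comment are assumptions.

```python
from datetime import date


def year_bounds(analysis_year):
    """Return the first and last calendar day of the analysis year."""
    year = int(analysis_year)
    return date(year, 1, 1), date(year, 12, 31)


def in_analysis_year(value, analysis_year):
    """True when a date falls inside the analysis year (bounds inclusive)."""
    start, end = year_bounds(analysis_year)
    return start <= value <= end


# Hypothetical usage with pandas (column name assumed):
# start, end = year_bounds(analysis_year)
# sba = sba[sba["ApprovalDate"].between(pd.Timestamp(start), pd.Timestamp(end))]
```

The same bounds could be reused for the FDIC locations and institutions filters so that all three sources follow `analysis_year`.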
@cbharp soldsecmr prefix. Rename as SOLDSECMRTIND.
@cbharp Upload all files exported in the data processing pipeline (the pickle objects at the end of run_data_ingestion.ipynb) to this Dropbox folder. Please do not upload the raw unprocessed files.
Overview for the next few tasks
Task 1: Download all of the 2023 data sources linked below.
NOTES:
Update the data/ folder file structure so that each analysis year has its own subfolder. For example, move all of the current 2022 data to the data/2022 subfolder, and create data/2023 for the 2023 analysis.
run-data-ingestion.ipynb and helper_funcs.py files.
Here are the links to the specific data sources
Task 2: Use the previously created code to run the data processing.
NOTE: Download the latest code from the GitHub repo first so that you are running the most up-to-date version.
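The per-year folder layout described under Task 1 could be set up with something like the sketch below. It is an illustration only: the function name is hypothetical, it moves just the loose files sitting directly in `data/`, and it leaves existing subfolders alone, so it is worth trying on a copy of the folder first.

```python
import shutil
from pathlib import Path


def move_into_year_folder(data_root, year):
    """Move files directly under data_root into data_root/<year>/.

    Subfolders (including other year folders) are not touched.
    Returns the names of the files that were moved.
    """
    root = Path(data_root)
    target = root / str(year)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for item in sorted(root.iterdir()):
        if item.is_file():
            shutil.move(str(item), str(target / item.name))
            moved.append(item.name)
    return moved


# Hypothetical usage:
# move_into_year_folder("data", "2022")                 # current files go to data/2022
# Path("data", "2023").mkdir(parents=True, exist_ok=True)  # empty folder for 2023 downloads
```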