KorelasiDataInsights / southern_dallas_progress


2023 data ingestion #32

Open korelasidata opened 3 months ago

korelasidata commented 3 months ago

Overview for the next few tasks

Task 1: Download all of the 2023 data sources linked below.

NOTES:

Here are the links to the specific data sources

[4 images]

Task 2: Use the previously created code to run the data processing.

NOTE: Download the latest code from the GitHub repo first so that you are running the most up-to-date version.

cbharp commented 3 months ago

2023 data ingestion notes:

cbharp commented 3 months ago

TO DO

korelasidata commented 3 months ago

@cbharp Great progress so far. Here are the next steps.

analysis_year = '2023'
file_names_dict = {
    '2022': {
        'ffiec': {'data': 'CensusFlatFile2022.csv',
                  'dictionary': 'FFIEC_Census_File_Definitions_26AUG22.xlsx'
                  },
        'hmda': {'lar': '2022_public_lar_csv.csv',
                 'panel': '2022_public_panel_csv.csv',
                 'ts': '2022_public_ts_csv.csv'
                 },
        'cra': {'data': ['cra2022_Discl_D11.dat', ...],  # list all CRA filenames
                'dictionary': ''},
        'sba': {'data': 'foia-7afy2020-present-asof-230930.csv',
                'dictionary': '7a_504_foia-data-dictionary.xlsx'},
        'fdic': {'locations': {'data': 'locations.csv',
                               'dictionary': 'locations_definitions.csv'},
                 'institutions': {'data': 'institutions.csv',
                                  'dictionary': 'institutions_definitions.csv'
                                  }
                 }  # closes 'fdic' (this brace was missing)
    },
    '2023': {
        # Similar structure as 2022
    }
}

For example, here's how we would reference the file names throughout the notebook

# FFIEC file name
ffiec_data_file = file_names_dict[analysis_year]['ffiec']['data']
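
While the '2023' entry is still a placeholder, a plain lookup fails with a bare KeyError. A small accessor can make that failure easier to read. This is only a sketch: `get_file_name` is a hypothetical helper, not something in the repo, and the dictionary is trimmed to two sources from the definition above.

```python
# Trimmed copy of file_names_dict, using only file names quoted above.
file_names_dict = {
    '2022': {
        'ffiec': {'data': 'CensusFlatFile2022.csv',
                  'dictionary': 'FFIEC_Census_File_Definitions_26AUG22.xlsx'},
        'hmda': {'lar': '2022_public_lar_csv.csv',
                 'panel': '2022_public_panel_csv.csv',
                 'ts': '2022_public_ts_csv.csv'},
    },
    '2023': {
        # Still to be filled in with the 2023 file names.
    },
}


def get_file_name(names, year, *keys):
    """Hypothetical helper: walk names[year][k1][k2]... with a readable error."""
    node = names[year]
    path = [year]
    for key in keys:
        path.append(key)
        if not isinstance(node, dict) or key not in node:
            raise KeyError('file_names_dict is missing ' + ' -> '.join(path))
        node = node[key]
    return node


analysis_year = '2022'
ffiec_data_file = get_file_name(file_names_dict, analysis_year, 'ffiec', 'data')
hmda_lar_file = get_file_name(file_names_dict, analysis_year, 'hmda', 'lar')
```

Switching `analysis_year` to '2023' before that entry is filled in would raise a KeyError naming the missing path, which makes an incomplete entry obvious early in the notebook.
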
korelasidata commented 3 months ago

@cbharp I had a chance to compare the 2022 and 2023 SBA dictionaries. The mapping of categorical levels seems to be unchanged. I did note the following changes:

You can skip the SBA data dictionary comparison task.

korelasidata commented 3 months ago

@cbharp The new data dictionaries for FDIC institutions and FDIC locations are identical to the old. You can skip the FDIC data dictionary comparison task.

korelasidata commented 3 months ago

@cbharp The URLs for 2022 and 2023 HMDA data specifications are identical (i.e., they both point to the same website). You can skip the HMDA data dictionary comparison task.

korelasidata commented 3 months ago

@cbharp One field changed in the FFIEC data between 2022 and 2023 (see the two data dictionaries below). Please update the data ingestion pipeline to match. You can skip the FFIEC data dictionary comparison task.

2022 data dictionary

[image]

2023 data dictionary

[image]
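
If the field lists of two data dictionaries are available as plain lists, a changed field can be spotted with a set comparison. This is a minimal sketch: `diff_fields` is a hypothetical helper and the field names are placeholders, not the actual FFIEC fields from the screenshots.

```python
def diff_fields(old_fields, new_fields):
    """Return (removed, added): names only in the old list, names only in the new list."""
    old_set, new_set = set(old_fields), set(new_fields)
    removed = [f for f in old_fields if f not in new_set]
    added = [f for f in new_fields if f not in old_set]
    return removed, added


# Placeholder names for illustration only.
fields_2022 = ['field_a', 'field_b', 'field_c']
fields_2023 = ['field_a', 'field_b_renamed', 'field_c']

removed, added = diff_fields(fields_2022, fields_2023)
print('Only in 2022:', removed)
print('Only in 2023:', added)
```
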
korelasidata commented 3 months ago

@cbharp I made a minor change to the 'cra' slot in the file_names_dict definition in my earlier comment. Please refresh the page to see the update.

korelasidata commented 3 months ago

@cbharp The SBA approval date filter is hard-coded to 2022. It needs to be determined dynamically from the analysis year (analysis_year).
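
A minimal sketch of deriving the date filter from the analysis year instead of a literal. It assumes the filter is by calendar year and that approval dates are `datetime.date` values; the record layout and the `approval_date` key are illustrative, not the notebook's actual column names.

```python
from datetime import date


def year_bounds(analysis_year):
    """Return the first and last calendar day of the given year (str or int)."""
    year = int(analysis_year)
    return date(year, 1, 1), date(year, 12, 31)


def filter_by_year(records, analysis_year, date_key='approval_date'):
    """Keep records whose date falls within the analysis year (inclusive)."""
    start, end = year_bounds(analysis_year)
    return [r for r in records if start <= r[date_key] <= end]


analysis_year = '2023'
# Made-up sample rows for illustration only.
loans = [
    {'loan_id': 1, 'approval_date': date(2022, 12, 31)},
    {'loan_id': 2, 'approval_date': date(2023, 1, 1)},
    {'loan_id': 3, 'approval_date': date(2023, 9, 30)},
]
loans_in_year = filter_by_year(loans, analysis_year)
print([r['loan_id'] for r in loans_in_year])
```

The same start and end values could feed whatever filtering the notebook already does, so only `analysis_year` has to change between runs.
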

[image]

korelasidata commented 3 months ago

@cbharp The filters for FDIC locations and institutions are also hard-coded to 2022. These need to be determined dynamically from the analysis year as well.
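
To catch any remaining hard-coded years, the pipeline code could be scanned for the literal. The sketch below is an assumption about how one might do that: `find_hardcoded_year` is a hypothetical helper, and the sample text is made up rather than taken from the notebook.

```python
import re


def find_hardcoded_year(source_text, year):
    """Return (line_number, line) pairs where the year appears as a standalone number."""
    # Lookarounds keep longer digit runs such as 120223 from matching.
    pattern = re.compile(r'(?<!\d)' + re.escape(str(year)) + r'(?!\d)')
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up snippet standing in for notebook code.
sample = """analysis_year = '2023'
df = df[df['year'] == 2022]
cutoff = '2022-12-31'
total = 120223
"""
for number, line in find_hardcoded_year(sample, 2022):
    print(number, line)
```

File names that embed the year (for example the entries in file_names_dict) would also be reported, so the hits still need a manual look.
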

[image] [image]

korelasidata commented 3 months ago

@cbharp

korelasidata commented 3 months ago

@cbharp Please upload all files exported by the data processing pipeline to this Dropbox folder.