cfpb / regtech-user-fi-management

REST API for user and institution data in CFPB's RegTech systems
Creative Commons Zero v1.0 Universal
0 stars 1 forks source link

Create alembic script for institutions table update #48

Closed lchen-2101 closed 10 months ago

lchen-2101 commented 11 months ago

the fields that need to be added and modified:

lei tax_id rssd respondent_name (modified from name) parent_id_rssd parent_name top_holder_rssd top_holder_name prim_fed_reg street_line1 street_line2 city state_cd zip_cd parent_lei top_holder_lei fi_type

hkeeler commented 11 months ago

A few thoughts and questions. Just trying to make the column names more explicit, and keep things flexible for the future. Some of these are probably questions for @nongarak and @Kibrael.

  1. I agree name is probably not sufficient, but respondent_name feels like HMDA-specific terminology. How about legal_name, which is what the GLEIF data uses? I feel like we should lean into their terminology since we're using LEI as our primary identifier.

  2. I'd like to prefix the address fields in case we need to support multiple address in the future. GLEIF data has LegalAddress and HeadquartersAddress (plus 5 additional "other" address). I think we've said we want headquarters address, so something like:

    • hq_address_street_1
    • hq_address_street_2
    • hq_address_city
    • hq_address_state
    • hq_address_zip

    ...but if it's actually LegalAddress we're basing this off of, then legal_*.

  3. What does the _cd signify on state_cd and zip_cd?

  4. When we say tax_id, do we mean tin, which can be one of a few different ID types, or are we specifically talking about Employer Identification Number (EIN)? If the latter, do we want to name it as such? Or name it federal_tax_id, which sounds like it is synonymous with EIN?

  5. Looking at NIC's search website, it seems like it's generally referred to as RSSD ID. So...

    • rssd -> rssd_id
    • parent_id_rssd -> parent_rssd_id
    • top_holder_rssd_id
  6. I assume there's going to be a corresponding federal_regulator table. I think it's a nice convention to have to column names match corresponding lookup table, so something like primary_federal_regulator.

  7. Similarly, I assume we'll have an institution_type table, so I think the fi_type should match up there too.

  8. How do we feel about US-centric address fields? Is every entry in this database going to have a US address? I'm guessing yes if this is only for institutions who are filing, but if we wanted to include the parent/top-holder records, and link them by foreign key, the answer may turn into no. In past projects I've used something like the following, which also lines up with GLEIF's.

    • hq_address_state -> hq_address_region
    • hq_address_zip -> hq_address_postal_code
    • ...addition of hq_address_country

    ...also, not everything is a state, even if this is always going to be US addresses. For one, there's...

    • "District of Columbia (United States of America)","US-DC","COUNTRY_AND_SUBDIVISION"

    ...and GLEIF considers all the rest of the US territories to be countries, not regions within the US.

    • "Guam","GU","COUNTRY_ONLY"
    • "Puerto Rico","PR","COUNTRY_ONLY"
    • "Virgin Islands (U.S.)","VI","COUNTRY_ONLY"
    • etc.

    Are we going to follow that, adding a _country column? If yes, _state/_region will need to be nullable, which it wouldn't otherwise.

    I have mixed feelings about this one. I like that it lines up with GLEIF and is more open, but I suspect this will be 99% US addresses, and adding the extra flexibility/abstraction may lead to initial what's a region? type question when someone first comes to understand this data.

    See GLEIF Accepted Legal Jurisdictions Code List for how represent this data.

Kibrael commented 11 months ago
  1. Agree with Hans
  2. Agree with Hans
  3. _cd means code
  4. @nongarak Can you compare between HMDA and SBL to find out the answer for each of these? I think HMDA is EIN, but I am not sure.
  5. No opinion
  6. I don't understand the need for a federal regulator table. I think this is just the primary federal regulator attribute from NIC currently.
  7. institution_type is the HMDA field, there is a differently defined column for SBL FI type. They do not have the same values.
  8. This might take some investigation. We need to accommodate for all US territories, which is bigger than the HMDA scope (US states + PR). For HMDA, I smashed the PR country code into the state column for the Panel data.

    There is some patterning in how they do the state coding in GLEIF, IE "US-DC", I think the US prefix is standard and can be used in patterns.

@nongarak can you post the answer Larry gave us about foreign owned/controlled entities that are covered for US reporting?

hkeeler commented 11 months ago
  1. I don't understand the need for a federal regulator table. I think this is just the primary federal regulator attribute from NIC currently.

By having a separate lookup table we get a few things.

  1. Guaranteed data integrity. You can only ever put a value in that field if it matches one of the values in that table.
  2. That's where you put the various metadata about each of the regulators. We may want the full name of the regulator in the SBL app. If we don't keep that in a table, we'll have to keep that logic in the app...and we don't want that.
  1. institution_type is the HMDA field, there is a differently defined column for SBL FI type. They do not have the same values.

Yeah. We should figure out how to differentiate those. institution_type and fi_type in the same table is too confusing in my book. We could...

  1. This might take some investigation. We need to accommodate for all US territories, which is bigger than the HMDA scope (US states + PR). For HMDA, I smashed the PR country code into the state column for the Panel data. There is some patterning in how they do the state coding in GLEIF, IE "US-DC", I think the US prefix is standard and can

đź‘Ť Yep. I think the the ISO-3166 standard. To add to the fun there, it sounds like the standard allows for those US subdivisions to be either their own country (VI) or under the US (US-VI). It seems like GLEIF is going with the first option. Be curious if that is consistent throughout. Maybe that's why they have their own spreadsheet of jurisdictions so there's no ambiguity.

hkeeler commented 11 months ago

@Kibrael and I chatted about some of this...

  1. I think we're good on why we'd want a lookup table for federal_regulator and similar tables.
  2. We're going to go with hmda_institution_type and sbl_institution_type.
  3. We're going to keep it simple to start, and just put US territories into a state_or_territory column, and not have a country column.
hkeeler commented 11 months ago

Here's my take on the current state of things here. There's still a fair number of questions, but nothing that is a show-stopper from starting work on all this. I'll work with @Kibrael and @nongarak to try to get the rest answered over the coming days.

Below are the tables we'll need for this work. I've included the data for the lookup tables when available.


institution table

column datatype constraints notes / questions
lei CHAR(20) required, unique Table's primary key
legal_name VARCHAR required
tax_id CHAR(9) âť“ âť“required, unique Q1: Do we know if this is only EIN (12-3456789) format, or will SSN-formatted IDs (123-45-6789) come through as well? Q2: Do we strip dashes and just have this be 9 int digits?
rssd_id INTEGER âť“required, unique
primary_federal_regulator_id VARCHAR(4) âť“ Foreign key to federal_regulator table.
hmda_institution_type_id CHAR(âť“) âť“ Foreign key to hmda_institution_type table.
sbl_institution_type_id VARCHAR âť“ âť“ Foreign key to sbl_institution_type table.
hq_address_street_1 VARCHAR required Q: Do we need more than 2 street_x columns? @nongarak's demo showed how GLEIF's address data can has up to 5 (I think), and the first is frequently filled with C/O type data, not actual address data. Or will downstream NIC/GLEIF data munging get us down to just 2 useful fields?
hq_address_street_2 VARCHAR
hq_address_city VARCHAR
hq_address_state CHAR(2) Foreign key to address_state table
hq_address_zip CHAR(5) âť“ required Q: Do we need to support 12345-6789 extended zips? Q: Do all territories have "zip codes"? Should we make this a more universal hq_address_postal_code?
parent_lei CHAR(20)
parent_legal_name VARCHAR âť“ Q: Required if parent_lei or parent_rssd_id is set?
parent_rssd_id INTEGER
top_holder_lei CHAR(20)
top_holder_legal_name VARCHAR âť“ Q: Required if parent_lei or parent_rssd_id is set?
top_holder_rssd_id INTEGER

federal_regulator table

column datatype constraints notes / questions
id VARCHAR(4) required, unique
name VARCHAR required, unique

As for the data we insert into this lookup table, I think we can just follow what's in NIC. Per page 26 of NIC BULK DATA DOWNLOAD DATA DICTIONARY AND REFERENCE GUIDE

The Primary Federal Regulator column contains the name of the agency that is the primary regulator of a depository institution, non-deposit trust company, bank holding company, thrift holding company, farm credit bank, or federal housing enterprise. The following agencies are represented:

FCA = Farm Credit Administration FDIC = Federal Deposit Insurance Corporation FHFA = Federal Housing Finance Agency FRS = Federal Reserve System NCUA = National Credit Union Administration OCC = Office of the Comptroller of the Currency OTS = Office of Thrift Supervision (only valid until July 21, 2011)

Q: Should CFPB be in that list too?

hmda_institution_type table

column datatype constraints notes / questions
id âť“ required, unique Table's primary key
name VARCHAR required, unique

@Kibrael, what values will go in this table? Is this the same as NIC's ENTITY_TYPE? Per page 14 of Per page 26 of NIC BULK DATA DOWNLOAD DATA DICTIONARY AND REFERENCE GUIDE:

The Entity Type field is derived from other fields. The entity types listed below are in alphabetic sequence by entity code.

AGB - Agreement Corporation - Banking AGI - Agreement Corporation - Investment BHC - Bank Holding Company ...

sbl_institution_type table

column datatype constraints notes / questions
id âť“ required, unique Table's primary key
name VARCHAR required, unique

@Kibrael, what values go in this table?

address_state table

column datatype constraints notes / questions
code CHAR(2) required, unique Table's primary key.
name VARCHAR required, unique

Values here should be pretty self-explanatory. All 50 states and D.C. and territories...

code name
AS American Samoa
DC District of Columbia
GU Guam
MP Northern Mariana Islands
PR Puerto Rico
UM United States Minor Outlying Islands
VI Virgin Islands, U.S.

...assuming we're going with ISO-3166.

nongarak commented 10 months ago

Whoops, sorry for the delay. Turns out I did not have notifications turned on for GH Public. I'm glad you've answered the hard questions without me though! Some thoughts:

re: TIN vs. SSN. I'll ask Regs, but you're right - sole props / single member LLCs can use SSN instead of TIN. I doubt we'll have that many of them, but it's a case worth considering. We probably couldn't publish those.

re: street address columns. I think it's probably best if we include all of them, as we will be pulling it straight from GLEIF. Long term maybe we could do some data munging, but I don't think it's a priority and also there is zero consistency in how folks submit this data.

re: ZIP code. There is nothing in the reg specifying what HQ address means, so I think we can safely suck in whatever GLEIF has. If we can suck in the full zip5+4 code for their confirmation purposes, we can later strip it to zip5 for any other analysis/reporting purposes if we need to. All US territories also use ZIP code. I don't think there's a need for calling it "postal code"

re: parent/top_holder_legal_name - definitely required if the parent RSSD or LEI are set, but also can be set without those things (not all parents will have RSSD or LEI).

re: should CFPB be on the prim fed reg list - that's definitely a @Kibrael question, but if there's no decisive answer we can ask Regs. My gut says no, because I feel like PFR has some definition somewhere that we aren't aware of, but also CFPB has "primary authority to enforce federal consumer financial laws for banks and other depository institutions with total assets of more than $10 billion, and their affiliates" (though I'm not sure what the difference between primary supervision/enforcement authority and PFR might be).

re: HMDA institution type code: I haven't parsed through it all, but this ticket outlines the logic used to describe it (on GHE)

re: SBL Type of Financial Institution - here is what the reg says the options are.

Paragraph 109(b)(9).

  1. Type of financial institution. A financial institution complies with § 1002.109(b)(9) by selecting the applicable type or types of financial institution from the list below. A financial institution shall select all applicable types. i. Bank or savings association. ii. Minority depository institution. iii. Credit union. iv. Nondepository institution. v. Community development financial institution (CDFI). vi. Other nonprofit financial institution. vii. Farm Credit System institution. viii. Government lender. ix. Commercial finance company. x. Equipment finance company. xi. Industrial loan company. xii. Online lender. xiii. Other

  2. Use of “other” for type of financial institution. A financial institution reports type of financial institution as “other” where none of the enumerated types of financial institution appropriately describe the applicable type of financial institution, and the institution reports the type of financial institution via free-form text field. A financial institution that selects at least one type from the list is permitted, but not required, to also report “other” (with appropriate free-form text) if there is an additional aspect of its business that is not one of the enumerated types set out in comment 109(b)(9)-1. 3. Additional types of financial institution. The Bureau may add additional types of financial institutions via the Filing Instructions Guide and related materials. Refer to the Filing Instructions Guide for any updates for each reporting year.

I guess that means we need to add another column for free form text field to your table, @hkeeler

re: @Kibrael's earlier question about the readout from Larry on foreign controlled entities, he said there should be a US HQ address that they would report. So a big international bank (like HSBC or whatever) would report their US entity information.

nongarak commented 10 months ago

Other notes: GLEIF names the address data as headquartersaddress_firstaddressline, _additionaladdressline1, etc. which is where I originally got hq_address and hq_addressline1, etc. from. I think we should stick to calling it 'addresslineX' instead of hq_address_street_X because the additional address lines coming in from GLEIF are not actually all street addresses - there's the C/Os, the suite numbers, the floors, etc. and unless and until we're willing to devise some way of cleaning that info, I think we should just call it what GLEIF calls it since we're just sucking it in and displaying it on the page. Might as well have the internal variable name match the external variable name. For reference, here's the snipped of the SQL call I made:

                       , headquartersaddress_firstaddressline       AS hq_address
                       , headquartersaddress_additionaladdressline1 AS hq_addresline1
                       , headquartersaddress_additionaladdressline2 AS hq_addressline2
                       , headquartersaddress_additionaladdressline3 AS hq_addressline3

If you want _ between every word like hq_address_line_1 that's fine with me.

nongarak commented 10 months ago

I went through and changed all of the column names in my code to match the names hans gave above (except the address naming convention). The final list is this, in order of data point in the rule:

  1. lei_lei
  2. legal_name_lei
  3. hq_address_lei
  4. hq_address_line_1_lei
  5. hq_address_line_2_lei
  6. hq_address_line_3_lei
  7. hq_address_city_lei
  8. hq_address_state_lei
  9. hq_address_country_lei
  10. hq_address_zip_lei
  11. email_domain_hmda
  12. primary_federal_regulator_id_nic
  13. tax_id_hmda
  14. type_of_fi
  15. type_of_fi_other
  16. rssd_id_nic
  17. parent_legal_name_lei
  18. parent_lei_lei
  19. parent_rssd_id_nic
  20. top_holder_legal_name_lei
  21. top_holder_lei_lei
  22. top_holder_rssd_id_nic
nongarak commented 10 months ago

@nargis-sultani you're right, technically we don't need the domains field in this dataset. I have that all cleaned up and parsed out in a separate table and that's probably more appropriate, I can remove it from the current iteration of the dataset here.

Kibrael commented 10 months ago