Thanks for entering this issue, @dorothyyeager! After speaking with @PaulClark2, we need organization type and designation in addition to committee type to filter by SSFs and Leadership PACs. We'd need to add these to Schedule A in order to filter by them, which is a big lift.
For now, I think we should remove those options from the receipts/individual contributions datatables, since they don't work.
I'll go ahead and copy this to the CMS to remove the filters as a short-term workaround.
This includes joint fundraising committees; for the statistical releases, we exclude them.
Separate Segregated Funds (SSFs):
Corporate: committee types N, Q or U, committee designations B, J or U, organization type C
Labor: committee types N, Q or U, committee designations B, J or U, organization type L
Trade: committee types N, Q or U, committee designations B, J or U, organization type T
Membership: committee types N, Q or U, committee designations B, J or U, organization type M, connected organization name NOT NULL and not equal to NONE
Cooperative: committee types N, Q or U, committee designations B, J or U, organization type V
Corporations without stock: committee types N, Q or U, committee designations B, J or U, organization type W
Nonconnected Committees:
Independent expenditure-only political committees (Super PACs): committee type O, committee designations B, J or U
Committees with non-contribution accounts (Hybrid PACs): committee types V or W, committee designations B, J or U
Leadership PACs: committee types N or Q, committee designation D
Other nonconnected PACs: committee types N, Q or U, committee designations B, J or U, and organization type IS NULL OR (organization type M AND connected organization name IS NULL) OR (organization type M AND connected organization name = 'NONE')
This SQL is similar to what I use for the statistical releases; it includes joint fundraising committees.
SELECT
(CASE WHEN cm.org_tp = 'C' THEN '1-Corporate'
WHEN cm.org_tp = 'L' THEN '2-Labor'
WHEN cm.org_tp = 'T' THEN '3-Trade'
WHEN ((cm.org_tp IS NULL) OR (cm.org_tp = 'M' AND cm.connected_org_nm IS NULL) OR (cm.org_tp = 'M' AND cm.connected_org_nm = 'NONE')) THEN '7-NonConnected'
WHEN cm.org_tp = 'M' AND cm.connected_org_nm IS NOT NULL AND cm.connected_org_nm <> 'NONE' THEN '4-Membership'
WHEN cm.org_tp = 'V' THEN '5-Cooperative'
WHEN cm.org_tp = 'W' THEN '6-Corporation without stock' END) organization_type,
cm.cmte_id,
cm.cmte_nm,
cm.cmte_tp,
cm.cmte_dsgn,
cm.org_tp
FROM cmte_valid_fec_yr cm
WHERE cm.fec_election_yr = 2020
AND cm.cmte_tp in ('N', 'Q', 'U')
AND cm.cmte_dsgn not in ('D');
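For completeness, here is a sketch that extends the same CASE logic to the nonconnected categories listed above (Super PACs, Hybrid PACs, Leadership PACs) instead of filtering them out. This is only an illustration of the type/designation rules in this thread, not the statistical-release SQL, and the category labels are mine:

-- Sketch only: classify committees into the SSF and nonconnected categories above.
SELECT
 (CASE WHEN cm.cmte_tp = 'O' THEN 'Super PAC'
       WHEN cm.cmte_tp IN ('V', 'W') THEN 'Hybrid PAC'
       WHEN cm.cmte_tp IN ('N', 'Q') AND cm.cmte_dsgn = 'D' THEN 'Leadership PAC'
       WHEN cm.org_tp = 'C' THEN 'Corporate SSF'
       WHEN cm.org_tp = 'L' THEN 'Labor SSF'
       WHEN cm.org_tp = 'T' THEN 'Trade SSF'
       WHEN cm.org_tp = 'M' AND cm.connected_org_nm IS NOT NULL AND cm.connected_org_nm <> 'NONE' THEN 'Membership SSF'
       WHEN cm.org_tp = 'V' THEN 'Cooperative SSF'
       WHEN cm.org_tp = 'W' THEN 'Corporation without stock SSF'
       ELSE 'Other nonconnected' END) category,
 cm.cmte_id,
 cm.cmte_nm
FROM cmte_valid_fec_yr cm
WHERE cm.fec_election_yr = 2020
AND cm.cmte_tp in ('N', 'Q', 'U', 'O', 'V', 'W')
AND cm.cmte_dsgn in ('B', 'D', 'J', 'U');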
The CMS issue is already here: https://github.com/fecgov/fec-cms/issues/2913
We should consider asking the database team whether they'd rather do this at the same time as the DB updates for https://github.com/fecgov/openFEC/issues/3798
org_tp and cmte_dsgn are in cmte_valid_fec_yr
The org_tp and cmte_dsgn columns have been added to the fec_fitem_sched_a and fec_fitem_sched_b tables. It is better to keep these concise column names.
@fecjjeng sounds good! I think the WIP PR has those column names. https://github.com/fecgov/openFEC/pull/3837/files
The following is a summary of the steps that need to be done for this ticket. A more detailed document named ADD_UPDATE_ORG_TP_COLUMNS.txt is in the shared drive.
Please note: this is a general list of the work needed to add columns to the sched_a/sched_b tables. The actual code changes will vary depending on the nature of the columns added.
--
-- Prepare migration script. The migration script needs to include:
--
If the added column needs to be used as a filter, then add CREATE INDEX statements. If these are B-tree indexes, then include the sorting fields (disb_dt/sub_id and disb_amt/sub_id) to make compound indexes (see the sketch below).
Also update the following functions as appropriate (they can be used to create indexes for the corresponding fec_sched_a_xxxx_yyyy/fec_sched_b_xxxx_yyyy tables in the future): FUNCTION disclosure.finalize_itemized_schedule_a_tables, FUNCTION disclosure.finalize_itemized_schedule_b_tables
NOTE: These columns will be added to all 3 cloud environments ahead of time, since the migration file will be submitted and merged AFTER the actual tasks (add column, backfill data, create indexes) have been done. Therefore it needs to be written in a specific way so it will not error out IF the columns and indexes already exist.
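As an illustration of the two notes above, here is a sketch of what the migration statements could look like. The partition, column types, and index names are assumptions (following the idx_schedb naming pattern used elsewhere in this thread), not the actual migration:

-- Sketch only (names/types are assumptions): add the new columns idempotently so the
-- migration will not error out if the columns already exist.
ALTER TABLE disclosure.fec_fitem_sched_b_2019_2020
    ADD COLUMN IF NOT EXISTS org_tp character varying(1),
    ADD COLUMN IF NOT EXISTS cmte_dsgn character varying(1);

-- Sketch only: compound B-tree indexes pairing the new filter column with the
-- existing sort keys (disb_dt/sub_id and disb_amt/sub_id).
CREATE INDEX IF NOT EXISTS idx_schedb_2019_2020_org_tp_disb_dt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (org_tp, disb_dt, sub_id);
CREATE INDEX IF NOT EXISTS idx_schedb_2019_2020_org_tp_disb_amt_sub_id
    ON disclosure.fec_fitem_sched_b_2019_2020 (org_tp, disb_amt, sub_id);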
--
-- Update Oracle packages, procedures, triggers, tables, materialized views, and views in the intermediate database that calculate and control the Java transfer program.
--
-- Update Java programs and shell scripts.
--
-- Add the columns to the table, start the daily transfer program, and make sure all Oracle programs, Java programs, and shell scripts execute successfully.
--
-- Prepare scripts to backfill the data (see the sketch below).
--
-- *** -- on Postgresql Database
-- In this ticket, we also backfill the data for several tsvector columns whose content changed in issue #3798 (3 columns in sched_a and 2 in sched_b).
-- Temporary update triggers were added to each sched_a and sched_b partition table, so during the update activity of this ticket the update triggers will fire and refresh the tsvector columns.
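A minimal sketch of the per-partition backfill described above, assuming org_tp/cmte_dsgn are looked up in cmte_valid_fec_yr by cmte_id and election year (the partition name and join key are assumptions):

-- Sketch only: backfill one partition from cmte_valid_fec_yr; the temporary update
-- triggers will refresh the tsvector columns as a side effect of this UPDATE.
UPDATE disclosure.fec_fitem_sched_a_2019_2020 sa
SET org_tp    = cm.org_tp,
    cmte_dsgn = cm.cmte_dsgn
FROM cmte_valid_fec_yr cm
WHERE cm.cmte_id = sa.cmte_id
  AND cm.fec_election_yr = 2020;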
--
-- ON AWS console
-- Temporarily upgrade the class of our Postgresql database to a more powerful machine.
-- NOTE: need to upgrade both master and replica. Otherwise a slow replica will not be able to catch up; this will slow down the whole process, and the lag time will be too big to recover.
-- Parameters WAL_MAX_SIZE, WAL_MIN_SIZE and WAL_KEEP_SEGMENTS will need to be increased as well.
-- Also need to anticipate increased storage size. An UPDATE operation in Postgresql increases the size a lot, and the indexes need space too.
--
-- *** -- on Postgresql Database
-- get "BEFORE UPDATE" size select pg_size_pretty(pg_database_size('fec')); -- *** -- on Linux server -- Execute the shell script to do the update work.(may not be able to finish all job at one time, need to break into smaller tasks) -- NOTE: need to adjust the execution time to avoid nightly refresh time for all databases -- NOTE: need to adjust the execution time to avoid peak query time for PRD database
-- *** -- on Postgresql Database
-- get "AFTER UPDATE" size select pg_size_pretty(pg_database_size('fec'));
-- **** -- validate the update
-- **** There are a few rows whose cmte_id does not have corresponding data in cmte_valid_fec_yr. These are all in earlier cycles; data from 2007_2008 onward is complete with cmte_dsgn/org_tp data.
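Those rows can be located with a query along these lines (a sketch only; the partition name and join key are assumptions):

-- Sketch only: itemized rows in an early-cycle partition whose cmte_id has no
-- matching committee record for that election year.
SELECT sa.sub_id, sa.cmte_id
FROM disclosure.fec_fitem_sched_a_2005_2006 sa
LEFT JOIN cmte_valid_fec_yr cm
       ON cm.cmte_id = sa.cmte_id
      AND cm.fec_election_yr = 2006
WHERE cm.cmte_id IS NULL;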
For those few rows, the tsvector columns are not updated. Use the following statement to update them.
-- **** -- Run create indexes script
-- **** -- check if all indexes are created:
select tablename, count(*) from pg_indexes where tablename like 'fec_fitem_sched_a%' and (indexname like 'idx_scheda%_org_tp%' or indexname like 'idx_scheda%_cmte_dsgn%') group by tablename order by tablename;
select tablename, indexname from pg_indexes where tablename like 'fec_fitem_sched_a%' and (indexname like 'idx_scheda%_org_tp%' or indexname like 'idx_scheda%_cmte_dsgn%') order by tablename, indexname;
-- **** -- clean up the temporary UPDATE TRIGGERS that were used to update the tsvector columns from another issue
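For the trigger cleanup, a sketch of the per-partition statement (the trigger name here is hypothetical; the real names come from the earlier #3798 work):

-- Sketch only: drop the temporary tsvector-update trigger from one partition.
DROP TRIGGER IF EXISTS tri_fec_fitem_sched_a_2019_2020_tsvector_update
    ON disclosure.fec_fitem_sched_a_2019_2020;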
--
-- ON AWS console
-- downgrade the class back to the regular class we use.
-- NOTE: need to change the class of both master and replica. Otherwise a slow replica will not be able to catch up; this will slow down the whole process, and the lag time will be too big to recover.
-- Parameters WAL_MAX_SIZE, WAL_MIN_SIZE and WAL_KEEP_SEGMENTS will need to be set back to their original values.
-- Review the storage size increase, make it official via Terraform.
--
The DEV/STG/PRD databases have been backfilled and the indexes created. The following is a summary of the time consumed and space added by this task. NOTE: the time indicated is the actual running time, using the most powerful class of AWS database we can use, and does NOT include the lag time left in between to avoid peak time and nightly refresh time, so it is a minimum. The local database is usually 2 to 3 times faster than the cloud databases in the regular classes.
Local backup database: Time: 57 hours
DEV: Time: 22 hours 48 minutes; Space increased: 442 GB
STG: Time: 17 hours 26 minutes; Space increased: 458 GB
PRD: Time: 19 hours 31 minutes; Space increased: 315 GB
Unblocked #3841 and moved it to the ready column.
A CMS ticket, https://github.com/fecgov/fec-cms/issues/3054, has been opened to add the filter back on.
User story
I was trying to search all contributions and disbursements made to corporate SSFs in 2019-2020 and set the filters accordingly. The results appeared to yield a list of contributions made to all types of committees. The first results in this screenshot, for example, show contributions to a candidate and to a labor organization SSF.
To do
Add org_tp (call it org_type) and cmte_dsgn (call it committee_designation) columns to the Schedule A and B tables.
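If it helps to picture the rename, the concise DB names could be surfaced under the requested labels with simple aliases, e.g. (a sketch only; the actual mapping happens in the openFEC API layer, and the partition name here is an assumption):

-- Sketch only: expose the concise column names under the requested API-facing names.
SELECT sa.org_tp AS org_type, sa.cmte_dsgn AS committee_designation, sa.*
FROM disclosure.fec_fitem_sched_a_2019_2020 sa
LIMIT 10;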