fecgov / openFEC

The first RESTful API for the Federal Election Commission. We're aiming to make campaign finance more accessible for journalists, academics, developers, and other transparency seekers.
https://api.open.fec.gov/developers

(disclosure.fec_fitem_sched_a.sub_id) < (4101920051061418807) does not use the PK index; add a fec_fitem_sched_a.two_year_transaction_period filter plus UNION ALL (or "angular") to run ~60 times faster #4616

Open dzhang-fec opened 4 years ago

dzhang-fec commented 4 years ago

Summary: before the change, 30+ minutes; after the change, 1.6 minutes.

1) For this kind of "All Years" search, add disclosure.fec_fitem_sched_a.two_year_transaction_period IN (2020) to the SQL. This lets the query use the partition indexes we already have (Index Scan Backward using idx_sched_a_2019_2020_two_year_period_dt_sub_id).
2) Use UNION ALL to combine all of the years, so the SQL returns the same result set (see the sketch below). Chunk by 10 years, as @lbeaufort did for cmd_id.
3) When it is available, consider replacing (disclosure.fec_fitem_sched_a.sub_id) < (4101920051061418807) with the timestamp contb_receipt_dt.
4) Long term, when it is available: we get n times more data every 2 years, so I think we may have to implement an "Angular framework" for this kind of GUI pagination.
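A minimal sketch of what 1) and 2) could look like together, assuming the same filters, ORDER BY, and LIMIT as the slow query below; only two example cycles are shown, and the committee-history joins are omitted for brevity:

```sql
-- Sketch: constrain each branch to one two_year_transaction_period so its
-- partition index can be used, UNION ALL the per-cycle results, then re-apply
-- the original ORDER BY and LIMIT on the combined set.
SELECT *
FROM (
    SELECT *
    FROM disclosure.fec_fitem_sched_a
    WHERE is_individual = true
      AND contributor_employer_text @@ to_tsquery('SAIC:*')
      AND two_year_transaction_period IN (2020)
      AND sub_id < 4101920051061418807
    UNION ALL
    SELECT *
    FROM disclosure.fec_fitem_sched_a
    WHERE is_individual = true
      AND contributor_employer_text @@ to_tsquery('SAIC:*')
      AND two_year_transaction_period IN (2018)
      AND sub_id < 4101920051061418807
    -- ...one branch per additional cycle as needed...
) AS all_cycles
ORDER BY contb_receipt_dt DESC, sub_id DESC
LIMIT 100;
```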


Code suggestion is below (it requires a code change); see https://github.com/fecgov/openFEC/issues/4414.

Problem: the following query is slow, running 30+ minutes per our performance monitoring:

SELECT * FROM disclosure.fec_fitem_sched_a
LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1
  ON disclosure.fec_fitem_sched_a.cmte_id = ofec_committee_history_mv_1.committee_id
  AND disclosure.fec_fitem_sched_a.two_year_transaction_period = ofec_committee_history_mv_1.cycle
LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2
  ON disclosure.fec_fitem_sched_a.clean_contbr_id = ofec_committee_history_mv_2.committee_id
  AND disclosure.fec_fitem_sched_a.two_year_transaction_period = ofec_committee_history_mv_2.cycle
WHERE disclosure.fec_fitem_sched_a.is_individual = true
  AND disclosure.fec_fitem_sched_a.contributor_employer_text @@ to_tsquery('SAIC:*')
  AND (disclosure.fec_fitem_sched_a.sub_id) < (4101920051061418807)
ORDER BY disclosure.fec_fitem_sched_a.contb_receipt_dt DESC, disclosure.fec_fitem_sched_a.sub_id DESC
LIMIT 100
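One way to check whether suggestion 1) actually switches the plan to the partition index (a sketch, with the ofec_committee_history_mv joins left out to keep the focus on the base-table scan):

```sql
-- Sketch: EXPLAIN the filtered query and look in the plan for
-- "Index Scan Backward using idx_sched_a_2019_2020_two_year_period_dt_sub_id".
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM disclosure.fec_fitem_sched_a
WHERE is_individual = true
  AND contributor_employer_text @@ to_tsquery('SAIC:*')
  AND two_year_transaction_period IN (2020)
  AND sub_id < 4101920051061418807
ORDER BY contb_receipt_dt DESC, sub_id DESC
LIMIT 100;
```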


dzhang-fec commented 4 years ago

It happened again recently:

FEC-CLOUD-AWSAPP 4:30 PM
Sample of a long-running query (00:27:10):

SELECT * FROM disclosure.fec_fitem_sched_a
LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1
  ON disclosure.fec_fitem_sched_a.cmte_id = ofec_committee_history_mv_1.committee_id
  AND disclosure.fec_fitem_sched_a.two_year_transaction_period = ofec_committee_history_mv_1.cycle
LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2
  ON disclosure.fec_fitem_sched_a.clean_contbr_id = ofec_committee_history_mv_2.committee_id
  AND disclosure.fec_fitem_sched_a.two_year_transaction_period = ofec_committee_history_mv_2.cycle
WHERE disclosure.fec_fitem_sched_a.is_individual = true
  AND disclosure.fec_fitem_sched_a.contributor_name_text @@ to_tsquery('douglas:* & fox:*')
  AND (disclosure.fec_fitem_sched_a.sub_id) < (4071120051059063148)
ORDER BY disclosure.fec_fitem_sched_a.contb_receipt_dt DESC, disclosure.fec_fitem_sched_a.sub_id DESC
LIMIT 30

dzhang-fec commented 4 years ago

I only care about pagination performance: a 27-minute SQL effectively returns no result to the end clients and overloads the database. However, I am not set on a specific pagination logic. I am asking whether we have a plan to use AngularJS for pagination performance over the long term.

Jason Upchurch Today at 11:02 AM
Thanks @David Z. I'm fine to join whenever is good (except that I'll be out Monday). I want to understand the question and the ideas you may have, and I would be happy to join in. Your Angular question really goes to front-end architecture, I think, right? It probably can't hurt to invite some front-end folks to the discussion so we can all be aware of the issues and learn from each other (consider @pat, @Robert, @jcarroll). Some questions I would have in our discussion:
Is the main context of this discussion how we can improve the performance of db queries?
Would the use of Angular intend to bypass the API and have the front end query the db directly? Or, more directly, how does Angular fit into the db query itself, or into your question?
What are the implications for the existing framework of using the API/SQLAlchemy to query the db?
How does the UNION ALL work fit into this discussion, if at all?

John Carroll 10 hours ago
@Jason Upchurch, definitely sounds like a worthwhile discussion. Count me in :slightly_smiling_face:

David Z. 4 hours ago
@Jason Upchurch Thanks for your response.
1) Yes, it is front-end architecture, for the performance of our web pagination. AngularJS's data binding and dependency injection would eliminate much of our Python code while maintaining security.
2) How can we improve the performance of db queries? In this ticket, the current after-first-page query takes 27+ minutes, which makes it one of our top 3 slowest SQL statements. We need to consider the UNION ALL approach that @laura is working on (see the ticket), and we may consider using a timestamp instead of the sub_id key if that is acceptable. We can also reduce how often this slow SQL runs (i.e., call the API one time for all pages of the pagination, instead of hitting the database for every page). It is not secure to expose database access details in our code, so all database queries will go through our APIs (which use the existing SQLAlchemy to query the DB) and return JSON, etc., for pagination. The UNION ALL is just SQL-level tuning for that API, which will improve the query performance of the related API. Thanks.
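As a rough illustration of the sub_id-to-timestamp idea from point 3) of the first comment, a hypothetical keyset-pagination sketch that pages on the composite (contb_receipt_dt, sub_id) cursor so the pagination predicate lines up with the ORDER BY columns; the cursor literals below are placeholders, not real values:

```sql
-- Hypothetical sketch: in descending order, the next page continues from just
-- below the last (contb_receipt_dt, sub_id) pair returned by the previous page.
SELECT *
FROM disclosure.fec_fitem_sched_a
WHERE is_individual = true
  AND two_year_transaction_period IN (2020)
  AND (contb_receipt_dt, sub_id) < (DATE '2020-06-30', 4101920051061418807)  -- placeholder cursor
ORDER BY contb_receipt_dt DESC, sub_id DESC
LIMIT 100;
```

Because the row comparison matches the ORDER BY columns, an index like the existing idx_sched_a_2019_2020_two_year_period_dt_sub_id (which, judging by its name, covers the cycle, receipt date, and sub_id) could serve both the cursor predicate and the sort.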