Open patphongs opened 5 years ago
sort_null_last for sched_b turned on by modifying sched_b.py:
show_nulls_last_arg=False --> show_nulls_last_arg=True,
pytest passed on local and circleCI build is successful.
according to local tests, there is still a big issue on performance when sort_nulls_last is turned on:
--without sort_nulls_last: http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date
response time: 1097ms
--with sort_nulls_last: http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date
response time: 399733ms
more testing on line_number:
F3X-21B: 1127ms vs 53155ms
F3X-28A: 1114ms vs 24623ms
Based on current testing against STG DB Performance is still a concern to turn on this sort_nulls_last flag.
Here is a list endpoints with performance testing result:
cycle141618_cmte_tp_w http://127.0.0.1:5000/v1/schedules/schedule_b/?spender_committee_type=W&sort=-disbursement_date&per_page=20&sort_null_only=false&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort_nulls_last=true cycle141618_purpose_cat_other http://127.0.0.1:5000/v1/schedules/schedule_b/?sort=-disbursement_date&per_page=20&sort_null_only=false&disbursement_purpose_category=other&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort_nulls_last=true cycle141618_line_num_F3X-23&sort_nulls_last=true http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&line_number=F3X-23&sort_null_only=false&sort_nulls_last=true cycle141618_desp_contribution2&sort_nulls_last=true http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&disbursement_description=travel1&sort_nulls_last=true cycle141618_cmte_id_actblue http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&committee_id=C00401224&api_key=DEMO_KEY&sort_null_only=false&sort_nulls_last=true cycel141618_recipient_st http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&recipient_state=DC&sort_nulls_last=true cycle_141618_recipient_city http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&recipient_city=washington&sort_nulls_last=true cycle141618_combine http://127.0.0.1:5000/v1/schedules/schedule_b/?spender_committee_type=W&sort=-disbursement_date&line_number=F3X-23&per_page=20&sort_null_only=false&disbursement_purpose_category=other&sort_hide_null=false&api_key=DEMO_KEY&committee_id=C00401224&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&disbursement_description=travel&sort_nulls_last=true
Request cycle141618_cmte_tp_w took: 420.24 seconds
Request cycle141618_purpose_cat_other took: 381.78 seconds
Request cycle141618_line_num_F3X-23&sort_nulls_last=true took: 384.0 seconds
Request cycle141618_desp_contribution2&sort_nulls_last=true took: 1.13 seconds
Request cycle141618_cmte_id_actblue took: 232.65 seconds
Request cycel141618_recipient_st took: 158.44 seconds
Request cycle_141618_recipient_city took: 152.31 seconds
Request cycle141618_combine took: 5.38 seconds
without sort_nulls_last:
Request cycle141618_cmte_tp_w took: 1.0 seconds
Request cycle141618_purpose_cat_other took: 1.08 seconds
Request cycle141618_line_num_F3X-23 took: 0.99 seconds
Request cycle141618_desp_contribution2 took: 1.01 seconds
Request cycle141618_cmte_id_actblue took: 0.99 seconds
Request cycel141618_recipient_st took: 1.08 seconds
Request cycle_141618_recipient_city took: 0.89 seconds
Request cycle141618_combine took: 831.89 seconds
quick summary:
Did some research on sort_nulls_last related performance issue on sched_b endpoints:
the queries are running very fast, comparing to queries with sort_nulls_last, which runs very slow. checking into the execute plan, the first queries are using the index scan and the second is not. See below diagrams:
first query without sort_nulls_last
second query with sort_nulls_last:
Further research on sched_b shows that sched_b is using a special disbursement_date sort_expressions:
-- see itemized.py around line 364:
@hybrid_property
def sort_expressions(self):
return {
'disbursement_date': {
'expression': sa.func.coalesce(
self.disbursement_date,
sa.cast('9999-12-31', sa.Date)
),
'field': ma.fields.Date,
'type': 'date',
'null_sort': self.disbursement_date,
},
}
when we build the disbursement_date index, we use the same coalease function and same dummy date '9999-12-31' data. this sorting will override sort_nulls_last and the null-items always show up in the beginning in DESC order and show up in the end in ASC order.
potential solutions based on the research and discussion with the team:
use a very old dummy date: like '1800-12-31'. This need to be updated both in the code and also in the index create sql to make sure the way we query data is aligned with the way the index is created.
get rid of the coalesce function and dummy date at all. According to the database team, the function and dummy date is added for performance purpose. With the improved database server, we may NOT need this change any more. The solution need more change on both code base and database side. So we'll try the first solution first.
- use a very old dummy date: like '1800-12-31'. This need to be updated both in the code and also in the index create sql to make sure the way we query data is aligned with the way the index is created.
@qqss88 @fecjjeng @rjayasekera In response to the above comment. The difficulty is that we can't guarantee that no one will enter a date like 1800-12-31
into their forms. I already see examples of disbursement dates of 0116-01-01
, see here.
Can we use a year that no one is likely to type in like 000001
, so that we are more likely to guarantee those values are sorted last?
I don’t like the idea of altering the data from what was submitted - what if someone is looking for that data and it doesn’t match what was submitted? I defer to @PaulClark2 though
@patphongs @qqss88 @lbeaufort we should not insert data into null fields. We'll just need to figure out how to make the null date sort more performant.
@PaulClark2 @lbeaufort @patphongs we are not altering the data, we are just use this dummy date to force the sorting. And it is not my idea, it is something we implemented before - we are using '9999-12-31' right now, so we want to change that to an ancient date to reverse the location of the null items.
Jean use dummy date '0001-01-01' to create 3 new index The same set of indexes are created as last time (CMTE_ID, CMTE_TP, and LINE_NUM) on 2016, 2018, and 2020. NULLS FIRST. Here is one index create script: CREATE INDEX idx_sched_b_2019_2020_cmte_id_colsc_disb_dt_sub_id_null_fst2 ON disclosure.fec_fitem_sched_b_2019_2020 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '0001-01-01'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;
Did some quick testings this morning. My top 3 slow queries are still running slow. Now even I flip ‘sort_nulls_last’ those queries took minutes to load. Good thing is the sorting is working now with nulls items showing in the end with DESC sorting. Beyond the slow queries I identified, everything else seems loading ok.
make a note of 'invalid disbursement date' data issue from some queries. e.g.:
items returned with invalid disbursement dates: "disbursement_date": "9201-11-01T00:00:00", "disbursement_date": "4201-07-02T00:00:00", "disbursement_date": "2617-07-03T00:00:00", "disbursement_date": "2114-12-31T00:00:00", "disbursement_date": "2114-12-30T00:00:00" "disbursement_date": "2114-12-30T00:00:00", . . . "disbursement_date": "2019-12-04T00:00:00",
Did another round of testing by disabling special 'sort_expression' on sched_b disbursement_date:
the SQL query looks like this: SELECT * FROM disclosure.fec_fitem_sched_b LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON disclosure.fec_fitem_sched_b.cmte_id = ofec_committee_history_mv_1.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2 ON disclosure.fec_fitem_sched_b.clean_recipient_cmte_id = ofec_committee_history_mv_2.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle WHERE disclosure.fec_fitem_sched_b.cmte_id IN (%(cmte_id_1)s) AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (%(two_year_transaction_period_1)s, %(two_year_transaction_period_2)s) ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC NULLS LAST, disclosure.fec_fitem_sched_b.sub_id DESC LIMIT %(param_1)s 2019-05-31 07:08:13,178 INFO sqlalchemy.engine.base.Engine {'cmte_id_1': 'C00401224', 'two_year_transaction_period_1': 2018, 'two_year_transaction_period_2': 2016, 'param_1': 20}
Current index:
-- Index: idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id
-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id;
CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '9999-12-31'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;
-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst;
CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '9999-12-31'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;
-- Index: idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2
-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2;
CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2 ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '0001-01-01'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;
Summary and quick conclusions:
Because this change involves re-building some sched_b index on across all partitions, after re-evaluate the priority, we decide to move this to Sprint 10.
Summary
What we're after: As a user, I want to be able to look through relevant disbursement data without being hindered by null values within the data I am searching.
Related PR
Completion criteria
sort_hide_null capability