fecgov / openFEC

The first RESTful API for the Federal Election Commission. We're aiming to make campaign finance more accessible for journalists, academics, developers, and other transparency seekers.
https://api.open.fec.gov/developers
Other
479 stars 106 forks source link

Enable sort_nulls_last parameter for schedule_b #3775

Open patphongs opened 5 years ago

patphongs commented 5 years ago

Summary

What we're after: As a user, I want to be able to look through relevant disbursement data without being hindered by null values within the data I am searching.

Screen Shot 2019-05-16 at 10 58 31 AM

Related PR

Completion criteria

qqss88 commented 5 years ago

sort_null_last for sched_b turned on by modifying sched_b.py:

show_nulls_last_arg=False --> show_nulls_last_arg=True,

qqss88 commented 5 years ago

pytest passed on local and circleCI build is successful.

qqss88 commented 5 years ago

according to local tests, there is still a big issue on performance when sort_nulls_last is turned on:

--without sort_nulls_last: http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date

response time: 1097ms

--with sort_nulls_last: http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date

response time: 399733ms

qqss88 commented 5 years ago

more testing on line_number:

F3X-21B: 1127ms vs 53155ms

F3X-28A: 1114ms vs 24623ms

qqss88 commented 5 years ago

with cycles: http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014 time: 376673ms

http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014 time: 1119ms

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&recipient_city=new york time: 3603ms

http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&recipient_city=new york time: 7921ms

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018

1107 vs 135437ms

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&min_amount=100&max_amount=5000&wo_year_transaction_period=2014 41709 vs 1105

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=50&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016&min_amount=50&max_amount=5000&wo_year_transaction_period=2014 time: 61550 vs 1087

qqss88 commented 5 years ago

Based on current testing against STG DB Performance is still a concern to turn on this sort_nulls_last flag.

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort=-disbursement_date&per_page=20&sort_null_only=false&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&disbursement_purpose_category=other&sort_nulls_last=true time: 367616 vs 979

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort=-disbursement_date&per_page=20&sort_null_only=false&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort_nulls_last=true&committee_id=C00401224 Time: 229535 vs 974

qqss88 commented 5 years ago

Here is a list endpoints with performance testing result:

cycle141618_cmte_tp_w http://127.0.0.1:5000/v1/schedules/schedule_b/?spender_committee_type=W&sort=-disbursement_date&per_page=20&sort_null_only=false&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort_nulls_last=true cycle141618_purpose_cat_other http://127.0.0.1:5000/v1/schedules/schedule_b/?sort=-disbursement_date&per_page=20&sort_null_only=false&disbursement_purpose_category=other&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort_nulls_last=true cycle141618_line_num_F3X-23&sort_nulls_last=true http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&line_number=F3X-23&sort_null_only=false&sort_nulls_last=true cycle141618_desp_contribution2&sort_nulls_last=true http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&disbursement_description=travel1&sort_nulls_last=true cycle141618_cmte_id_actblue http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&committee_id=C00401224&api_key=DEMO_KEY&sort_null_only=false&sort_nulls_last=true cycel141618_recipient_st http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&recipient_state=DC&sort_nulls_last=true cycle_141618_recipient_city http://127.0.0.1:5000/v1/schedules/schedule_b/?sort_hide_null=false&per_page=20&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&sort=-disbursement_date&api_key=DEMO_KEY&sort_null_only=false&recipient_city=washington&sort_nulls_last=true cycle141618_combine http://127.0.0.1:5000/v1/schedules/schedule_b/?spender_committee_type=W&sort=-disbursement_date&line_number=F3X-23&per_page=20&sort_null_only=false&disbursement_purpose_category=other&sort_hide_null=false&api_key=DEMO_KEY&committee_id=C00401224&two_year_transaction_period=2018&two_year_transaction_period=2016&two_year_transaction_period=2014&disbursement_description=travel&sort_nulls_last=true

Request cycle141618_cmte_tp_w took: 420.24 seconds

Request cycle141618_purpose_cat_other took: 381.78 seconds

Request cycle141618_line_num_F3X-23&sort_nulls_last=true took: 384.0 seconds

Request cycle141618_desp_contribution2&sort_nulls_last=true took: 1.13 seconds

Request cycle141618_cmte_id_actblue took: 232.65 seconds

Request cycel141618_recipient_st took: 158.44 seconds

Request cycle_141618_recipient_city took: 152.31 seconds

Request cycle141618_combine took: 5.38 seconds

qqss88 commented 5 years ago

without sort_nulls_last:

Request cycle141618_cmte_tp_w took: 1.0 seconds

Request cycle141618_purpose_cat_other took: 1.08 seconds

Request cycle141618_line_num_F3X-23 took: 0.99 seconds

Request cycle141618_desp_contribution2 took: 1.01 seconds

Request cycle141618_cmte_id_actblue took: 0.99 seconds

Request cycel141618_recipient_st took: 1.08 seconds

Request cycle_141618_recipient_city took: 0.89 seconds

Request cycle141618_combine took: 831.89 seconds

qqss88 commented 5 years ago

quick summary:

  1. based on performance testing data collected, sort_nulls_last is still a performance drag on some sched_b data filtering.
  2. part of sorting and null value issues can be resolved using the 'sort_hide_null' flag(which already turned on for sched_b) - we may need to consider how to make this more interactive on the front end.
qqss88 commented 5 years ago

Did some research on sort_nulls_last related performance issue on sched_b endpoints:

  1. when sorting without sort_nulls_last - by default, null-value items show up at the beginning of the results:

the queries are running very fast, comparing to queries with sort_nulls_last, which runs very slow. checking into the execute plan, the first queries are using the index scan and the second is not. See below diagrams:

first query without sort_nulls_last image

second query with sort_nulls_last:

image

qqss88 commented 5 years ago

Further research on sched_b shows that sched_b is using a special disbursement_date sort_expressions:

-- see itemized.py around line 364:

@hybrid_property
def sort_expressions(self):
    return {
        'disbursement_date': {
            'expression': sa.func.coalesce(
                self.disbursement_date,
                sa.cast('9999-12-31', sa.Date)
            ),
            'field': ma.fields.Date,
            'type': 'date',
            'null_sort': self.disbursement_date,
        },
    }

when we build the disbursement_date index, we use the same coalease function and same dummy date '9999-12-31' data. this sorting will override sort_nulls_last and the null-items always show up in the beginning in DESC order and show up in the end in ASC order.

qqss88 commented 5 years ago

potential solutions based on the research and discussion with the team:

  1. use a very old dummy date: like '1800-12-31'. This need to be updated both in the code and also in the index create sql to make sure the way we query data is aligned with the way the index is created.

  2. get rid of the coalesce function and dummy date at all. According to the database team, the function and dummy date is added for performance purpose. With the improved database server, we may NOT need this change any more. The solution need more change on both code base and database side. So we'll try the first solution first.

patphongs commented 5 years ago
  1. use a very old dummy date: like '1800-12-31'. This need to be updated both in the code and also in the index create sql to make sure the way we query data is aligned with the way the index is created.

@qqss88 @fecjjeng @rjayasekera In response to the above comment. The difficulty is that we can't guarantee that no one will enter a date like 1800-12-31 into their forms. I already see examples of disbursement dates of 0116-01-01, see here.

Can we use a year that no one is likely to type in like 000001, so that we are more likely to guarantee those values are sorted last?

lbeaufort commented 5 years ago

I don’t like the idea of altering the data from what was submitted - what if someone is looking for that data and it doesn’t match what was submitted? I defer to @PaulClark2 though

PaulClark2 commented 5 years ago

@patphongs @qqss88 @lbeaufort we should not insert data into null fields. We'll just need to figure out how to make the null date sort more performant.

qqss88 commented 5 years ago

@PaulClark2 @lbeaufort @patphongs we are not altering the data, we are just use this dummy date to force the sorting. And it is not my idea, it is something we implemented before - we are using '9999-12-31' right now, so we want to change that to an ancient date to reverse the location of the null items.

qqss88 commented 5 years ago

Jean use dummy date '0001-01-01' to create 3 new index The same set of indexes are created as last time (CMTE_ID, CMTE_TP, and LINE_NUM) on 2016, 2018, and 2020. NULLS FIRST. Here is one index create script: CREATE INDEX idx_sched_b_2019_2020_cmte_id_colsc_disb_dt_sub_id_null_fst2 ON disclosure.fec_fitem_sched_b_2019_2020 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '0001-01-01'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;

Did some quick testings this morning. My top 3 slow queries are still running slow. Now even I flip ‘sort_nulls_last’ those queries took minutes to load. Good thing is the sorting is working now with nulls items showing in the end with DESC sorting. Beyond the slow queries I identified, everything else seems loading ok.

qqss88 commented 5 years ago

make a note of 'invalid disbursement date' data issue from some queries. e.g.:

http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-21B&sort_null_only=false&sort_hide_null=false&per_page=20&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016

items returned with invalid disbursement dates: "disbursement_date": "9201-11-01T00:00:00", "disbursement_date": "4201-07-02T00:00:00", "disbursement_date": "2617-07-03T00:00:00", "disbursement_date": "2114-12-31T00:00:00", "disbursement_date": "2114-12-30T00:00:00" "disbursement_date": "2114-12-30T00:00:00", . . . "disbursement_date": "2019-12-04T00:00:00",

qqss88 commented 5 years ago

Did another round of testing by disabling special 'sort_expression' on sched_b disbursement_date:

http://127.0.0.1:5000/v1/schedules/schedule_b/?sort=-disbursement_date&per_page=20&sort_null_only=false&sort_hide_null=false&api_key=DEMO_KEY&two_year_transaction_period=2018&two_year_transaction_period=2016&sort_nulls_last=true&committee_id=C00401224 time: 197280ms vs 192434ms

the SQL query looks like this: SELECT * FROM disclosure.fec_fitem_sched_b LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_1 ON disclosure.fec_fitem_sched_b.cmte_id = ofec_committee_history_mv_1.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_1.cycle LEFT OUTER JOIN ofec_committee_history_mv AS ofec_committee_history_mv_2 ON disclosure.fec_fitem_sched_b.clean_recipient_cmte_id = ofec_committee_history_mv_2.committee_id AND disclosure.fec_fitem_sched_b.two_year_transaction_period = ofec_committee_history_mv_2.cycle WHERE disclosure.fec_fitem_sched_b.cmte_id IN (%(cmte_id_1)s) AND disclosure.fec_fitem_sched_b.two_year_transaction_period IN (%(two_year_transaction_period_1)s, %(two_year_transaction_period_2)s) ORDER BY disclosure.fec_fitem_sched_b.disb_dt DESC NULLS LAST, disclosure.fec_fitem_sched_b.sub_id DESC LIMIT %(param_1)s 2019-05-31 07:08:13,178 INFO sqlalchemy.engine.base.Engine {'cmte_id_1': 'C00401224', 'two_year_transaction_period_1': 2018, 'two_year_transaction_period_2': 2016, 'param_1': 20}

qqss88 commented 5 years ago

http://127.0.0.1:5000/v1/schedules/schedule_b/?line_number=F3X-23&sort_nulls_last=true&sort_null_only=false&sort_hide_null=false&per_page=20&sort=-disbursement_date&two_year_transaction_period=2018&two_year_transaction_period=2016 time: 312713ms vs 500ERROR

qqss88 commented 5 years ago

Current index:

  1. -- Index: idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id

-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id;

CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '9999-12-31'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;

    • Index: idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst

-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst;

CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '9999-12-31'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;

  1. -- Index: idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2

-- DROP INDEX disclosure.idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2;

CREATE INDEX idx_sched_b_2017_2018_cmte_id_colsc_disb_dt_sub_id_null_fst2 ON disclosure.fec_fitem_sched_b_2017_2018 USING btree (cmte_id COLLATE pg_catalog."default", (COALESCE(disb_dt, '0001-01-01'::date::timestamp without time zone)), sub_id) TABLESPACE pg_default;

qqss88 commented 5 years ago

Summary and quick conclusions:

  1. sched_b endpoint is using a special sorting expression with coalesce_date(-- see itemized.py around line 364). because of this sorting expression, sort_null and sort_nulls_last are overidden by dummy date sorting.
  2. by changing the dummy date from '9999-12-31' to '0001-01-01' we can achieve dsibursement_date sorting on DESC with nulls_last.
  3. part of index(partition 2020, 2018, 2016) updated on stage for testing, most queries are loading efficiently. Some slow queries identified: cmte_id=C00401224, cmte_tp='W' and line_number='f3x-23'. See above testings for details. The updated index does not see to help.
  4. there is another option of removing this sort_expression at all(this is added for performance purpose, but it is only sched_b, not sched_a). I disabled it and did some testing. Sorting is working correctly. But those slow queries are still slow and need to improved by adding non-coalesced index.

Because this change involves re-building some sched_b index on across all partitions, after re-evaluate the priority, we decide to move this to Sprint 10.