Closed chouinar closed 1 month ago
@chouinar If the this needs to be an external query, I'd make the following research recommendations:
1) Consider replacing the lookup table joins with recursive cte queries ('with' clause). Should be faster. While the db will attempt to inline the lookup data it can't always do so, and will often do so fully, before filtering
2) Consider replacing DISTINCT with a lateral cross join or intersect (depending on your logical needs)
3) Consider using a materialized view to replace the joins. Lookup tables are pretty static, and this can be fully automated with a refresh trigger. Flattening the data without strict denormalization will eliminate a lot of the work the db needs to do for search
4) Use deterministic functions for lookup values instead of joins. This is a little risky since you don't have a data profile yet; it can be dramatically faster (especially if you do the ranking there), but will also incur context-switching penalties and likely will show a higher cost. The advantage is for tables with low cardinality, where the function is called once and the result cached
@chouinar Echoing Brett's comments above, do you have the SQLAlchemy query that generates this? I have a feeling some of the (potential) inefficiencies might be stemming from some of the translation that happens as a result of the ORM.
@widal001 - https://github.com/HHS/simpler-grants-gov/blob/main/api/src/services/opportunities_v0_1/search_opportunities.py is where the SQLAlchemy queries are built. I built the queries first, and then implemented them in SQLAlchemy, so the implementation itself wasn't the constraining factor for the most part.
Definitely worth discussing the queries a bit more. I'll admit I'm not an expert when it comes to these very complex types of queries. I'll look to setup some time when I start working on this to figure out how we want to build out the queries. From what Brett put above, I think he'll have a much better idea for how to build some of this out than I would come up with.
@chouinar Gotcha!! Thanks for linking this
Definitely worth discussing the queries a bit more. I'll admit I'm not an expert when it comes to these very complex types of queries. I'll look to setup some time when I start working on this to figure out how we want to build out the queries. From what Brett put above, I think he'll have a much better idea for how to build some of this out than I would come up with.
Would it be helpful for me to noodle on optimizing/simplifying this query with you to try to incorporate some of Brett's suggestions? Or are you excited to dig into this further yourself?
@widal001 - Sure I'd be happy to chat through some ideas
@chouinar Sounds great! I'll give this a bit more thought over the next couple of days and see if I can share an updated query with you that we can work on translating back to SQLAlchemy
@widal001 - happy to pair on it, or if you want to look a bit on your own, one tip I can give when running the DB locally: If yo want a lot of data locally, I recently updated the seed script to take in an iteration count to produce more data easily. make db-seed-local args="--iterations 300"
will generate ~10k opportunities (and records across several other tables), although be warned this takes like 10 minutes to run.
@chouinar Here's a simple example of an approach that I think might help us with adding a relevancy score to the results that we can search by.
The basic strategy here is:
WHERE
use a CASE WHEN
statement to assign that match a given weighted value (e.g. for example if we want to weight matches against title higher than against description)opportunity_summary_id
to get a score for total number of matched rows for a given opportunity summary.match_
columns to get a match_score
valuematch_score
is greater than 1 and then sort by match_score
to get a relevancy sort.Here's some SQL that you can run in a local SQLite db to check the basic approach:
-- ###################################
-- Drop tables
-- ###################################
DROP TABLE IF EXISTS opportunity;
DROP TABLE IF EXISTS opportunity_summary;
DROP TABLE IF EXISTS current_opportunity_summary;
DROP TABLE IF EXISTS lookup_attribute_a;
DROP TABLE IF EXISTS link_attribute_a;
-- ###################################
-- Create tables
-- ###################################
CREATE TABLE opportunity (
opportunity_id INT PRIMARY KEY,
opportunity_title TEXT,
agency TEXT
);
CREATE TABLE opportunity_summary (
opportunity_summary_id INT PRIMARY KEY,
opportunity_id INT,
summary_description TEXT
);
CREATE TABLE current_opportunity_summary (
opportunity_id INT,
opportunity_summary_id INT
);
CREATE TABLE lookup_attribute_a (
attribute_a_id INT,
description TEXT
);
CREATE TABLE link_attribute_a (
opportunity_summary_id INT,
attribute_a_id INT
);
-- ###################################
-- Populate tables
-- ###################################
INSERT INTO opportunity(opportunity_id, opportunity_title, agency) VALUES (1, 'Solar panel grant', 'Environmental Protection Agency');
INSERT INTO opportunity(opportunity_id, opportunity_title, agency) VALUES (2, 'Green power grant', 'Environmental Protection Agency');
INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (1, 1, 'Grants for solar panels');
INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (2, 2, 'Earn credits for hydropower projects');
INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (2, 3, 'Earn credits for hydropower and solar projects');
INSERT INTO current_opportunity_summary (opportunity_id, opportunity_summary_id) VALUES (1, 1);
INSERT INTO current_opportunity_summary (opportunity_id, opportunity_summary_id) VALUES (2, 3);
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (1, 'solar');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (2, 'hydro');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (3, 'electricity');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (4, 'solar panels');
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 1);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 3);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 4);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 1);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 2);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 3);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 4);
-- ###################################
-- Search opportunities for 'solar'
-- ###################################
SELECT ' ';
SELECT 'Search opportunities for SOLAR';
-- Get current opportunity and create weighted scores for individual attribute matches
WITH current_opportunity AS (
SELECT
opportunity.opportunity_id,
opportunity.opportunity_title,
opportunity.agency,
summary.opportunity_summary_id,
summary.summary_description,
CASE
WHEN UPPER(opportunity.opportunity_title) LIKE '%SOLAR%'
THEN 2
ELSE 0
END AS match_title,
CASE
WHEN UPPER(summary.summary_description) LIKE '%SOLAR%'
THEN 1
ELSE 0
END AS match_description,
CASE
WHEN UPPER(opportunity.agency) LIKE '%SOLAR%'
THEN 1
ELSE 0
END AS match_agency
FROM opportunity
JOIN current_opportunity_summary AS current_summary ON current_summary.opportunity_id = opportunity.opportunity_id
JOIN opportunity_summary AS summary ON current_summary.opportunity_summary_id = summary.opportunity_summary_id
),
-- Filter link_attribute_a to only include records where attribute a matches search criterion
attribute_a_match_all AS (
SELECT
link.opportunity_summary_id,
lookup.description,
1 AS match_attr_a
FROM lookup_attribute_a AS lookup
JOIN link_attribute_a AS link ON lookup.attribute_a_id = link.attribute_a_id
WHERE UPPER(description) LIKE '%SOLAR%'
),
-- Group by opportunity_summary_id to sum total matches and prevent fan outs
attribute_a_match_agg AS (
SELECT
opportunity_summary_id,
SUM(match_attr_a) AS match_attr_a
FROM attribute_a_match_all
GROUP BY opportunity_summary_id
)
SELECT
current_opportunity.opportunity_id,
current_opportunity.opportunity_title,
current_opportunity.agency,
current_opportunity.summary_description,
(
current_opportunity.match_agency
+ current_opportunity.match_title
+ current_opportunity.match_description
+ attribute_a_match_agg.match_attr_a
) AS match_score
FROM current_opportunity
LEFT JOIN attribute_a_match_agg ON attribute_a_match_agg.opportunity_summary_id = current_opportunity.opportunity_summary_id
WHERE match_score >= 1
ORDER BY match_score DESC;
-- ###################################
-- Search opportunities for 'hydro'
-- ###################################
SELECT ' ';
SELECT 'Search opportunities for HYDRO';
-- Get current opportunity and create weighted scores for individual attribute matches
WITH current_opportunity AS (
SELECT
opportunity.opportunity_id,
opportunity.opportunity_title,
opportunity.agency,
summary.opportunity_summary_id,
summary.summary_description,
CASE
WHEN UPPER(opportunity.opportunity_title) LIKE '%HYDRO%'
THEN 2
ELSE 0
END AS match_title,
CASE
WHEN UPPER(summary.summary_description) LIKE '%HYDRO%'
THEN 1
ELSE 0
END AS match_description,
CASE
WHEN UPPER(opportunity.agency) LIKE '%HYDRO%'
THEN 1
ELSE 0
END AS match_agency
FROM opportunity
JOIN current_opportunity_summary AS current_summary ON current_summary.opportunity_id = opportunity.opportunity_id
JOIN opportunity_summary AS summary ON current_summary.opportunity_summary_id = summary.opportunity_summary_id
),
-- Filter link_attribute_a to only include records where attribute a matches search criterion
attribute_a_match_all AS (
SELECT
link.opportunity_summary_id,
lookup.description,
1 AS match_attr_a
FROM lookup_attribute_a AS lookup
JOIN link_attribute_a AS link ON lookup.attribute_a_id = link.attribute_a_id
-- make sure we're only counting matches for the current summary
JOIN current_opportunity_summary AS summary ON summary.opportunity_summary_id = link.opportunity_summary_id
WHERE UPPER(description) LIKE '%HYDRO%'
),
-- Group by opportunity_summary_id to sum total matches and prevent fan outs
attribute_a_match_agg AS (
SELECT
opportunity_summary_id,
SUM(match_attr_a) AS match_attr_a
FROM attribute_a_match_all
GROUP BY opportunity_summary_id
)
SELECT
current_opportunity.opportunity_id,
current_opportunity.opportunity_title,
current_opportunity.agency,
current_opportunity.summary_description,
(
current_opportunity.match_agency
+ current_opportunity.match_title
+ current_opportunity.match_description
+ attribute_a_match_agg.match_attr_a
) AS match_score
FROM current_opportunity
LEFT JOIN attribute_a_match_agg ON attribute_a_match_agg.opportunity_summary_id = current_opportunity.opportunity_summary_id
WHERE match_score >= 1
ORDER BY match_score DESC;
It should print out the following results:
sqlite> .read init_db.sql
' '
'Search opportunities for SOLAR'
Search opportunities for SOLAR
opportunity_id|opportunity_title|agency|summary_description|match_score
1|Solar panel grant|Environmental Protection Agency|Grants for solar panels|5
2|Green power grant|Environmental Protection Agency|Earn credits for hydropower and solar projects|3
' '
'Search opportunities for HYDRO'
Search opportunities for HYDRO
opportunity_id|opportunity_title|agency|summary_description|match_score
2|Green power grant|Environmental Protection Agency|Earn credits for hydropower and solar projects|2
Been chatting in a separate thread, pulling a few ideas together, a current work-in-progress query idea for scoring.
I know it has a few issues (needs some coalesces for null values), but putting this here for discussion.
WITH RECURSIVE summary_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
summary_description) AS summary_score
FROM
opportunity_summary
),
opportunity_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
opportunity_title) AS title_score,
word_similarity ('health',
opportunity_number) AS number_score,
word_similarity ('health',
agency) AS agency_score
FROM
opportunity
),
assistance_listing_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
assistance_listing_number) AS al_number_score,
word_similarity ('health',
program_title) AS program_title_score
FROM
opportunity_assistance_listing
),
agg_assistance_listing_scores AS (
SELECT
opportunity_id,
SUM(al_number_score) AS al_number_score,
SUM(program_title_score) AS program_title_score
FROM
assistance_listing_scores
GROUP BY
opportunity_id
),
score_sum AS (
SELECT
summary_scores.opportunity_id,
(summary_scores.summary_score + opportunity_scores.title_score + opportunity_scores.number_score + opportunity_scores.agency_score + agg_assistance_listing_scores.al_number_score + agg_assistance_listing_scores.program_title_score) AS total_score
FROM
summary_scores
LEFT JOIN opportunity_scores ON summary_scores.opportunity_id = opportunity_scores.opportunity_id
LEFT JOIN agg_assistance_listing_scores ON summary_scores.opportunity_id = agg_assistance_listing_scores.opportunity_id
)
SELECT
opportunity.opportunity_id,
opportunity.opportunity_title,
opportunity.agency,
opportunity_summary.summary_description,
score_sum.total_score
FROM
opportunity
JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id,
score_sum
WHERE
opportunity.opportunity_id = score_sum.opportunity_id
ORDER BY
score_sum.total_score DESC nulls LAST
LIMIT 25 OFFSET 1
Here's the query with scores and counts (overall and per id) from the chats with recursive summa as (select opportunity_id, opportunity_summary_id, summary_description, word_similarity('hydra solar', summary_description) as dif, 'hydra' <-> summary_description as dist, count() over () as cnt, count() over (partition by opportunity_id) as cnt_per_id from opportunity_summary ) select opportunity.opportunity_id, opportunity.opportunity_title, opportunity.agency, summa.summary_description, summa.dif as matching_factor, summa.dist as distance, summa.cnt, summa.cnt_per_id from opportunity, summa where opportunity.opportunity_id = summa.opportunity_id;
Here's an update to my above query, still missing a few things like coalesces/null handling for the scoring.
I added a filter_query
which handles the filters (eg. you want just forecasted
and posted
opportunity statuses). I also made it so the score queries all reference it, which I actually think is required. Because the score queries are just going against a particular table like the opportunity summary, they may be calculating scores for past summaries, or ones that aren't the current. When we aggregate the scores, that could cause the score to be higher for something with a lot of revisions, which we wouldn't want, so those queries need to either have the logic that joins them to the current opportunity table, or instead just against the query that already handles that.
WITH RECURSIVE
filter_query AS (
SELECT
opportunity.opportunity_id, opportunity_summary.opportunity_summary_id
FROM
opportunity
JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
WHERE
opportunity.is_draft IS FALSE
AND(EXISTS (
SELECT
1 FROM current_opportunity_summary
WHERE
opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
AND link_opportunity_summary_funding_instrument.funding_instrument_id in(1)
AND current_opportunity_summary.opportunity_status_id in (1, 2)
), summary_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
summary_description) AS summary_score
FROM
opportunity_summary
WHERE opportunity_summary.opportunity_summary_id in (select opportunity_summary_id from filter_query)
),
opportunity_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
opportunity_title) AS title_score,
word_similarity ('health',
opportunity_number) AS number_score,
word_similarity ('health',
agency) AS agency_score
FROM
opportunity
WHERE opportunity.opportunity_id in (select opportunity_id from filter_query)
),
assistance_listing_scores AS (
SELECT
opportunity_id,
word_similarity ('health',
assistance_listing_number) AS al_number_score,
word_similarity ('health',
program_title) AS program_title_score
FROM
opportunity_assistance_listing
WHERE opportunity_assistance_listing.opportunity_id in (select opportunity_id from filter_query)
),
agg_assistance_listing_scores AS (
SELECT
opportunity_id,
SUM(al_number_score) AS al_number_score,
SUM(program_title_score) AS program_title_score
FROM
assistance_listing_scores
GROUP BY
opportunity_id
),
score_sum AS (
SELECT
summary_scores.opportunity_id,
(summary_scores.summary_score + opportunity_scores.title_score + opportunity_scores.number_score + opportunity_scores.agency_score + agg_assistance_listing_scores.al_number_score + agg_assistance_listing_scores.program_title_score) AS total_score
FROM
summary_scores
LEFT JOIN opportunity_scores ON summary_scores.opportunity_id = opportunity_scores.opportunity_id
LEFT JOIN agg_assistance_listing_scores ON summary_scores.opportunity_id = agg_assistance_listing_scores.opportunity_id
)
SELECT
opportunity.opportunity_id, opportunity.opportunity_title, opportunity.agency, opportunity_summary.summary_description, score_sum.total_score
FROM
opportunity
JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id, score_sum
WHERE
opportunity.opportunity_id = score_sum.opportunity_id
AND opportunity.opportunity_id in(
SELECT
opportunity_id FROM filter_query)
AND score_sum.total_score > 0.3
ORDER BY
score_sum.total_score DESC nulls LAST
LIMIT 25 OFFSET 1
@chouinar is this ticket still relevant, or is OpenSearch the answer?
@mxk0 - with the search index approach agreed on, I think we're good to close this ticket.
Summary
One thing we eventually want to support is the ability to do a relevancy sort in our search endpoint. When we add a search index, this will be a lot more flexible, and easier to support, but we want to see if there is some basic version we can create now for our first iteration of search that just queries against our database.
This ticket is to just do a bit of research and testing to find if there's a way to modify our current queries to support a very basic relevancy sort. Even if the scoring is just counting the number of matches, that would be fine.
A few ideas:
Note that our current query looks something like:
Which won't quite support any of the above approaches
Acceptance criteria