HHS / simpler-grants-gov

https://simpler.grants.gov
Other
42 stars 13 forks source link

Investigate relevency scoring in search endpoint #1544

Closed chouinar closed 1 month ago

chouinar commented 7 months ago

Summary

One thing we eventually want to support is the ability to do a relevancy sort in our search endpoint. When we add a search index, this will be a lot more flexible, and easier to support, but we want to see if there is some basic version we can create now for our first iteration of search that just queries against our database.

This ticket is to just do a bit of research and testing to find if there's a way to modify our current queries to support a very basic relevancy sort. Even if the scoring is just counting the number of matches, that would be fine.

A few ideas:

Note that our current query looks something like:

SELECT
    opportunity.opportunity_id,
    opportunity.opportunity_number,
    opportunity.opportunity_title,
    opportunity.agency,
    opportunity.opportunity_category_id,
    opportunity.category_explanation,
    opportunity.is_draft,
    opportunity.revision_number,
    opportunity.modified_comments,
    opportunity.publisher_user_id,
    opportunity.publisher_profile_id,
    opportunity.created_at,
    opportunity.updated_at
FROM
    opportunity
    JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
    JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
WHERE
    opportunity.opportunity_id in( SELECT DISTINCT
            opportunity.opportunity_id FROM opportunity
            JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
            JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
            JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
            JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
            JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
            LEFT JOIN opportunity_assistance_listing ON opportunity.opportunity_id = opportunity_assistance_listing.opportunity_id
        WHERE
            opportunity.is_draft IS FALSE
            AND(EXISTS (
                    SELECT
                        1 FROM current_opportunity_summary
                    WHERE
                        opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
                AND(opportunity.opportunity_title ILIKE '%example%'
                    OR opportunity.opportunity_number ILIKE '%example%'
                    OR opportunity.agency ILIKE '%example%'
                    OR opportunity_summary.summary_description ILIKE '%example%'
                    OR opportunity_assistance_listing.assistance_listing_number = 'example'
                    OR opportunity_assistance_listing.program_title ILIKE '%example%'))
ORDER BY
    opportunity_summary.post_date DESC nulls LAST
LIMIT 25 OFFSET 25

Which won't quite support any of the above approaches

Acceptance criteria

bretthrosenblatt commented 7 months ago

@chouinar If the this needs to be an external query, I'd make the following research recommendations:

1) Consider replacing the lookup table joins with recursive cte queries ('with' clause). Should be faster. While the db will attempt to inline the lookup data it can't always do so, and will often do so fully, before filtering

2) Consider replacing DISTINCT with a lateral cross join or intersect (depending on your logical needs)

3) Consider using a materialized view to replace the joins. Lookup tables are pretty static, and this can be fully automated with a refresh trigger. Flattening the data without strict denormalization will eliminate a lot of the work the db needs to do for search

4) Use deterministic functions for lookup values instead of joins. This is a little risky since you don't have a data profile yet; it can be dramatically faster (especially if you do the ranking there), but will also incur context-switching penalties and likely will show a higher cost. The advantage is for tables with low cardinality, where the function is called once and the result cached

widal001 commented 7 months ago

@chouinar Echoing Brett's comments above, do you have the SQLAlchemy query that generates this? I have a feeling some of the (potential) inefficiencies might be stemming from some of the translation that happens as a result of the ORM.

chouinar commented 7 months ago

@widal001 - https://github.com/HHS/simpler-grants-gov/blob/main/api/src/services/opportunities_v0_1/search_opportunities.py is where the SQLAlchemy queries are built. I built the queries first, and then implemented them in SQLAlchemy, so the implementation itself wasn't the constraining factor for the most part.

Definitely worth discussing the queries a bit more. I'll admit I'm not an expert when it comes to these very complex types of queries. I'll look to setup some time when I start working on this to figure out how we want to build out the queries. From what Brett put above, I think he'll have a much better idea for how to build some of this out than I would come up with.

widal001 commented 7 months ago

@chouinar Gotcha!! Thanks for linking this

Definitely worth discussing the queries a bit more. I'll admit I'm not an expert when it comes to these very complex types of queries. I'll look to setup some time when I start working on this to figure out how we want to build out the queries. From what Brett put above, I think he'll have a much better idea for how to build some of this out than I would come up with.

Would it be helpful for me to noodle on optimizing/simplifying this query with you to try to incorporate some of Brett's suggestions? Or are you excited to dig into this further yourself?

chouinar commented 7 months ago

@widal001 - Sure I'd be happy to chat through some ideas

widal001 commented 7 months ago

@chouinar Sounds great! I'll give this a bit more thought over the next couple of days and see if I can share an updated query with you that we can work on translating back to SQLAlchemy

chouinar commented 7 months ago

@widal001 - happy to pair on it, or if you want to look a bit on your own, one tip I can give when running the DB locally: If yo want a lot of data locally, I recently updated the seed script to take in an iteration count to produce more data easily. make db-seed-local args="--iterations 300" will generate ~10k opportunities (and records across several other tables), although be warned this takes like 10 minutes to run.

widal001 commented 7 months ago

@chouinar Here's a simple example of an approach that I think might help us with adding a relevancy score to the results that we can search by.

The basic strategy here is:

  1. Join the opportunity and summary tables in a CTE and instead of doing a straight filter in a WHERE use a CASE WHEN statement to assign that match a given weighted value (e.g. for example if we want to weight matches against title higher than against description)
  2. Do something similar for each lookup table aggregating by opportunity_summary_id to get a score for total number of matched rows for a given opportunity summary.
  3. Left join the linked table match CTEs to the current summary CTEs and then add the match_ columns to get a match_score value
  4. Filter the resulting table by rows where match_score is greater than 1 and then sort by match_score to get a relevancy sort.

Here's some SQL that you can run in a local SQLite db to check the basic approach:

-- ###################################
-- Drop tables
-- ###################################

DROP TABLE IF EXISTS opportunity;
DROP TABLE IF EXISTS opportunity_summary;
DROP TABLE IF EXISTS current_opportunity_summary;
DROP TABLE IF EXISTS lookup_attribute_a;
DROP TABLE IF EXISTS link_attribute_a;

-- ###################################
-- Create tables
-- ###################################

CREATE TABLE opportunity (
  opportunity_id INT PRIMARY KEY,
  opportunity_title TEXT,
  agency TEXT
);

CREATE TABLE opportunity_summary (
  opportunity_summary_id INT PRIMARY KEY,
  opportunity_id INT,
  summary_description TEXT
);

CREATE TABLE current_opportunity_summary (
  opportunity_id INT,
  opportunity_summary_id INT
);

CREATE TABLE lookup_attribute_a (
  attribute_a_id INT,
  description TEXT
);

CREATE TABLE link_attribute_a (
  opportunity_summary_id INT,
  attribute_a_id INT
);

-- ###################################
-- Populate tables
-- ###################################

INSERT INTO opportunity(opportunity_id, opportunity_title, agency) VALUES (1, 'Solar panel grant', 'Environmental Protection Agency');
INSERT INTO opportunity(opportunity_id, opportunity_title, agency) VALUES (2, 'Green power grant', 'Environmental Protection Agency');

INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (1, 1, 'Grants for solar panels');
INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (2, 2, 'Earn credits for hydropower projects');
INSERT INTO opportunity_summary (opportunity_id, opportunity_summary_id, summary_description) VALUES (2, 3, 'Earn credits for hydropower and solar projects');

INSERT INTO current_opportunity_summary (opportunity_id, opportunity_summary_id) VALUES (1, 1);
INSERT INTO current_opportunity_summary (opportunity_id, opportunity_summary_id) VALUES (2, 3);

INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (1, 'solar');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (2, 'hydro');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (3, 'electricity');
INSERT INTO lookup_attribute_a (attribute_a_id, description) VALUES (4, 'solar panels');

INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 1);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 3);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (1, 4);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 1);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 2);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 3);
INSERT INTO link_attribute_a (opportunity_summary_id, attribute_a_id) VALUES (3, 4);

-- ###################################
-- Search opportunities for 'solar'
-- ###################################

SELECT ' ';
SELECT 'Search opportunities for SOLAR';

-- Get current opportunity and create weighted scores for individual attribute matches
WITH current_opportunity AS (
    SELECT
        opportunity.opportunity_id,
        opportunity.opportunity_title,
        opportunity.agency,
        summary.opportunity_summary_id,
        summary.summary_description,
        CASE
            WHEN UPPER(opportunity.opportunity_title) LIKE '%SOLAR%'
            THEN 2
            ELSE 0
        END AS match_title,
        CASE
            WHEN UPPER(summary.summary_description) LIKE '%SOLAR%'
            THEN 1
            ELSE 0
        END AS match_description,
        CASE
            WHEN UPPER(opportunity.agency) LIKE '%SOLAR%'
            THEN 1
            ELSE 0
        END AS match_agency
    FROM opportunity
    JOIN current_opportunity_summary AS current_summary ON current_summary.opportunity_id = opportunity.opportunity_id
    JOIN opportunity_summary AS summary ON current_summary.opportunity_summary_id = summary.opportunity_summary_id
),
-- Filter link_attribute_a to only include records where attribute a matches search criterion
attribute_a_match_all AS (
    SELECT
        link.opportunity_summary_id,
        lookup.description,
        1 AS match_attr_a
    FROM lookup_attribute_a AS lookup
    JOIN link_attribute_a AS link ON lookup.attribute_a_id = link.attribute_a_id
    WHERE UPPER(description) LIKE '%SOLAR%'
),
-- Group by opportunity_summary_id to sum total matches and prevent fan outs
attribute_a_match_agg AS (
    SELECT
        opportunity_summary_id,
        SUM(match_attr_a) AS match_attr_a
    FROM attribute_a_match_all
    GROUP BY opportunity_summary_id
)
SELECT
    current_opportunity.opportunity_id,
    current_opportunity.opportunity_title,
    current_opportunity.agency,
    current_opportunity.summary_description,
    (
        current_opportunity.match_agency
        + current_opportunity.match_title
        + current_opportunity.match_description
        + attribute_a_match_agg.match_attr_a
    ) AS match_score
FROM current_opportunity
LEFT JOIN attribute_a_match_agg ON attribute_a_match_agg.opportunity_summary_id = current_opportunity.opportunity_summary_id
WHERE match_score >= 1
ORDER BY match_score DESC;

-- ###################################
-- Search opportunities for 'hydro'
-- ###################################

SELECT ' ';
SELECT 'Search opportunities for HYDRO';

-- Get current opportunity and create weighted scores for individual attribute matches
WITH current_opportunity AS (
    SELECT
        opportunity.opportunity_id,
        opportunity.opportunity_title,
        opportunity.agency,
        summary.opportunity_summary_id,
        summary.summary_description,
        CASE
            WHEN UPPER(opportunity.opportunity_title) LIKE '%HYDRO%'
            THEN 2
            ELSE 0
        END AS match_title,
        CASE
            WHEN UPPER(summary.summary_description) LIKE '%HYDRO%'
            THEN 1
            ELSE 0
        END AS match_description,
        CASE
            WHEN UPPER(opportunity.agency) LIKE '%HYDRO%'
            THEN 1
            ELSE 0
        END AS match_agency
    FROM opportunity
    JOIN current_opportunity_summary AS current_summary ON current_summary.opportunity_id = opportunity.opportunity_id
    JOIN opportunity_summary AS summary ON current_summary.opportunity_summary_id = summary.opportunity_summary_id
),
-- Filter link_attribute_a to only include records where attribute a matches search criterion
attribute_a_match_all AS (
    SELECT
        link.opportunity_summary_id,
        lookup.description,
        1 AS match_attr_a
    FROM lookup_attribute_a AS lookup
    JOIN link_attribute_a AS link ON lookup.attribute_a_id = link.attribute_a_id
    -- make sure we're only counting matches for the current summary
    JOIN current_opportunity_summary AS summary ON summary.opportunity_summary_id = link.opportunity_summary_id
    WHERE UPPER(description) LIKE '%HYDRO%'
),
-- Group by opportunity_summary_id to sum total matches and prevent fan outs
attribute_a_match_agg AS (
    SELECT
        opportunity_summary_id,
        SUM(match_attr_a) AS match_attr_a
    FROM attribute_a_match_all
    GROUP BY opportunity_summary_id
)
SELECT
    current_opportunity.opportunity_id,
    current_opportunity.opportunity_title,
    current_opportunity.agency,
    current_opportunity.summary_description,
    (
        current_opportunity.match_agency
        + current_opportunity.match_title
        + current_opportunity.match_description
        + attribute_a_match_agg.match_attr_a
    ) AS match_score
FROM current_opportunity
LEFT JOIN attribute_a_match_agg ON attribute_a_match_agg.opportunity_summary_id = current_opportunity.opportunity_summary_id
WHERE match_score >= 1
ORDER BY match_score DESC;

It should print out the following results:

sqlite> .read init_db.sql
' '

'Search opportunities for SOLAR'
Search opportunities for SOLAR
opportunity_id|opportunity_title|agency|summary_description|match_score
1|Solar panel grant|Environmental Protection Agency|Grants for solar panels|5
2|Green power grant|Environmental Protection Agency|Earn credits for hydropower and solar projects|3
' '

'Search opportunities for HYDRO'
Search opportunities for HYDRO
opportunity_id|opportunity_title|agency|summary_description|match_score
2|Green power grant|Environmental Protection Agency|Earn credits for hydropower and solar projects|2
chouinar commented 7 months ago

Been chatting in a separate thread, pulling a few ideas together, a current work-in-progress query idea for scoring.

I know it has a few issues (needs some coalesces for null values), but putting this here for discussion.

WITH RECURSIVE summary_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            summary_description) AS summary_score
    FROM
        opportunity_summary
),
opportunity_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            opportunity_title) AS title_score,
        word_similarity ('health',
            opportunity_number) AS number_score,
        word_similarity ('health',
            agency) AS agency_score
    FROM
        opportunity
),
assistance_listing_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            assistance_listing_number) AS al_number_score,
        word_similarity ('health',
            program_title) AS program_title_score
    FROM
        opportunity_assistance_listing
),
agg_assistance_listing_scores AS (
    SELECT
        opportunity_id,
        SUM(al_number_score) AS al_number_score,
        SUM(program_title_score) AS program_title_score
    FROM
        assistance_listing_scores
    GROUP BY
        opportunity_id
),
score_sum AS (
    SELECT
        summary_scores.opportunity_id,
        (summary_scores.summary_score + opportunity_scores.title_score + opportunity_scores.number_score + opportunity_scores.agency_score + agg_assistance_listing_scores.al_number_score + agg_assistance_listing_scores.program_title_score) AS total_score
    FROM
        summary_scores
    LEFT JOIN opportunity_scores ON summary_scores.opportunity_id = opportunity_scores.opportunity_id
    LEFT JOIN agg_assistance_listing_scores ON summary_scores.opportunity_id = agg_assistance_listing_scores.opportunity_id
)
SELECT
    opportunity.opportunity_id,
    opportunity.opportunity_title,
    opportunity.agency,
    opportunity_summary.summary_description,
    score_sum.total_score
FROM
    opportunity
    JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
    JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id,
    score_sum
WHERE
    opportunity.opportunity_id = score_sum.opportunity_id
ORDER BY
    score_sum.total_score DESC nulls LAST
LIMIT 25 OFFSET 1
bretthrosenblatt commented 7 months ago

Here's the query with scores and counts (overall and per id) from the chats with recursive summa as (select opportunity_id, opportunity_summary_id, summary_description, word_similarity('hydra solar', summary_description) as dif, 'hydra' <-> summary_description as dist, count() over () as cnt, count() over (partition by opportunity_id) as cnt_per_id from opportunity_summary ) select opportunity.opportunity_id, opportunity.opportunity_title, opportunity.agency, summa.summary_description, summa.dif as matching_factor, summa.dist as distance, summa.cnt, summa.cnt_per_id from opportunity, summa where opportunity.opportunity_id = summa.opportunity_id;

results
chouinar commented 7 months ago

Here's an update to my above query, still missing a few things like coalesces/null handling for the scoring.

I added a filter_query which handles the filters (eg. you want just forecasted and posted opportunity statuses). I also made it so the score queries all reference it, which I actually think is required. Because the score queries are just going against a particular table like the opportunity summary, they may be calculating scores for past summaries, or ones that aren't the current. When we aggregate the scores, that could cause the score to be higher for something with a lot of revisions, which we wouldn't want, so those queries need to either have the logic that joins them to the current opportunity table, or instead just against the query that already handles that.

WITH RECURSIVE 
filter_query AS (
    SELECT
        opportunity.opportunity_id, opportunity_summary.opportunity_summary_id
    FROM
        opportunity
        JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
        JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id
        JOIN link_opportunity_summary_funding_instrument ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_instrument.opportunity_summary_id
        JOIN link_opportunity_summary_funding_category ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_funding_category.opportunity_summary_id
        JOIN link_opportunity_summary_applicant_type ON opportunity_summary.opportunity_summary_id = link_opportunity_summary_applicant_type.opportunity_summary_id
    WHERE
        opportunity.is_draft IS FALSE
        AND(EXISTS (
                SELECT
                    1 FROM current_opportunity_summary
                WHERE
                    opportunity.opportunity_id = current_opportunity_summary.opportunity_id))
        AND link_opportunity_summary_funding_instrument.funding_instrument_id in(1)
        AND current_opportunity_summary.opportunity_status_id in (1, 2)
), summary_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            summary_description) AS summary_score
    FROM
        opportunity_summary
    WHERE opportunity_summary.opportunity_summary_id in (select opportunity_summary_id from filter_query)
),
opportunity_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            opportunity_title) AS title_score,
        word_similarity ('health',
            opportunity_number) AS number_score,
        word_similarity ('health',
            agency) AS agency_score
    FROM
        opportunity

    WHERE opportunity.opportunity_id in (select opportunity_id from filter_query)
),
assistance_listing_scores AS (
    SELECT
        opportunity_id,
        word_similarity ('health',
            assistance_listing_number) AS al_number_score,
        word_similarity ('health',
            program_title) AS program_title_score
    FROM
        opportunity_assistance_listing

    WHERE opportunity_assistance_listing.opportunity_id in (select opportunity_id from filter_query)

),
agg_assistance_listing_scores AS (
    SELECT
        opportunity_id,
        SUM(al_number_score) AS al_number_score,
        SUM(program_title_score) AS program_title_score
    FROM
        assistance_listing_scores
    GROUP BY
        opportunity_id
),
score_sum AS (
    SELECT
        summary_scores.opportunity_id,
        (summary_scores.summary_score + opportunity_scores.title_score + opportunity_scores.number_score + opportunity_scores.agency_score + agg_assistance_listing_scores.al_number_score + agg_assistance_listing_scores.program_title_score) AS total_score
    FROM
        summary_scores
    LEFT JOIN opportunity_scores ON summary_scores.opportunity_id = opportunity_scores.opportunity_id
    LEFT JOIN agg_assistance_listing_scores ON summary_scores.opportunity_id = agg_assistance_listing_scores.opportunity_id
)

SELECT
    opportunity.opportunity_id, opportunity.opportunity_title, opportunity.agency, opportunity_summary.summary_description, score_sum.total_score
FROM
    opportunity
    JOIN current_opportunity_summary ON opportunity.opportunity_id = current_opportunity_summary.opportunity_id
    JOIN opportunity_summary ON current_opportunity_summary.opportunity_summary_id = opportunity_summary.opportunity_summary_id, score_sum
WHERE
    opportunity.opportunity_id = score_sum.opportunity_id
    AND opportunity.opportunity_id in(
        SELECT
            opportunity_id FROM filter_query)
    AND score_sum.total_score > 0.3
ORDER BY
    score_sum.total_score DESC nulls LAST
LIMIT 25 OFFSET 1
mxk0 commented 1 month ago

@chouinar is this ticket still relevant, or is OpenSearch the answer?

chouinar commented 1 month ago

@mxk0 - with the search index approach agreed on, I think we're good to close this ticket.