Description:
This fixes a bug in the Elasticsearch download logic that was resulting in incorrect / duplicate data.
Technical details:
The root cause of the problem was introduced when Subaward downloads moved to Elasticsearch. The process for Elasticsearch downloads is:
Query ES to get the list of IDs that we should look up in the DB for the download
Insert those IDs into a lookup table
Query the primary table for the download (e.g., award_search) by joining to the lookup table
The problem is that when Subawards were introduced into the ES download workflow, they also inserted into the same lookup table. While the insert captured the lookup_id_type, that field was not actually used in the download query. That meant two cases were occurring:
An ID inserted into the lookup table corresponded to both an award_id and a broker_subaward_id, which resulted in data appearing in one of those downloads when it shouldn't
A download filter matched an Award and a Subaward that happened to have the same value for award_id and broker_subaward_id, respectively; this resulted in duplicate rows in the download, because the join used by the download query is a Cartesian join that relies solely on a filter in the WHERE clause
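The two cases above can be reproduced with a minimal sketch. The schema here is a simplified, hypothetical stand-in (the download_lookup table name, its lookup_id column, and the description column are assumptions for illustration, not the project's real definitions), and SQLite is used only so the example is self-contained:

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE award_search (award_id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE download_lookup (lookup_id INTEGER, lookup_id_type TEXT);

    INSERT INTO award_search VALUES (1, 'award one'), (2, 'award two');

    -- The Award filter matched award 1; the Subaward filter matched
    -- subawards 1 and 2. Both sets of IDs land in the same lookup table.
    INSERT INTO download_lookup VALUES
        (1, 'award_id'),
        (1, 'broker_subaward_id'),
        (2, 'broker_subaward_id');
""")

# Cartesian join filtered only on the ID value; lookup_id_type is ignored.
buggy_rows = conn.execute("""
    SELECT a.award_id, a.description
    FROM award_search AS a, download_lookup AS l
    WHERE a.award_id = l.lookup_id
    ORDER BY a.award_id
""").fetchall()

print(buggy_rows)
```

In this sketch, award 1 comes back twice (the duplicate-row case) and award 2 comes back even though only a Subaward ID matched it (the data-leak case).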
NOTE: This issue also affects Transactions and Subawards in the same way.
The fix here is to make sure we use the lookup_id_type in the query to avoid pulling records we did not intend to pull. This was previously never an issue because you couldn't download Awards and Transactions in the same download.
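A minimal sketch of the fix, under the same caveats: the download_lookup table, its lookup_id column, the description column, and the fetch_awards helper are hypothetical names chosen for illustration, and the real download query is not shown in this description. The only point is that the WHERE clause also constrains lookup_id_type:

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE award_search (award_id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE download_lookup (lookup_id INTEGER, lookup_id_type TEXT);

    INSERT INTO award_search VALUES (1, 'award one'), (2, 'award two');
    INSERT INTO download_lookup VALUES
        (1, 'award_id'),
        (1, 'broker_subaward_id'),
        (2, 'broker_subaward_id');
""")


def fetch_awards(connection, id_type="award_id"):
    """Return award rows whose IDs were inserted for the given ID type.

    Hypothetical helper: filtering on lookup_id_type keeps IDs that were
    inserted for a different entity (e.g. Subawards) out of the result.
    """
    return connection.execute(
        """
        SELECT a.award_id, a.description
        FROM award_search AS a, download_lookup AS l
        WHERE a.award_id = l.lookup_id
          AND l.lookup_id_type = ?
        ORDER BY a.award_id
        """,
        (id_type,),
    ).fetchall()


fixed_rows = fetch_awards(conn)
print(fixed_rows)
```

With the type filter in place, only award 1 is returned, and only once.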
Requirements for PR merge:
[x] Performance evaluation of affected (API | Script | Download)
[x] Before / After data comparison
Area for explaining above N/A when needed:
We historically have never tested the contents of our downloads, only that the endpoints are completing.
While this is not a great excuse, it is the reason why I did not add any additional test cases.