fedspendingtransparency / usaspending-api

Server application to serve U.S. federal spending data via a RESTful API
https://www.usaspending.gov
Creative Commons Zero v1.0 Universal

[DEV-11361] Update ES download query #4214

Closed: sethstoudenmier closed 1 month ago

sethstoudenmier commented 1 month ago

Description: This fixes a bug in the Elasticsearch download logic that was resulting in incorrect / duplicate data.

Technical details: The root cause of the problem was introduced when Subaward downloads moved to Elasticsearch. The process for Elasticsearch downloads is:

  1. Query ES and get a list of IDs that we should look up in the DB for the download
  2. Insert those IDs into a lookup table that the download query joins on
  3. Query the primary table for a download (e.g., award_search) by joining to the lookup table
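
As a rough illustration of the three steps above, here is a minimal sketch using an in-memory SQLite database. The `download_lookup` table name, its `lookup_id` column, the `award_id` and `recipient` columns, and the `fake_es_query` helper are all hypothetical stand-ins chosen for this example; the real code queries Elasticsearch and uses the project's own database schema.

```python
import sqlite3


def fake_es_query():
    # Step 1 stand-in (hypothetical): a real implementation would query
    # Elasticsearch and return the matching document IDs.
    return [1, 3]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE award_search (award_id INTEGER PRIMARY KEY, recipient TEXT);
    CREATE TABLE download_lookup (lookup_id INTEGER, lookup_id_type TEXT);
    INSERT INTO award_search VALUES (1, 'A'), (2, 'B'), (3, 'C');
    """
)

# Step 1: get the list of IDs to look up.
ids = fake_es_query()

# Step 2: insert those IDs into the lookup table.
conn.executemany(
    "INSERT INTO download_lookup VALUES (?, ?)",
    [(i, "award") for i in ids],
)

# Step 3: query the primary table by joining to the lookup table.
rows = conn.execute(
    """
    SELECT a.award_id, a.recipient
    FROM award_search a
    JOIN download_lookup l ON l.lookup_id = a.award_id
    ORDER BY a.award_id
    """
).fetchall()
```

Only the awards whose IDs were inserted in step 2 come back from step 3, which works as long as every row in the lookup table refers to the same kind of record.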

The problem is that when Subawards were introduced into the ES download workflow, they also inserted into the same lookup table. While the insert captured the lookup_id_type, that field was not actually used in the download query. That meant two cases were occurring:

NOTE: This issue also affects Transactions and Subawards in the same way.

The fix here is to make sure we use the lookup_id_type in the query to avoid pulling records we did not intend to pull. This was previously never an issue because you couldn't download Awards and Transactions in the same download.
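
A minimal sketch of the effect of that filter, again using in-memory SQLite rather than the project's real schema. The `download_lookup` table, its `lookup_id` column, the `award_id` column, and the `'award'` / `'subaward'` type values are assumptions made for illustration; only `award_search` and `lookup_id_type` are names taken from this description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE award_search (award_id INTEGER PRIMARY KEY, recipient TEXT);
    CREATE TABLE download_lookup (lookup_id INTEGER, lookup_id_type TEXT);
    INSERT INTO award_search VALUES (1, 'Award one'), (2, 'Award two');
    """
)

# The lookup table is shared: award ID 1 was requested, and subaward IDs 1
# and 2 were inserted alongside it (type values are hypothetical).
conn.executemany(
    "INSERT INTO download_lookup VALUES (?, ?)",
    [(1, "award"), (1, "subaward"), (2, "subaward")],
)

# Join that ignores lookup_id_type, as described for the buggy behavior.
UNFILTERED = """
    SELECT a.award_id
    FROM award_search a
    JOIN download_lookup l ON l.lookup_id = a.award_id
    ORDER BY a.award_id
"""

# Join that also restricts on lookup_id_type, as described for the fix.
FILTERED = """
    SELECT a.award_id
    FROM award_search a
    JOIN download_lookup l
      ON l.lookup_id = a.award_id AND l.lookup_id_type = ?
    ORDER BY a.award_id
"""

before = [row[0] for row in conn.execute(UNFILTERED)]
after = [row[0] for row in conn.execute(FILTERED, ("award",))]
```

In this toy data, the unfiltered join returns award 1 twice and also returns award 2, which was never requested; the filtered join returns only award 1.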

Requirements for PR merge:

  1. [ ] Unit & integration tests updated
  2. [x] API documentation updated
  3. [x] Necessary PR reviewers:
    • [x] Backend
    • [ ] Frontend
    • [ ] Operations
    • [ ] Domain Expert
  4. [x] Matview impact assessment completed
  5. [x] Frontend impact assessment completed
  6. [x] Data validation completed
  7. [x] Appropriate Operations ticket(s) created
  8. [x] Jira Ticket DEV-11361:
    • [x] Link to this Pull-Request
    • [x] Performance evaluation of affected (API | Script | Download)
    • [x] Before / After data comparison

Area for explaining above N/A when needed:

Historically, we have never tested the contents of our downloads, only that the endpoints complete.
While this is not a great excuse, it is the reason I did not add any additional test cases.