GSA / data.gov

Main repository for the data.gov service
https://data.gov

Identify approach for deriving latest set of records for a harvest source #4744

Closed rshewitt closed 4 months ago

rshewitt commented 4 months ago

User Story

In order to accurately run the compare, datagov wants to read the latest complete set of records from the db for a given harvest source

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

given...

 identifier | date_created | source_raw |  state
------------+--------------+------------+-----------
 a          | 2024-01-01   | "data"     | "success"
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 b          | 2022-05-01   | "data_30"  | "failed"
 c          | 2020-03-01   | "data_12"  | "success"

the result should be...

 identifier | date_created | source_raw |  state
------------+--------------+------------+-----------
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 c          | 2020-03-01   | "data_12"  | "success"
SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
WHERE sq.action != 'delete';

[Notes or a checklist reflecting our understanding of the selected approach]

jbrown-xentity commented 4 months ago

So the way to accomplish this is through window functions. Essentially, a subselect pulls everything and assigns a row number within each group, which lets you keep only one row per identifier and guarantees unique values. At this point you'll want to exclude "failures" from the initial window-function subselect; that way, if an update should have happened but failed, the update will be retried, and if the initial load failed it will retry as well (as if the last job for that record didn't happen).

You'll probably need to create some mock data and test to make sure this comes out correctly, but it should be able to pull the most recent hash for every record ID in the dataset; then we exclude all deleted ones (IDs marked for deletion).
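The window-function approach described above can be sketched as follows. This is a minimal mock, not the project's actual code: SQLite (via Python's stdlib `sqlite3`) stands in for Postgres, the `record` table and its values mirror the mock data later in this thread, and `ROW_NUMBER()` plays the role the comment describes, numbering each identifier's rows newest-first so `rn = 1` is the latest successful row.

```python
import sqlite3

# In-memory mock of the record table (mirrors the thread's test data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    date_created TEXT,
    source_raw TEXT,
    state TEXT
);
INSERT INTO record VALUES
 (1, 'a', '2024-01-01', 'data',    'success'),
 (2, 'a', '2024-03-01', 'data_1',  'success'),
 (3, 'b', '2022-03-01', 'data_10', 'success'),
 (4, 'b', '2022-05-01', 'data_30', 'failed'),
 (5, 'c', '2020-03-01', 'data_12', 'success');
""")

# Window function: number each identifier's rows newest-first,
# excluding failures from the subselect, then keep only rn = 1
# (the most recent successful row per identifier).
rows = conn.execute("""
SELECT identifier, date_created, source_raw FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY identifier
               ORDER BY date_created DESC) AS rn
    FROM record
    WHERE state = 'success'
) WHERE rn = 1
ORDER BY identifier;
""").fetchall()
print(rows)
# [('a', '2024-03-01', 'data_1'), ('b', '2022-03-01', 'data_10'), ('c', '2020-03-01', 'data_12')]
```

Note that because row 4 (`b`'s failed update) is excluded before the window is computed, `b` falls back to its last successful row, which is exactly the retry behavior described above.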

rshewitt commented 4 months ago
CREATE TABLE record( 
    id int PRIMARY KEY, 
    identifier VARCHAR(50), 
    date_created TIMESTAMP, 
    source_raw VARCHAR(50), 
    state VARCHAR(15)
);

INSERT INTO record (id, identifier, date_created, source_raw, state ) VALUES 
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success'), 
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success'), 
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'), 
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success' );

SELECT * FROM record;

 id | identifier |      date_created       | source_raw |  state  
----+------------+-------------------------+------------+---------
  1 | a          | 2024-01-01 00:00:00.001 | data       | success
  2 | a          | 2024-03-01 00:00:00.001 | data_1     | success
  3 | b          | 2022-03-01 00:00:00.001 | data_10    | success
  4 | b          | 2022-05-01 00:00:00.001 | data_30    | failed
  5 | c          | 2024-05-01 00:00:00.001 | data_12    | success
(5 rows)

SELECT DISTINCT ON (identifier) identifier, state, date_created, source_raw
FROM record
WHERE state = 'success'
ORDER BY identifier, date_created DESC;

 identifier |  state  |      date_created       | source_raw 
------------+---------+-------------------------+------------
 a          | success | 2024-03-01 00:00:00.001 | data_1
 b          | success | 2022-03-01 00:00:00.001 | data_10
 c          | success | 2024-05-01 00:00:00.001 | data_12
(3 rows)

this is meant to prove out getting the latest unique record based on a condition

FuhuXia commented 4 months ago

Good approach. Do we need an extra action column? Also, deleted records such as d should not be in the result.

  6 | d          | 2022-03-01 00:00:00.001 | data_3    | success | add
  7 | d          | 2024-05-01 00:00:00.001 |           | success | delete
rshewitt commented 4 months ago

implementation using sqlalchemy (source)

rshewitt commented 4 months ago

need to update the query to properly handle deleted records. the query from my comment above could return a historic record with action = "create" even when the latest record is deleted. it shouldn't do that: none of the historic records should be considered if the latest has been deleted.
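The ordering subtlety here can be demonstrated with a small mock: pick the latest successful row per identifier *first*, and only then drop identifiers whose latest action is 'delete'. This is a sketch in Python/SQLite (SQLite's `ROW_NUMBER()` emulating Postgres's `DISTINCT ON`), using the d/e cases from the test data later in the thread:

```python
import sqlite3

# Mock of the d/e edge cases: d's latest row is a delete, while e was
# deleted and then re-created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(identifier TEXT, date_created TEXT, action TEXT, state TEXT);
INSERT INTO record VALUES
 ('d', '2024-04-01', 'create', 'success'),
 ('d', '2024-05-01', 'delete', 'success'),  -- latest row for d is a delete
 ('e', '2024-04-01', 'create', 'success'),
 ('e', '2024-04-02', 'delete', 'success'),
 ('e', '2024-04-03', 'create', 'success');  -- e was re-created after its delete
""")

# Step 1 (inner query): keep only the latest successful row per
# identifier. Step 2 (outer filter): discard it if that latest row is
# a delete. Filtering deletes BEFORE step 1 would wrongly resurrect
# d's historic 'create' row.
rows = conn.execute("""
SELECT identifier, date_created, action FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY identifier
               ORDER BY date_created DESC) AS rn
    FROM record
    WHERE state = 'success'
) WHERE rn = 1 AND action != 'delete'
ORDER BY identifier;
""").fetchall()
print(rows)
# [('e', '2024-04-03', 'create')] -- d is gone, e survives via its re-create
```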

rshewitt commented 4 months ago

validate the distinct on approach with pure sql with the exclusion of deleted records before implementing it in sqlalchemy. after getting a working sql command run it on staging to benchmark performance

rshewitt commented 4 months ago

postgres playground. i believe that test data should account for our use cases which are...

rshewitt commented 4 months ago

Schema (PostgreSQL v15)

    CREATE TABLE record( 
        id int PRIMARY KEY, 
        identifier VARCHAR(50), 
        date_created TIMESTAMP, 
        source_raw VARCHAR(50), 
        state VARCHAR(15),
        action VARCHAR(15),
        source_id VARCHAR(30)
    );

    INSERT INTO record (id, identifier, date_created, source_raw, state, action, source_id ) VALUES 
    (1, 'a', '2024-01-01 00:00:00.001', 'data', 'success', 'create', 'example-source'), 
    (2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success', 'update', 'example-source'), 
    (3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success', 'create', 'example-source'), 
    (4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed', 'update', 'example-source' ),
    (5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success', 'create', 'example-source' ),
    (6, 'd', '2024-05-01 00:00:00.001', 'data_2', 'success', 'delete', 'example-source' ),
    (7, 'd', '2024-04-01 00:00:00.001', 'data_5', 'success', 'create', 'example-source' ),
    (8, 'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (9, 'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete','example-source' ),
    (10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (11, 'f', '2024-04-01 00:00:00.001', 'data_87', 'success', 'create','example-source-other' );

    SELECT * FROM record;
 id | identifier |       date_created       | source_raw |  state  | action |      source_id
----+------------+--------------------------+------------+---------+--------+----------------------
  1 | a          | 2024-01-01T00:00:00.001Z | data       | success | create | example-source
  2 | a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
  3 | b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
  4 | b          | 2022-05-01T00:00:00.001Z | data_30    | failed  | update | example-source
  5 | c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
  6 | d          | 2024-05-01T00:00:00.001Z | data_2     | success | delete | example-source
  7 | d          | 2024-04-01T00:00:00.001Z | data_5     | success | create | example-source
  8 | e          | 2024-04-01T00:00:00.001Z | data_123   | success | create | example-source
  9 | e          | 2024-04-02T00:00:00.001Z | data_123   | success | delete | example-source
 10 | e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source
 11 | f          | 2024-04-01T00:00:00.001Z | data_87    | success | create | example-source-other

Option 1 (using subquery)

    SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
    WHERE sq.action != 'delete';
 identifier |       date_created       | source_raw |  state  | action |   source_id
------------+--------------------------+------------+---------+--------+----------------
 a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
 b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
 c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
 e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source

View on DB Fiddle

FuhuXia commented 4 months ago

the subquery execution time against the staging catalog db is in the 200-400 ms range.

=> select count(*) from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted';
 count 
-------
 37369
(1 row)

Time: 328.239 ms

=> select * from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted' limit 3;
              package_id              |          gathered          |  state   | report_status | current 
--------------------------------------+----------------------------+----------+---------------+---------
 00025aef-24c7-4090-ba92-1fb962c8483b | 2022-11-30 17:21:03.005029 | COMPLETE | updated       | t
 0002b09c-f26d-4229-9ac6-7617eb0fc10f | 2020-11-11 17:02:35.063411 | COMPLETE | added         | t
 000336b7-85bc-48c7-bb01-5def319b7961 | 2023-02-03 20:27:01.00436  | COMPLETE | updated       | t
(3 rows)

Time: 284.748 ms
rshewitt commented 4 months ago

thanks @FuhuXia for benchmarking those queries! looks like we can proceed with using the subquery option in the sketch.