GSA / data.gov

Main repository for the data.gov service
https://data.gov

Identify approach for deriving latest set of records for a harvest source #4744

Closed rshewitt closed 4 months ago

rshewitt commented 4 months ago

User Story

In order to accurately run the compare, datagov wants to read the latest complete set of records from the db for a given harvest source

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

given...

 identifier | date_created | source_raw |  state
------------+--------------+------------+-----------
 a          | 2024-01-01   | "data"     | "success"
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 b          | 2022-05-01   | "data_30"  | "failed"
 c          | 2020-03-01   | "data_12"  | "success"

the result should be...

 identifier | date_created | source_raw |  state
------------+--------------+------------+-----------
 a          | 2024-03-01   | "data_1"   | "success"
 b          | 2022-03-01   | "data_10"  | "success"
 c          | 2020-03-01   | "data_12"  | "success"
SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
WHERE sq.action != 'delete';

[Notes or a checklist reflecting our understanding of the selected approach]

jbrown-xentity commented 4 months ago

So the way to accomplish this is through window functions. Essentially, a subselect pulls everything and assigns a row number within each group, which lets you keep only one row per identifier and guarantees unique values. At this point you'll want to exclude "failures" from the initial window-function subselect; that way, if an update should have happened but failed, the update will be retried, and if the initial load failed it will retry as well (as if the last job for that record didn't happen).

You'll probably need to create some mock data and test to make sure this comes out correctly, but it should be able to pull the most recent hash for every record ID in the dataset; then we exclude all deleted ones (IDs marked for deletion).
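The window-function approach described above can be sketched as follows. This is a minimal mock, not the project's actual code: SQLite (via Python's stdlib `sqlite3`) stands in for Postgres, the `record` table and its values mirror the mock data later in this thread, and `ROW_NUMBER()` plays the role the comment describes, numbering each identifier's rows newest-first so `rn = 1` is the latest successful row.

```python
import sqlite3

# In-memory mock of the record table (mirrors the thread's test data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(
    id INTEGER PRIMARY KEY,
    identifier TEXT,
    date_created TEXT,
    source_raw TEXT,
    state TEXT
);
INSERT INTO record VALUES
 (1, 'a', '2024-01-01', 'data',    'success'),
 (2, 'a', '2024-03-01', 'data_1',  'success'),
 (3, 'b', '2022-03-01', 'data_10', 'success'),
 (4, 'b', '2022-05-01', 'data_30', 'failed'),
 (5, 'c', '2020-03-01', 'data_12', 'success');
""")

# Window function: number each identifier's rows newest-first,
# excluding failures from the subselect, then keep only rn = 1
# (the most recent successful row per identifier).
rows = conn.execute("""
SELECT identifier, date_created, source_raw FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY identifier
               ORDER BY date_created DESC) AS rn
    FROM record
    WHERE state = 'success'
) WHERE rn = 1
ORDER BY identifier;
""").fetchall()
print(rows)
# [('a', '2024-03-01', 'data_1'), ('b', '2022-03-01', 'data_10'), ('c', '2020-03-01', 'data_12')]
```

Note that because row 4 (`b`'s failed update) is excluded before the window is computed, `b` falls back to its last successful row, which is exactly the retry behavior described above.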

rshewitt commented 4 months ago
CREATE TABLE record( 
    id int PRIMARY KEY, 
    identifier VARCHAR(50), 
    date_created TIMESTAMP, 
    source_raw VARCHAR(50), 
    state VARCHAR(15)
);

INSERT INTO record (id, identifier, date_created, source_raw, state ) VALUES 
(1, 'a', '2024-01-01 00:00:00.001', 'data', 'success'), 
(2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success'), 
(3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success'), 
(4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed' ),
(5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success' );

SELECT * FROM record;

 id | identifier |      date_created       | source_raw |  state  
----+------------+-------------------------+------------+---------
  1 | a          | 2024-01-01 00:00:00.001 | data       | success
  2 | a          | 2024-03-01 00:00:00.001 | data_1     | success
  3 | b          | 2022-03-01 00:00:00.001 | data_10    | success
  4 | b          | 2022-05-01 00:00:00.001 | data_30    | failed
  5 | c          | 2024-05-01 00:00:00.001 | data_12    | success
(5 rows)

SELECT DISTINCT ON (identifier) identifier, state, date_created, source_raw
FROM record
WHERE state = 'success'
ORDER BY identifier, date_created DESC;

 identifier |  state  |      date_created       | source_raw 
------------+---------+-------------------------+------------
 a          | success | 2024-03-01 00:00:00.001 | data_1
 b          | success | 2022-03-01 00:00:00.001 | data_10
 c          | success | 2024-05-01 00:00:00.001 | data_12
(3 rows)

this is meant to prove out getting the latest unique record based on a condition

FuhuXia commented 4 months ago

Good approach. Do we need an extra action column? Also, deleted records such as d should not be in the result.

  6 | d          | 2022-03-01 00:00:00.001 | data_3    | success | add
  7 | d          | 2024-05-01 00:00:00.001 |           | success | delete
rshewitt commented 4 months ago

implementation using sqlalchemy (source)

rshewitt commented 4 months ago

need to update the query to properly handle deleted records. the query from my comment above could return a historic record with action = "create" even when the latest record is deleted. it shouldn't do that: none of the historic records should be considered if the latest has been deleted.
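The ordering subtlety here can be demonstrated with a small mock: pick the latest successful row per identifier *first*, and only then drop identifiers whose latest action is 'delete'. This is a sketch in Python/SQLite (SQLite's `ROW_NUMBER()` emulating Postgres's `DISTINCT ON`), using the d/e cases from the test data later in the thread:

```python
import sqlite3

# Mock of the d/e edge cases: d's latest row is a delete, while e was
# deleted and then re-created.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record(identifier TEXT, date_created TEXT, action TEXT, state TEXT);
INSERT INTO record VALUES
 ('d', '2024-04-01', 'create', 'success'),
 ('d', '2024-05-01', 'delete', 'success'),  -- latest row for d is a delete
 ('e', '2024-04-01', 'create', 'success'),
 ('e', '2024-04-02', 'delete', 'success'),
 ('e', '2024-04-03', 'create', 'success');  -- e was re-created after its delete
""")

# Step 1 (inner query): keep only the latest successful row per
# identifier. Step 2 (outer filter): discard it if that latest row is
# a delete. Filtering deletes BEFORE step 1 would wrongly resurrect
# d's historic 'create' row.
rows = conn.execute("""
SELECT identifier, date_created, action FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY identifier
               ORDER BY date_created DESC) AS rn
    FROM record
    WHERE state = 'success'
) WHERE rn = 1 AND action != 'delete'
ORDER BY identifier;
""").fetchall()
print(rows)
# [('e', '2024-04-03', 'create')] -- d is gone, e survives via its re-create
```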

rshewitt commented 4 months ago

validate the distinct on approach with pure sql with the exclusion of deleted records before implementing it in sqlalchemy. after getting a working sql command run it on staging to benchmark performance

rshewitt commented 4 months ago

postgres playground. i believe that test data should account for our use cases which are...

rshewitt commented 4 months ago

Schema (PostgreSQL v15)

    CREATE TABLE record( 
        id int PRIMARY KEY, 
        identifier VARCHAR(50), 
        date_created TIMESTAMP, 
        source_raw VARCHAR(50), 
        state VARCHAR(15),
        action VARCHAR(15),
        source_id VARCHAR(30)
    );

    INSERT INTO record (id, identifier, date_created, source_raw, state, action, source_id ) VALUES 
    (1, 'a', '2024-01-01 00:00:00.001', 'data', 'success', 'create', 'example-source'), 
    (2, 'a', '2024-03-01 00:00:00.001', 'data_1', 'success', 'update', 'example-source'), 
    (3, 'b', '2022-03-01 00:00:00.001', 'data_10', 'success', 'create', 'example-source'), 
    (4, 'b', '2022-05-01 00:00:00.001', 'data_30', 'failed', 'update', 'example-source' ),
    (5, 'c', '2024-05-01 00:00:00.001', 'data_12', 'success', 'create', 'example-source' ),
    (6, 'd', '2024-05-01 00:00:00.001', 'data_2', 'success', 'delete', 'example-source' ),
    (7, 'd', '2024-04-01 00:00:00.001', 'data_5', 'success', 'create', 'example-source' ),
    (8, 'e', '2024-04-01 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (9, 'e', '2024-04-02 00:00:00.001', 'data_123', 'success', 'delete','example-source' ),
    (10, 'e', '2024-04-03 00:00:00.001', 'data_123', 'success', 'create','example-source' ),
    (11, 'f', '2024-04-01 00:00:00.001', 'data_87', 'success', 'create','example-source-other' );

    SELECT * FROM record;
 id | identifier |       date_created       | source_raw |  state  | action |      source_id
----+------------+--------------------------+------------+---------+--------+----------------------
  1 | a          | 2024-01-01T00:00:00.001Z | data       | success | create | example-source
  2 | a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
  3 | b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
  4 | b          | 2022-05-01T00:00:00.001Z | data_30    | failed  | update | example-source
  5 | c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
  6 | d          | 2024-05-01T00:00:00.001Z | data_2     | success | delete | example-source
  7 | d          | 2024-04-01T00:00:00.001Z | data_5     | success | create | example-source
  8 | e          | 2024-04-01T00:00:00.001Z | data_123   | success | create | example-source
  9 | e          | 2024-04-02T00:00:00.001Z | data_123   | success | delete | example-source
 10 | e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source
 11 | f          | 2024-04-01T00:00:00.001Z | data_87    | success | create | example-source-other

Option 1 (using subquery)

    SELECT * FROM (
      SELECT DISTINCT ON (identifier) identifier, date_created, source_raw, state, action, source_id
      FROM record 
      WHERE state = 'success' AND source_id = 'example-source'
      ORDER BY identifier, date_created DESC ) sq
    WHERE sq.action != 'delete';
 identifier |       date_created       | source_raw |  state  | action |   source_id
------------+--------------------------+------------+---------+--------+----------------
 a          | 2024-03-01T00:00:00.001Z | data_1     | success | update | example-source
 b          | 2022-03-01T00:00:00.001Z | data_10    | success | create | example-source
 c          | 2024-05-01T00:00:00.001Z | data_12    | success | create | example-source
 e          | 2024-04-03T00:00:00.001Z | data_123   | success | create | example-source

View on DB Fiddle

FuhuXia commented 4 months ago

the subquery execution time against the staging catalog db is in the 200-400 ms range.

=> select count(*) from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted';
 count 
-------
 37369
(1 row)

Time: 328.239 ms

=> select * from (
SELECT DISTINCT ON (package_id) package_id, gathered, state, report_status, current
FROM harvest_object 
WHERE state = 'COMPLETE' AND harvest_source_id='58f92550-7a01-4f00-b1b2-8dc953bd598f'
ORDER BY package_id, gathered DESC 
) t
 where report_status!='deleted' limit 3;
              package_id              |          gathered          |  state   | report_status | current 
--------------------------------------+----------------------------+----------+---------------+---------
 00025aef-24c7-4090-ba92-1fb962c8483b | 2022-11-30 17:21:03.005029 | COMPLETE | updated       | t
 0002b09c-f26d-4229-9ac6-7617eb0fc10f | 2020-11-11 17:02:35.063411 | COMPLETE | added         | t
 000336b7-85bc-48c7-bb01-5def319b7961 | 2023-02-03 20:27:01.00436  | COMPLETE | updated       | t
(3 rows)

Time: 284.748 ms
rshewitt commented 4 months ago

thanks @FuhuXia for benchmarking those queries! looks like we can proceed with using the subquery option in the sketch.