GSA / data.gov

Main repository for the data.gov service
https://data.gov

Create Initial Compare Function for DCAT-US #4557

Closed. btylerburton closed this issue 6 months ago

btylerburton commented 9 months ago

User Story

In order to load test our compare solution, datagovteam wants to develop the initial iteration of our compare app functionality for DCAT-US.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

For the operations below, assume a for loop over the harvest_source_map is in progress:

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

[diagram attachment]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

rshewitt commented 8 months ago

compare branch: added compare logic and a unit test. Still need to add an integration test against a real CKAN endpoint.
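
As a rough illustration of the shape the compare logic could take (the names compare_sets, harvest_items, and ckan_items are made up for this sketch, not the actual implementation), assuming each side is a dict mapping a dataset identifier to a hash of its raw DCAT-US record:

def compare_sets(harvest_items: dict, ckan_items: dict) -> dict:
    """Bucket dataset identifiers into create/update/delete sets by comparing hashes."""
    harvest_ids = set(harvest_items)
    ckan_ids = set(ckan_items)
    return {
        # present in the harvest source but not in CKAN
        "create": harvest_ids - ckan_ids,
        # present in both, but the hashes differ
        "update": {i for i in harvest_ids & ckan_ids if harvest_items[i] != ckan_items[i]},
        # present in CKAN but no longer in the harvest source
        "delete": ckan_ids - harvest_ids,
    }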

rshewitt commented 8 months ago

CKAN adds fields to a dataset that don't, or may not, derive from the catalog itself (e.g. metadata_created defaults to utcnow, license_id appears to default to "notspecified" if unspecified, and information about the dataset's organization is auto-populated; see picture).

[Screenshot 2024-01-02 at 9:58:08 AM]

If we intend to compare by hash, we would need to strip everything CKAN adds to ensure we're comparing accurately.
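
A minimal sketch of what that cleanup-then-hash step could look like; the field list below is illustrative rather than exhaustive, and the function name is made up:

import hashlib
import json

# Fields CKAN populates on its own (illustrative list, not exhaustive)
CKAN_ADDED_FIELDS = {"id", "metadata_created", "metadata_modified",
                     "license_id", "organization", "creator_user_id", "state"}

def clean_and_hash(ckan_dataset: dict) -> str:
    """Drop CKAN-populated fields, then hash the remainder deterministically."""
    cleaned = {k: v for k, v in ckan_dataset.items() if k not in CKAN_ADDED_FIELDS}
    # sort_keys keeps the hash independent of key ordering
    return hashlib.sha256(json.dumps(cleaned, sort_keys=True).encode()).hexdigest()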

btylerburton commented 8 months ago

Alternatively, we can hash the dataset prior to pushing it to CKAN and store the hash in S3. Then we compare the incoming hash with the previously recorded one. This also allows us to bypass the CKAN API for fetching the datasets and to control the amount of information that is hashed.
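
A rough sketch of that approach, assuming boto3 and an S3 bucket; the bucket name, key scheme, and function name are made up for illustration:

import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "datagov-harvest-hashes"  # hypothetical bucket name

def record_hash(source_id: str, dataset: dict) -> str:
    """Hash the raw DCAT-US record before pushing to CKAN and store the hash in S3."""
    digest = hashlib.sha256(json.dumps(dataset, sort_keys=True).encode()).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=f"{source_id}/{dataset['identifier']}", Body=digest.encode())
    return digest

On the next harvest run, the incoming record's hash would be compared against the object stored under the same key.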

jbrown-xentity commented 8 months ago

Ah. So this becomes a sticky problem. Currently CKAN houses the original, raw metadata in the harvest_object table, and can report that on demand. The dataset has a link to that item in the harvest_object table. We'll need to recreate something similar, whereby we have CKAN (or S3 per Tyler's suggestion) store the original (maybe original but sorted?) metadata to compare the source against.

jbrown-xentity commented 8 months ago

After discussing with @rshewitt, we're going to move forward with putting the raw metadata into the catalog. There are a couple of reasons for this, chiefly that the raw metadata is often referenced and is currently available and used via the API. Since the transformations from CSDGM and/or ISO to DCAT-US are "lossy" (not all fields in CSDGM and ISO have an equivalent in DCAT-US), we want the raw metadata available to end users; it's not just for the harvesting process. That said, we now need to build a few components into the load function: tracking which "source" a dataset came from, and storing the raw DCAT-US JSON object as an extra. Then we can write a "CKAN extract", which pulls all datasets for a given source, hashes the raw object, and sends the results to the compare step. See the diagram above for the full workflow. Tagging @btylerburton for awareness.
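
A sketch of what that "CKAN extract" step could look like, assuming the source id and the raw DCAT-US record are stored as extras; the extra names harvest_source_id and dcat_metadata (and the Solr field syntax in fq) are placeholders, not confirmed field names. The output is a dict of identifier to hash, which would feed the compare sketch above.

import hashlib
import json
import requests

CKAN_URL = "https://catalog-dev.data.gov/api/action/package_search"

def extract_hashes(source_id: str) -> dict:
    """Pull every dataset for a harvest source and hash its stored raw DCAT-US extra."""
    hashes, start, rows = {}, 0, 1000
    while True:
        res = requests.get(CKAN_URL, params={
            "fq": f'harvest_source_id:"{source_id}"',  # placeholder field name
            "rows": rows,
            "start": start,
        }).json()
        results = res["result"]["results"]
        if not results:
            break
        for pkg in results:
            extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
            raw = json.loads(extras.get("dcat_metadata", "{}"))  # placeholder extra name
            hashes[extras.get("identifier", pkg["name"])] = hashlib.sha256(
                json.dumps(raw, sort_keys=True).encode()).hexdigest()
        start += rows
    return hashes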

rshewitt commented 8 months ago

Package search on dev wasn't working as intended for a recently added dataset. I used catalog-dev.data.gov as the CKAN route; querying for something like the programCode would return no results when it should return something. @jbrown-xentity thinks there could be an issue with Solr for that route. Package creation has to be re-enabled via the nginx config; I'm not sure if that contributes to this issue. @FuhuXia, do you know what could be causing this?

rshewitt commented 8 months ago

Changing the CKAN route to catalog-dev-admin-datagov.app.cloud.gov fixed the issue.

rshewitt commented 8 months ago

Pagination is possible using something like:

import requests

num_rows = 1000   # CKAN caps the rows returned per package_search request (commonly 1000)
count = 300000    # approximate total number of datasets to page through

for start in range(0, count, num_rows):
    url = f"https://catalog.data.gov/api/action/package_search?q=*:*&rows={num_rows}&start={start}"
    res = requests.get(url)
    # do something with the response, e.g. res.json()["result"]["results"]

Confirmed this works against catalog.