GSA / data.gov

Main repository for the data.gov service
https://data.gov

Create Initial Compare Function for DCAT-US #4557

Closed. btylerburton closed this issue 6 months ago

btylerburton commented 9 months ago

User Story

In order to load test our compare solution, datagovteam wants to develop the initial iteration of our compare app functionality for DCAT-US.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

For the operations below, assume a for loop over the harvest_source_map is in progress:

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

[diagram attachment]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

rshewitt commented 8 months ago

compare branch: added compare logic and a unit test. Still need to add an integration test against a real CKAN endpoint.
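
As a rough illustration of the shape the compare logic could take (the names compare_sets, harvest_items, and ckan_items are made up for this sketch, not the actual implementation), assuming each side is a dict mapping a dataset identifier to a hash of its raw DCAT-US record:

def compare_sets(harvest_items: dict, ckan_items: dict) -> dict:
    """Bucket dataset identifiers into create/update/delete sets by comparing hashes."""
    harvest_ids = set(harvest_items)
    ckan_ids = set(ckan_items)
    return {
        # present in the harvest source but not in CKAN
        "create": harvest_ids - ckan_ids,
        # present in both, but the hashes differ
        "update": {i for i in harvest_ids & ckan_ids if harvest_items[i] != ckan_items[i]},
        # present in CKAN but no longer in the harvest source
        "delete": ckan_ids - harvest_ids,
    }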

rshewitt commented 8 months ago

CKAN adds fields to a dataset that don't, or may not, derive from the catalog itself (e.g. metadata_created defaults to utcnow, license_id appears to default to "notspecified" if unspecified, and information about the dataset's organization is auto-populated; see picture).

[Screenshot 2024-01-02 at 9:58:08 AM]

If we intend to compare by hash, we would need to strip everything CKAN adds to ensure we're comparing accurately.
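
A minimal sketch of what that cleanup-then-hash step could look like; the field list below is illustrative rather than exhaustive, and the function name is made up:

import hashlib
import json

# Fields CKAN populates on its own (illustrative list, not exhaustive)
CKAN_ADDED_FIELDS = {"id", "metadata_created", "metadata_modified",
                     "license_id", "organization", "creator_user_id", "state"}

def clean_and_hash(ckan_dataset: dict) -> str:
    """Drop CKAN-populated fields, then hash the remainder deterministically."""
    cleaned = {k: v for k, v in ckan_dataset.items() if k not in CKAN_ADDED_FIELDS}
    # sort_keys keeps the hash independent of key ordering
    return hashlib.sha256(json.dumps(cleaned, sort_keys=True).encode()).hexdigest()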

btylerburton commented 8 months ago

Alternatively, we can hash the dataset prior to pushing it to CKAN and store the hash in S3. Then we compare the incoming hash with the previously recorded one. This also allows us to bypass the CKAN API for fetching the datasets and to control the amount of information that is hashed.
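
A rough sketch of that approach, assuming boto3 and an S3 bucket; the bucket name, key scheme, and function name are made up for illustration:

import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "datagov-harvest-hashes"  # hypothetical bucket name

def record_hash(source_id: str, dataset: dict) -> str:
    """Hash the raw DCAT-US record before pushing to CKAN and store the hash in S3."""
    digest = hashlib.sha256(json.dumps(dataset, sort_keys=True).encode()).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=f"{source_id}/{dataset['identifier']}", Body=digest.encode())
    return digest

On the next harvest run, the incoming record's hash would be compared against the object stored under the same key.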

jbrown-xentity commented 8 months ago

Ah. So this becomes a sticky problem. Currently CKAN houses the original, raw metadata in the harvest_object table, and can report that on demand. The dataset has a link to that item in the harvest_object table. We'll need to recreate something similar, whereby we have CKAN (or S3 per Tyler's suggestion) store the original (maybe original but sorted?) metadata to compare the source against.

jbrown-xentity commented 8 months ago

After discussing with @rshewitt, we're going to move forward with putting the raw metadata into the catalog. There are a couple of reasons for this, chiefly that the raw metadata is often referenced and is currently available and used via the API. Since the transformations from CSDGM and/or ISO to DCAT-US are "lossy" (not all fields in CSDGM and ISO have an equivalent in DCAT-US), we want the raw metadata available to end users; it's not just for the harvesting process. That said, we now need to build a few components into the load function: tracking which "source" a dataset came from, and storing the raw DCAT-US JSON object as an extra. Then we can write a "CKAN extract", which pulls all datasets for a given source, hashes the raw object, and sends the results to the compare step. See the diagram above for the full workflow. Tagging @btylerburton for awareness.
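
A sketch of what that "CKAN extract" step could look like, assuming the source id and the raw DCAT-US record are stored as extras; the extra names harvest_source_id and dcat_metadata (and the Solr field syntax in fq) are placeholders, not confirmed field names. The output is a dict of identifier to hash, which would feed the compare sketch above.

import hashlib
import json
import requests

CKAN_URL = "https://catalog-dev.data.gov/api/action/package_search"

def extract_hashes(source_id: str) -> dict:
    """Pull every dataset for a harvest source and hash its stored raw DCAT-US extra."""
    hashes, start, rows = {}, 0, 1000
    while True:
        res = requests.get(CKAN_URL, params={
            "fq": f'harvest_source_id:"{source_id}"',  # placeholder field name
            "rows": rows,
            "start": start,
        }).json()
        results = res["result"]["results"]
        if not results:
            break
        for pkg in results:
            extras = {e["key"]: e["value"] for e in pkg.get("extras", [])}
            raw = json.loads(extras.get("dcat_metadata", "{}"))  # placeholder extra name
            hashes[extras.get("identifier", pkg["name"])] = hashlib.sha256(
                json.dumps(raw, sort_keys=True).encode()).hexdigest()
        start += rows
    return hashes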

rshewitt commented 8 months ago

Package search on dev wasn't working as intended for a recently added dataset. I used catalog-dev.data.gov as the CKAN route; querying for something like the programCode would return no results when it should return something. @jbrown-xentity thinks there could be an issue with Solr for that route. Package creation has to be re-enabled via the nginx config; I'm not sure if that contributes to this issue. @FuhuXia, do you know what could be causing this?

rshewitt commented 8 months ago

Changing the CKAN route to catalog-dev-admin-datagov.app.cloud.gov fixed the issue.

rshewitt commented 8 months ago

Pagination is possible using something like:

import requests

num_rows = 1000   # CKAN caps the rows returned per package_search request (commonly 1000)
count = 300000    # approximate total number of datasets to page through

for start in range(0, count, num_rows):
    url = f"https://catalog.data.gov/api/action/package_search?q=*:*&rows={num_rows}&start={start}"
    res = requests.get(url)
    # do something with the response, e.g. res.json()["result"]["results"]

Confirmed this works against catalog.