Support script to reconcile all collection metadata

anayeaye commented 10 months ago

What

We need a tool for reconciling the corrected collection metadata records in this project with the records in a target VEDA instance (or for bulk loading a new instance). The format can be a CLI or a notebook; it just needs to be reusable.

The architecture wiki contains identifying and usage information for the operational auth and ingest systems.

PI Objective

https://github.com/NASA-IMPACT/veda-architecture/issues/356

Requirements

Dry run mode to test before updating actual collections
Authenticates a user to obtain a token for the ingest API
For each collection JSON in veda-data/ingestion-data/collections:
Requests existing collections/<collection_id> for an existing record
If an existing record exists, merges the summaries information from the existing collection into the veda-data collection JSON locally. EDIT: summaries should be persisted in existing catalogs like staging-stac but should not be published back to veda-data (this requirement changed after a 2024-02-01 walkthrough in which we realized that storing this aggregated data in veda-data would not be beneficial and could be problematic).
Publishes the merged/best record to the target ingestion api/collections endpoint
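A minimal sketch of how the steps above could fit together, not the actual support script. The names `merge_summaries`, `reconcile`, `fetch_existing`, and `publish` are hypothetical; the two callables stand in for the real stac-api lookup and ingest API request, which are not specified in this issue.

```python
import copy
from typing import Callable, Iterable, Optional


def merge_summaries(local: dict, existing: Optional[dict]) -> dict:
    """Return a copy of `local` with `summaries` carried over from `existing`.

    The input `local` record is never mutated, so merged summaries are not
    written back to the veda-data collection JSON.
    """
    merged = copy.deepcopy(local)
    if existing and "summaries" in existing:
        merged["summaries"] = copy.deepcopy(existing["summaries"])
    return merged


def reconcile(
    collections: Iterable[dict],
    fetch_existing: Callable[[str], Optional[dict]],  # hypothetical lookup by collection id
    publish: Callable[[dict], None],                  # hypothetical ingest API call
    dry_run: bool = True,
) -> list:
    """Build the merged record for each collection; publish only when not a dry run."""
    results = []
    for local in collections:
        existing = fetch_existing(local["id"])
        merged = merge_summaries(local, existing)
        if not dry_run:
            publish(merged)
        results.append(merged)
    return results
```

Defaulting `dry_run` to `True` means nothing is published unless the caller explicitly opts in, and the returned list can be inspected before a real run.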

Updated: Expected Differences

We expect the following properties to be added or updated

providers
stac_extensions
renders
license (corrected for some collections that did not use a proper SPDX identifier) (License identifier)
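As a rough illustration, a dry run could report which of these properties actually differ between the existing record and the veda-data record. The helper name `changed_properties` is hypothetical, and the key list simply mirrors the properties named above.

```python
EXPECTED_KEYS = ("providers", "stac_extensions", "renders", "license")


def changed_properties(existing: dict, updated: dict, keys=EXPECTED_KEYS) -> list:
    """Return the sorted names of the given keys whose values differ.

    A key missing from one record and present in the other counts as changed.
    """
    return sorted(k for k in keys if existing.get(k) != updated.get(k))
```

Any difference outside this list would be unexpected and worth reviewing by hand before publishing.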

Dashboard concerns

The dashboard uses the staging database and depends on the 'summaries' property that is added by the ingestion pipeline, so we want to preserve it in the update. Summaries can be recreated using the user-defined postgres update collection summaries function (example).

AC

[x] ~veda-data/ingestion-data/collections updated with summary information obtained from staging database~ EDIT: we determined that summaries should always correspond to the actual items in a given database and should not be published to veda-data, as in: we don't want to publish an empty collection with a summary of items that have not yet been ingested so we will not include summaries in the veda-data project.
[x] ~staging~ dev database collections updated to match the veda-data/ingestion-data/collections metadata which have expanded render and provider information and have been checked for validation errors
[x] bulk loading support script documented and shared in this project--it will be used to bulk load the production database

anayeaye commented 9 months ago

Comment on links:

Stac-fastapi dynamically adds referential links to API responses. We should ignore links in stac-api results with these "rel" types in updates: "root", "collection", "parent", "self", "items".

In some cases, data curators/providers may use links for other purposes, as in the fire vector collections. We should keep links in stac-api results with these "rel" types: "external". If external links are returned in the stac-api response, we should add those external links to the collection document in veda-data.
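A small sketch of the link handling described in this comment, assuming each link is a dict with a "rel" key as in STAC collection JSON. The function name `keep_curated_links` is hypothetical.

```python
# "rel" types that stac-fastapi adds dynamically to API responses.
DYNAMIC_RELS = {"root", "collection", "parent", "self", "items"}


def keep_curated_links(links: list) -> list:
    """Drop dynamically generated links and keep the rest (e.g. "external")."""
    return [link for link in links if link.get("rel") not in DYNAMIC_RELS]
```

This filters by exclusion, so any curator-supplied rel type other than the five dynamic ones is retained; an allow-list of only "external" would be the stricter alternative.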

botanical commented 9 months ago

Based on the discussion we had at the PR walkthrough meeting, should the requirements be updated to something like:

## Requirements
0. Dry run mode to test before updating actual collections
1. Authenticates a user to obtain a token for the ingest api
2. For each collection json in `veda-data/ingestion-data/collections`
3. Requests existing collections/<collection_id> for existing record
4. If existing record exists, merge the summaries information from the existing collection into a copy of the collection json, specifically not writing the merged summaries back to the `veda-data` collection
5. Publishes the merged/best record to the target ingestion api/collections endpoint

Where the fourth requirement is updated to specify that the veda-data collection is not overwritten? 🤔 @anayeaye

anayeaye commented 9 months ago

Thanks @botanical, I updated the requirements to reflect the outcome of the discussion we had in the PR walkthrough. So now the requirements include persisting dataset summaries that already exist in a database but no longer include inserting those summaries into the collection records stored in this project.

botanical commented 9 months ago

My work is technically blocked by this issue: https://github.com/NASA-IMPACT/veda-architecture/issues/384 but I think this is being worked on by @slesaad (PR https://github.com/NASA-IMPACT/veda-backend/pull/293)

@anayeaye please correct me if I'm wrong 😅

anayeaye commented 9 months ago

@botanical that sounds right to me. After @slesaad's PR is merged you could technically run the notebook against the dev veda-backend ingest. If that works we can definitely merge this and call it done based on the dev backend test.

P.S. I just updated the acceptance criteria again to specify updating dev instead of staging.

slesaad commented 9 months ago

the PR is merged, but looks like dev deployment is blocked by auth - see the predeploy failure here; need to fix that first

anayeaye commented 9 months ago

@slesaad actually your merge into develop looks good here. I should have mentioned that I opened a PR for develop>main to see if we would be ready to promote (I will make an issue to work out the issues that are raised by the failed pre deploy check).
