NASA-IMPACT / COSMOS

COSMOS is a web application designed to manage collections indexed in NASA's Science Discovery Engine (SDE), facilitating precise content selection and allowing metadata modification before indexing.
https://sde-indexing-helper.nasa-impact.net/

Ability to tell when new URLs are brought into COSMOS for a given collection #1015

Open code-geek opened 2 months ago


Resources

Description

Right now, URLs are scraped only once, each collection is curated only once, and it is brought into prod only once. By next month, however, we will start reindexing sites: we will go back to a site and rescrape it, and during that process a few things could happen:

Say we reindex a site and get 20 new URLs. Emily will now have to curate those 20 URLs, but right now there is no way to know which 20 are new, or to tell Emily which they are. We need a way both to identify the new URLs and to let Emily know which ones they are.

Similarly, if the titles change for 7 URLs, Emily might need to update her title rules. The webapp needs a mechanism to identify which titles have changed, and Emily needs a place where she can see the updated ones and potentially make fixes or changes to the rules affecting those 7 URLs.
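
The two cases above (new URLs and changed titles) can be pictured as a dictionary comparison between the previous scrape and the new one. This is a minimal sketch and not COSMOS code: `ScrapeDelta`, `compute_delta`, and the URL-to-title dictionaries are hypothetical names used only for illustration, and it assumes each scrape can be reduced to a mapping from URL to title.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeDelta:
    """Differences between two scrapes of one collection (hypothetical structure)."""

    # URLs present in the new scrape but not in the old one.
    new_urls: list[str] = field(default_factory=list)
    # URL -> (old_title, new_title) for URLs whose title differs between scrapes.
    changed_titles: dict[str, tuple[str, str]] = field(default_factory=dict)


def compute_delta(old: dict[str, str], new: dict[str, str]) -> ScrapeDelta:
    """Compare two URL -> title mappings and report what changed."""
    delta = ScrapeDelta()
    for url, new_title in new.items():
        if url not in old:
            delta.new_urls.append(url)
        elif old[url] != new_title:
            delta.changed_titles[url] = (old[url], new_title)
    return delta


# Illustrative data only; these URLs are placeholders.
old_scrape = {
    "https://example.com/a": "Page A",
    "https://example.com/b": "Page B",
}
new_scrape = {
    "https://example.com/a": "Page A (revised)",
    "https://example.com/b": "Page B",
    "https://example.com/c": "Page C",
}

delta = compute_delta(old_scrape, new_scrape)
print(delta.new_urls)
print(delta.changed_titles)
```

A comparison like this only works if the old scrape is still available when the new one arrives, which is why retaining old data matters for this issue.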

Existing Process

Right now the URL import process works like this:

This obviously loses information such as:

We need to rethink this process so that we can preserve old data while also highlighting anything new we've brought in.

Implementation Considerations

Open Questions

Do we need to retain the old URLs so that we can get the delta? How will we retain them? Do we add a Slack feature in the long term to notify us when deltas are discovered? Do we set up the API to not supply new or changed URLs to prod SDE?
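
To make the last question concrete, here is one possible shape for holding back uncurated URLs, offered as a sketch under stated assumptions and not as a decision. `UrlRecord`, its `delta_status` field, the status values, and `urls_for_prod` are all hypothetical; nothing here reflects the existing COSMOS models or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    """A scraped URL with a hypothetical marker for how it arrived."""

    url: str
    title: str
    # Assumed values: "unchanged", "new", or "changed".
    delta_status: str = "unchanged"


# Statuses that would be withheld from prod until a curator has reviewed them.
HELD_BACK = {"new", "changed"}


def urls_for_prod(records: list[UrlRecord]) -> list[UrlRecord]:
    """Return only the records that are safe to supply to prod SDE."""
    return [r for r in records if r.delta_status not in HELD_BACK]


# Illustrative data only; these URLs are placeholders.
records = [
    UrlRecord("https://example.com/a", "Page A"),
    UrlRecord("https://example.com/b", "Page B (revised)", delta_status="changed"),
    UrlRecord("https://example.com/c", "Page C", delta_status="new"),
]

for record in urls_for_prod(records):
    print(record.url)
```

Under this assumption, a record would return to "unchanged" once it has been curated, at which point the API would start supplying it again.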

Deliverable

Dependencies

No response