COSMOS is a web application designed to manage collections indexed in NASA's Science Discovery Engine (SDE), facilitating precise content selection and allowing metadata modification before indexing.
Right now, URLs are scraped only once, the collection is curated only once, and it is brought into prod only once. However, by next month we will start reindexing sites. When we go back to a site and rescrape it, a few things could happen:
URLs could disappear
new URLs could be added
the metadata for old URLs could change (full text, title, etc.)
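To make the three cases concrete, here is a minimal sketch of how a rescrape delta could be computed, assuming the old and new crawls are available as URL-keyed metadata dictionaries (all names here are hypothetical, not the actual COSMOS code):

```python
def compute_delta(old: dict[str, dict], new: dict[str, dict]) -> dict[str, set[str]]:
    """Classify URLs from a rescrape against the previous crawl.

    `old` and `new` map each URL to its scraped metadata, e.g.
    {"https://example.nasa.gov/a": {"scraped_title": "...", "full_text": "..."}}.
    """
    old_urls, new_urls = set(old), set(new)
    return {
        "added": new_urls - old_urls,               # new URLs
        "removed": old_urls - new_urls,             # URLs that disappeared
        "changed": {u for u in old_urls & new_urls  # metadata differs
                    if old[u] != new[u]},
    }
```

The `added` set is exactly the "20 new URLs" in the scenario that follows.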
Say we reindex a site and get 20 new URLs. Emily will now have to curate those 20 URLs, but right now there is no way to know which 20 are new, or to tell Emily which 20 they are. We need a way both to identify the new URLs and to surface them to Emily.
Similarly, if the titles change for 7 URLs, Emily might need to update her title rules. The webapp needs a mechanism to identify which titles have changed, and Emily needs a place where she can see the updated ones and potentially fix or change those specific rules.
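For the title case specifically, one way to point Emily at the right rules would be to intersect the changed URLs with the URLs each title rule matches. This is only a hypothetical helper; the real rule model in COSMOS will look different:

```python
def affected_title_rules(changed_urls: set[str], title_rules: list[dict]) -> list[dict]:
    """Return the title rules whose pattern matches at least one changed URL.

    Assumes each rule looks like {"id": 7, "match_pattern": "/docs/"} and
    applies to any URL containing its pattern (a deliberate simplification).
    """
    return [
        rule for rule in title_rules
        if any(rule["match_pattern"] in url for url in changed_urls)
    ]
```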
Existing Process
Right now the URL import process works like this:
Delete all existing Candidate URLs
Bring in fresh Candidate URLs from whichever server the user chooses
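Schematically, that is a destructive replace. In simplified in-memory form (hypothetical names, not the real implementation):

```python
# In-memory stand-in for the Candidate URL table; the real app persists these.
candidate_urls: dict[str, dict] = {}

def import_candidate_urls(fresh: dict[str, dict]) -> None:
    """Current behavior: wipe the store, then load the fresh crawl.

    Anything attached to the old rows (scraped_title, curation state, ...)
    is gone, and nothing records what appeared, vanished, or changed.
    """
    candidate_urls.clear()        # 1. delete all existing Candidate URLs
    candidate_urls.update(fresh)  # 2. bring in fresh Candidate URLs
```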
This obviously loses information such as:
How many URLs were there before
How many new URLs have been brought in
How many old URLs were removed
...and any associated metadata (scraped_title, etc.)
We need to rethink this process so that we can preserve old data while also highlighting anything new we've brought in.
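One possible rework, sketched under the same in-memory assumptions as above: instead of clearing the store, merge the fresh crawl in and tag every row with a delta status the frontend can filter on. Field names and statuses are placeholders:

```python
def reimport_with_delta(store: dict[str, dict], fresh: dict[str, dict]) -> None:
    """Merge a fresh crawl into the existing store instead of replacing it.

    Every row gains a "delta_status" of NEW, CHANGED, UNCHANGED, or MISSING,
    so the curation UI can show Emily exactly the rows that need attention.
    """
    for url, metadata in fresh.items():
        previous = {k: v for k, v in store.get(url, {}).items() if k != "delta_status"}
        if url not in store:
            store[url] = {**metadata, "delta_status": "NEW"}
        elif previous != metadata:
            store[url] = {**metadata, "delta_status": "CHANGED"}
        else:
            store[url]["delta_status"] = "UNCHANGED"
    for url in store.keys() - fresh.keys():
        # Keep disappeared URLs (flagged, not deleted) so the delta stays
        # visible and curation history is preserved.
        store[url]["delta_status"] = "MISSING"
```

Filtering on `delta_status == "NEW"` then answers the "which 20 are new" question directly, and `CHANGED` feeds the title-rule review.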
Implementation Considerations
Needs a lot of frontend changes
Might need to be taken on fairly soon
Open Questions
Do we need to retain the old URLs so that we can compute the delta?
How will we retain them?
Do we add a Slack notification in the long term when deltas are discovered? (A sketch follows this list.)
Do we set up the API so that it does not supply new or changed URLs to prod SDE?
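On the Slack question: if it goes ahead, a delta summary could be posted with a standard Slack incoming webhook. A minimal sketch; the webhook URL and message format are placeholders:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder

def notify_delta(collection: str, added: int, removed: int, changed: int) -> None:
    """Post a one-line reindex delta summary to Slack via an incoming webhook."""
    text = (f"Reindex of {collection}: {added} new URLs, "
            f"{removed} removed, {changed} changed.")
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
```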