COSMOS is a web application designed to manage collections indexed in NASA's Science Discovery Engine (SDE), facilitating precise content selection and allowing metadata modification before indexing.
Right now, URLs are scraped only once, the collection is curated only once, and it is brought into prod only once. However, by next month we will start reindexing sites. When we go back to a site and rescrape it, a few things could happen:
URLs could disappear
new URLs could be added
the metadata for old URLs could change (full text, title, etc.)
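To make the three cases concrete, here is a minimal sketch of how a rescrape delta could be computed, assuming the old and new crawls are available as URL-keyed metadata dictionaries (all names here are hypothetical, not the actual COSMOS code):

```python
def compute_delta(old: dict[str, dict], new: dict[str, dict]) -> dict[str, set[str]]:
    """Classify URLs from a rescrape against the previous crawl.

    `old` and `new` map each URL to its scraped metadata, e.g.
    {"https://example.nasa.gov/a": {"scraped_title": "...", "full_text": "..."}}.
    """
    old_urls, new_urls = set(old), set(new)
    return {
        "added": new_urls - old_urls,               # new URLs
        "removed": old_urls - new_urls,             # URLs that disappeared
        "changed": {u for u in old_urls & new_urls  # metadata differs
                    if old[u] != new[u]},
    }
```

The `added` set is exactly the "20 new URLs" in the scenario that follows.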
Say we reindex a site and get 20 new URLs. Emily will now have to curate those 20 URLs, but right now there is no way to know which 20 are new, or to tell Emily which 20 they are. We need a way both to identify the new URLs and to surface them to Emily.
Similarly, if the titles change for 7 URLs, Emily might need to update her title rules. The webapp needs a mechanism to identify which titles have changed, and Emily needs a place where she can see the updated ones and potentially fix or change those specific rules.
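For the title case specifically, one way to point Emily at the right rules would be to intersect the changed URLs with the URLs each title rule matches. This is only a hypothetical helper; the real rule model in COSMOS will look different:

```python
def affected_title_rules(changed_urls: set[str], title_rules: list[dict]) -> list[dict]:
    """Return the title rules whose pattern matches at least one changed URL.

    Assumes each rule looks like {"id": 7, "match_pattern": "/docs/"} and
    applies to any URL containing its pattern (a deliberate simplification).
    """
    return [
        rule for rule in title_rules
        if any(rule["match_pattern"] in url for url in changed_urls)
    ]
```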
Existing Process
Right now the URL import process works like this:
Delete all existing Candidate URLs
Bring in fresh Candidate URLs from whichever server the user chooses
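Schematically, that is a destructive replace. In simplified in-memory form (hypothetical names, not the real implementation):

```python
# In-memory stand-in for the Candidate URL table; the real app persists these.
candidate_urls: dict[str, dict] = {}

def import_candidate_urls(fresh: dict[str, dict]) -> None:
    """Current behavior: wipe the store, then load the fresh crawl.

    Anything attached to the old rows (scraped_title, curation state, ...)
    is gone, and nothing records what appeared, vanished, or changed.
    """
    candidate_urls.clear()        # 1. delete all existing Candidate URLs
    candidate_urls.update(fresh)  # 2. bring in fresh Candidate URLs
```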
This obviously loses information such as:
How many URLs were there before
How many new URLs have been brought in
How many old URLs were removed
...and any associated metadata (scraped_title, etc.)
We need to rethink this process so that we can preserve old data while also highlighting anything new we've brought in.
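One possible rework, sketched under the same in-memory assumptions as above: instead of clearing the store, merge the fresh crawl in and tag every row with a delta status the frontend can filter on. Field names and statuses are placeholders:

```python
def reimport_with_delta(store: dict[str, dict], fresh: dict[str, dict]) -> None:
    """Merge a fresh crawl into the existing store instead of replacing it.

    Every row gains a "delta_status" of NEW, CHANGED, UNCHANGED, or MISSING,
    so the curation UI can show Emily exactly the rows that need attention.
    """
    for url, metadata in fresh.items():
        previous = {k: v for k, v in store.get(url, {}).items() if k != "delta_status"}
        if url not in store:
            store[url] = {**metadata, "delta_status": "NEW"}
        elif previous != metadata:
            store[url] = {**metadata, "delta_status": "CHANGED"}
        else:
            store[url]["delta_status"] = "UNCHANGED"
    for url in store.keys() - fresh.keys():
        # Keep disappeared URLs (flagged, not deleted) so the delta stays
        # visible and curation history is preserved.
        store[url]["delta_status"] = "MISSING"
```

Filtering on `delta_status == "NEW"` then answers the "which 20 are new" question directly, and `CHANGED` feeds the title-rule review.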
Implementation Considerations
Needs a lot of frontend changes
Might need to be taken on fairly soon
Open Questions
Do we need to retain the old URLs so that we can compute the delta?
How will we retain them?
Do we add a Slack notification in the long term when deltas are discovered? (A sketch follows this list.)
Do we set up the API so that it does not supply new or changed URLs to prod SDE?
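On the Slack question: if it goes ahead, a delta summary could be posted with a standard Slack incoming webhook. A minimal sketch; the webhook URL and message format are placeholders:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX"  # placeholder

def notify_delta(collection: str, added: int, removed: int, changed: int) -> None:
    """Post a one-line reindex delta summary to Slack via an incoming webhook."""
    text = (f"Reindex of {collection}: {added} new URLs, "
            f"{removed} removed, {changed} changed.")
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
```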