Overview of procedure and documentation

Overview

CEDA manages archives of CMIP6, CORDEX, Obs4MIPs etc. In most cases, we do not hold all the data for these projects. CEDA/JASMIN users can request that we obtain and archive additional datasets from ESGF.

Aim

We need to create, and document, a procedure for the above.

Workflow

The basic user request workflow is:

1. User contacts support@ceda.ac.uk with a request.
2. CEDA staff discuss the request and convert it into an appropriate format (the SELECTION_FILE).
3. Synda is run (possibly as a user with read-only access to the DB) to:
- identify all datasets that would be replicated (and store them as the DATA_REQUESTED file).
- calculate the size of the request.
4. If the request is large (>200GB), CEDA uses an agreed process for deciding whether to action or reject the request.
5. CEDA contacts the user to confirm the decision in (4).
6. CEDA adds the SELECTION_FILE to the Synda queue.
7. At agreed follow-up time(s), CEDA checks whether the DATA_REQUESTED file has been satisfied (i.e. all datasets are replicated, archived and published). We will need to write scripts for this check.
8. CEDA contacts the user to confirm the outcome.
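
The follow-up check on the DATA_REQUESTED file could be sketched as below. This is a minimal illustration, not an existing CEDA script: it assumes the DATA_REQUESTED file holds one dataset ID per line, and that a list of published dataset IDs can be obtained by some other means. The function names are hypothetical.

```python
from pathlib import Path


def read_dataset_ids(path):
    """Return the set of dataset IDs in a text file (assumed: one ID per line)."""
    ids = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


def outstanding_datasets(requested_ids, published_ids):
    """Return the requested dataset IDs that are not yet published, sorted."""
    return sorted(set(requested_ids) - set(published_ids))


def is_satisfied(requested_ids, published_ids):
    """A request is satisfied when no requested dataset is outstanding."""
    return not outstanding_datasets(requested_ids, published_ids)
```

A follow-up job could call `outstanding_datasets` at each agreed review time and report the remaining IDs in the user's query.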

Information management and documentation

In order to support this service, we need to provide:

Information on the Help Page about requesting data. Add to: https://help.ceda.ac.uk/article/4801-cmip6-data
Links to that Help Page from the CMIP6 catalogue records.
A public explanation of the replication priorities in a user-friendly format:
- including ongoing requests/retrievals (e.g. those that will just pick up new models/exps when they are created)
- including current requests (for specific users and/or projects)
- including historical requests
Wiki pages that explain the procedures and documentation.

Proposed information management system

HelpScout issues will be used to manage the user interactions:

User sends a message to create a query in HelpScout (HS).
We reply with a template response that asks the appropriate questions and gathers all the required information.
The selection file is named after the HS Query Number.
Timestamps in file names could indicate when the files were created, submitted, etc.
Selection files are kept in a directory per class of request, such as:
- core_ongoing
- core_historical
- user_current
- user_historical
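
One possible way to combine the HS Query Number, a timestamp and the request class into a file path is sketched below. The `hs<number>_<YYYYMMDD>.txt` pattern and the function name are assumptions made for illustration only; the actual naming convention is still to be agreed.

```python
from datetime import datetime, timezone
from pathlib import Path

# Request classes taken from the directory list above.
REQUEST_CLASSES = ("core_ongoing", "core_historical", "user_current", "user_historical")


def selection_file_path(root, request_class, hs_query_number, created=None):
    """Build a selection file path (hypothetical naming pattern)."""
    if request_class not in REQUEST_CLASSES:
        raise ValueError(f"unknown request class: {request_class!r}")
    # Default to the current UTC time when no creation time is given.
    created = created or datetime.now(timezone.utc)
    stamp = created.strftime("%Y%m%d")
    return Path(root) / request_class / f"hs{hs_query_number}_{stamp}.txt"
```

For example, `selection_file_path("selections", "user_current", 12345)` would place today's file for query 12345 under `selections/user_current/`.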

Public/private sensitivities

Any discussion of a specific request that must be kept out of the public domain can take place via the CEDA Helpdesk query.

Content for the HelpScout Response Template

The Response Template can include:

Have you already downloaded all/some of this data on JASMIN? If yes, please delete your current copy in order to make space for other users/data.

Other issues

We will need a way of managing multiple requests over time, i.e. a user might put in many small requests that add up to a large volume.
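
As a rough sketch of how cumulative volume might be tracked, the snippet below sums request sizes per user and flags anyone whose total exceeds a limit. The record layout (user, size in bytes), the function names and the use of decimal gigabytes are all assumptions; no such record currently exists in this procedure.

```python
from collections import defaultdict

GB = 10**9  # decimal gigabyte, assumed for illustration


def cumulative_volume_by_user(requests):
    """Sum request sizes per user from (user, size_bytes) pairs."""
    totals = defaultdict(int)
    for user, size_bytes in requests:
        totals[user] += size_bytes
    return dict(totals)


def users_over_limit(requests, limit_bytes):
    """Return, sorted, the users whose combined requests exceed the limit."""
    totals = cumulative_volume_by_user(requests)
    return sorted(user for user, total in totals.items() if total > limit_bytes)
```

With this approach, two requests of 150GB and 120GB from the same user would be flagged against a 250GB limit even though neither is large on its own.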

Workflow as agreed by @agstephens @alaniwi @charliepascoe (16/11/2021):

1. A query comes in (if not via HelpScout (HS), we ask the user to send it to support@ceda.ac.uk).
2. AI/AS responds using the HS template response (unless the user has already provided a very clear requirement that does not need further discussion).
3. User responds via HS.
4. AI/AS converts the request to selection file(s).
5. AI/AS runs Synda (on the Synda machine) using the selection file(s) to get an estimate of the volume.
6. If large (>250GB): discuss with AS.
7. If CEDA says too big: tell user and STOP (or agree a smaller request).
8. If the volume is OK: continue.
9. Add the selection file(s) to the appropriate user directory in this repository using the naming convention.
10. Add, commit, and push to GitHub.
11. AI/AS tells user: replication initiated, will review in N days.
12. After N days: AI/AS to review request.
13. If completed: tell user.
- NOTE: we have a script to check whether queries have completed for a given selection file.
14. If not yet completed: go to 11 (unless AI/AS assesses that the job will not complete).
15. Move the selection file(s) from the user_current to the user_historical directory based on the agreed file-naming/directory-naming conventions (#6).
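
The size check and the final move between directories could look roughly like the sketch below. It assumes the volume estimate is already available in bytes (obtaining it from Synda is not shown), treats ">250GB" as decimal gigabytes, and uses hypothetical function names; it is not the script referred to in the NOTE above.

```python
from pathlib import Path

# ">250GB" threshold from the agreed workflow; decimal GB is an assumption.
LARGE_REQUEST_BYTES = 250 * 10**9


def needs_discussion(estimated_bytes, threshold=LARGE_REQUEST_BYTES):
    """Return True when the estimated volume exceeds the threshold."""
    return estimated_bytes > threshold


def archive_selection_file(repo_root, filename):
    """Move a selection file from user_current to user_historical."""
    repo_root = Path(repo_root)
    source = repo_root / "user_current" / filename
    target_dir = repo_root / "user_historical"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Refuse to overwrite an existing historical record.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

The move only changes the working copy, so it would still need to be added, committed and pushed to GitHub like any other change to the repository.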

cedadev / cmip6-replication