Before scheduling this work, I propose waiting until we stand up the JupyterHub cluster we plan to host in our AWS account to give curators remote curation access. That work may alleviate the pain points that led to this request, such that it's not as pressing to implement this feature.
Edit--added context: The semi-automation requests were made to alleviate how long downloading/uploading large datasets takes, and how many local resources it consumes, just to process relatively small changes. Implementing remote semi-automation of dataset transformations has to be tightly limited/parametrized or otherwise undergo a rigorous security review, as it is risky to allow even authenticated users to submit arbitrary code snippets to run against parts of our corpus. It is preferable to see if the JupyterHub solution alleviates concerns about long curation times for large datasets enough to avoid semi-automation.
Create endpoint PATCH /v1/collections/{collection_id}/datasets for submitting uns + obs dataset updates across a revision
Propose a request schema capturing common patterns for scripting dataset updates (e.g. {"update": {"cell_ontology_term_id": "CL:0000001", "donor_id": 1}} to capture 'update cell_ontology_term_id to CL:0000001 if donor_id is 1')
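A minimal sketch of how such a payload could be interpreted, assuming a variant of the schema that separates the values to write (`update`) from an explicit match condition (`where`); both key names and the row-dict representation of obs are illustrative, not a final API:

```python
def matches(obs_row, where):
    """True when the obs row satisfies every condition, or when there is none."""
    return all(obs_row.get(k) == v for k, v in (where or {}).items())

def apply_patch(obs_row, patch):
    """Return a copy of the obs row, with the update applied if the condition holds."""
    if matches(obs_row, patch.get("where")):
        return {**obs_row, **patch["update"]}
    return dict(obs_row)

# 'update cell_ontology_term_id to CL:0000001 if donor_id is 1'
patch = {"update": {"cell_ontology_term_id": "CL:0000001"}, "where": {"donor_id": 1}}
row = apply_patch({"cell_ontology_term_id": "CL:0000000", "donor_id": 1}, patch)
```

Separating the condition from the update keeps the payload unambiguous when the same field appears in both roles.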
Input validation in the endpoint to determine whether it is a valid update action
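One way the endpoint could reject malformed actions is to check the payload against an allowlist of curator-editable fields; the field list and function below are illustrative stand-ins, since the real set of editable obs/uns fields would come from the cellxgene schema definition:

```python
# Hypothetical allowlist of fields curators may edit via this endpoint.
EDITABLE_FIELDS = {"cell_ontology_term_id", "donor_id", "tissue_ontology_term_id"}

def validate_update_request(body):
    """Return a list of validation errors; an empty list means the request is valid."""
    update = body.get("update")
    if not isinstance(update, dict) or not update:
        return ["'update' must be a non-empty object"]
    errors = []
    for field in update:
        if field not in EDITABLE_FIELDS:
            errors.append(f"field '{field}' is not editable via this endpoint")
    return errors
```

Validating up front lets the endpoint return a 4xx before any batch job is triggered.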
Trigger a batch job that will parse the request schema and perform the actions on datasets in the revision if conditions are met (or on all datasets in the revision if there are no conditions to check)
Create a new batch job, similar to dataset_metadata_update.py, that iterates across datasets in a revision, parses the request schema for actions, creates new dataset versions, and applies the applicable actions to the relevant dataset H5ADs (while updating dataset statuses accordingly). Then apply those updates to the .rds and .cxg artifacts as well, without going through the full conversion process.
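A rough sketch of the job's core loop, using an in-memory stand-in for dataset versions and their obs tables; the real job would read/write H5ADs (e.g. via anndata) and use the pipeline's actual versioning and status APIs:

```python
from dataclasses import dataclass

@dataclass
class DatasetVersion:
    obs: list            # stand-in for the H5AD's obs table (list of row dicts)
    status: str = "PENDING"

def run_revision_update_job(datasets, update, where=None):
    """Create a new version per dataset and apply the update to matching obs rows."""
    new_versions = []
    for dataset in datasets:
        # Copying rows stands in for creating a new dataset version.
        version = DatasetVersion(obs=[dict(r) for r in dataset.obs])
        try:
            for row in version.obs:
                if all(row.get(k) == v for k, v in (where or {}).items()):
                    row.update(update)
            version.status = "SUCCESS"
        except Exception:
            version.status = "FAILED"
        new_versions.append(version)
    return new_versions
```

Working on a copy mirrors the revision model: the prior dataset version stays untouched until the new one succeeds.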
NOTE: unlike dataset_metadata_update.py, this batch job WILL require re-triggering the validation step if there are updates to the obs
To avoid redundant work, create an optional flag in the cellxgene-schema CLI to skip validating the X / raw.X matrix. Validating with this option will still cover any potential new validation issues while avoiding the most time- and memory-expensive operation.
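The flag could simply gate the matrix pass while the cheap metadata checks always run. The flag name (`--skip-matrix-validation`) and the check functions below are hypothetical stand-ins, not actual cellxgene-schema internals:

```python
import argparse

def check_metadata(dataset):
    # Cheap obs/uns checks that should always run (stand-in implementation).
    return [] if dataset.get("obs") is not None else ["missing obs"]

def check_expression_matrix(dataset):
    # Stand-in for the expensive X / raw.X validation pass.
    return [] if dataset.get("X") is not None else ["missing X"]

def validate(dataset, skip_matrix=False):
    """Run all checks, optionally skipping the matrix validation hotspot."""
    errors = check_metadata(dataset)
    if not skip_matrix:
        errors += check_expression_matrix(dataset)
    return errors

parser = argparse.ArgumentParser(prog="cellxgene-schema validate")
parser.add_argument("--skip-matrix-validation", action="store_true",
                    help="skip validating X / raw.X (hypothetical flag)")
```

Since obs-only updates never touch X, skipping the matrix pass for this batch job does not reduce coverage of the fields that actually changed.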
Discovery ticket:
This solution is an evolution of the solution proposed in https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/gh/chanzuckerberg/single-cell/145, which focuses on uns-only updates (a simpler problem that we already have the infrastructure to accommodate).
Estimate: 5-6 weeks x 2 engineers
Proposed Approach (align with a tech spec first):