DiSSCo / SDR

Specimen Data Refinery

Deliverable 8.3 (June 2022) #78

Open llivermore opened 2 years ago

llivermore commented 2 years ago

Description: Corresponds to SYNTHESYS+ Task 8.3, "Development of cloud platform for data-processing services."

There were no subtasks for this task and its associated Deliverable, but the task description covers the following:

  1. a web-based cloud platform that will handle the execution of workflows held in the registry (task 8.2.6)
  2. an (openly?) accessible user interface
  3. an authentication and authorisation infrastructure (AAI), such as that provided by ELIXIR
  4. a common entry webpage and API endpoint
  5. the ability for users to upload datasets and images and define their own templates
  6. content that can then be retrieved in an appropriate format and standardised for further processing or publication
  7. new metadata generated from the workflows, plus the detailed standardised provenance [...], will be gathered with the samples, packaged into self-describing Research Objects (ROs) and exported to an external datastore
  8. the SDR will retain a ledger logging all specimen identifiers (specimens are uniquely identified) alongside details of ROs, to provide a full specimen log profile and support analytics to further improve the workflows and their services

We do not need to do all of these, but should discuss and decide which are worth doing.

The discussion about integration with Cordra and using DiSSCo's PID system should be part of this Deliverable.
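To seed that discussion, here is a minimal sketch of what registering a specimen as a digital object against a Cordra-backed PID service could look like. It assumes Cordra's REST object API (`POST /objects/?type=...`); the base URL, type name, credentials, and payload fields are illustrative assumptions to verify against the Cordra documentation, not decisions:

```python
# Hypothetical sketch: registering a digital specimen via Cordra's REST API.
# CORDRA_BASE, SPECIMEN_TYPE, and the payload fields are illustrative only.
import requests

CORDRA_BASE = "https://cordra.example.org"   # hypothetical Cordra instance
SPECIMEN_TYPE = "DigitalSpecimen"            # hypothetical schema/type name

def register_specimen(payload: dict, auth: tuple[str, str]) -> str:
    """Create a Cordra object and return the identifier Cordra assigns."""
    resp = requests.post(
        f"{CORDRA_BASE}/objects/",
        params={"type": SPECIMEN_TYPE},  # create-object route, keyed by type
        json=payload,
        auth=auth,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # the PID of the newly minted digital object

pid = register_specimen(
    {"scientificName": "Quercus robur", "catalogNumber": "K000001"},  # dummy data
    auth=("sdr-service", "***"),
)
print(f"Registered specimen PID: {pid}")
```

Whether the SDR mints PIDs directly like this or delegates to DiSSCo's PID infrastructure is exactly the question this Deliverable should settle.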

Associated issues:

(check rest of issues - may need to add more)

llivermore commented 2 years ago

Full task description:

This task will provide the SDR Execution Platform. This web-based cloud platform will handle the execution of workflows held in the registry (task 8.2.6) and provide the user interface, which will be openly accessible. This will be an adapted version of the FAIRDOM SEEK platform and include newly developed and existing components.

An authentication and authorisation infrastructure (AAI), such as that provided by ELIXIR, will be developed to support access. The SDR Execution Platform will have a common entry webpage and API endpoint. The webpage will direct users to tools or workflows held in the registry, depending on their requirements. Administrators will upload datasets and images and define their own templates, including instructions on how human content providers are monitored. Generated content can then be retrieved in an appropriate format and standardised for further processing or publication.
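For illustration only, a sketch of what an upload against the common API endpoint could look like. The URL, route, and field names are assumptions, not a specified interface; the bearer token stands in for whatever credential the AAI issues:

```python
# Hypothetical sketch of the "common entry ... API endpoint" described above.
import requests

SDR_API = "https://sdr.example.org/api/v1"   # hypothetical endpoint

def upload_image(image_path: str, template_id: str, token: str) -> dict:
    """Upload one specimen image against a user-defined template."""
    with open(image_path, "rb") as fh:
        resp = requests.post(
            f"{SDR_API}/uploads",                             # assumed route
            headers={"Authorization": f"Bearer {token}"},     # AAI-issued token
            data={"template": template_id},                   # user-defined template
            files={"image": fh},                              # multipart image upload
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()  # e.g. an upload record referencing a workflow job
```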

For the processed images, new metadata generated from the workflows, plus the detailed standardised provenance (using the W3C PROV standard) of the distributed data processing, will be gathered with the samples, packaged into self-describing Research Objects (ROs) and exported to an external datastore. As several different workflow runs will commonly be needed to generate different kinds of specimen metadata, the SDR will ingest previously processed ROs and/or retain a temporary collection of specimen ROs, depending on the protocol.
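A minimal sketch of what such a self-describing RO could look like, written here as hand-built JSON-LD using PROV terms; a real implementation would more likely use libraries such as ro-crate-py or prov, and the identifiers below are illustrative:

```python
# Sketch: package workflow outputs with W3C PROV provenance as a Research Object.
import json

def build_research_object(specimen_pid: str, workflow_id: str, outputs: list[str]) -> dict:
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "@vocab": "http://schema.org/",
        },
        "@graph": [
            # The workflow run is modelled as a PROV Activity that used the specimen...
            {
                "@id": workflow_id,
                "@type": "prov:Activity",
                "prov:used": {"@id": specimen_pid},
            },
            # ...and each derived metadata file is an Entity generated by that run.
            *[
                {
                    "@id": out,
                    "@type": "prov:Entity",
                    "prov:wasGeneratedBy": {"@id": workflow_id},
                }
                for out in outputs
            ],
        ],
    }

ro = build_research_object(
    "https://hdl.handle.net/20.5000.1025/ABC-123",   # illustrative specimen PID
    "urn:uuid:workflow-run-1",                       # illustrative run identifier
    ["outputs/labels.json", "outputs/ocr.txt"],
)
print(json.dumps(ro, indent=2))
```

Because each RO carries its own provenance graph, a later workflow run can ingest it and extend the graph rather than reprocessing the specimen from scratch.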

The SDR will retain a ledger logging all specimen identifiers (specimens are uniquely identified) alongside details of ROs, to provide a full specimen log profile and support analytics to further improve the workflows and their services. The platform will accommodate generic and bespoke programming languages and modules, as well as Common Workflow Language, as the means of linking bespoke and standardised services and tools. The running of workflows on computational infrastructures will leverage initiatives such as Toil and cwl-tes to aid in the development of high-quality data pipelines.
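The ledger itself is unspecified; one simple reading is an append-only log linking each specimen PID to the ROs produced for it. A sketch under that assumption (file layout and field names are hypothetical):

```python
# Illustrative sketch of the specimen ledger as an append-only JSON Lines log.
import json
from datetime import datetime, timezone
from pathlib import Path

LEDGER = Path("sdr-ledger.jsonl")   # hypothetical location

def log_research_object(specimen_pid: str, ro_id: str, workflow: str) -> None:
    """Append one immutable ledger entry; earlier lines are never rewritten."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "specimen": specimen_pid,
        "research_object": ro_id,
        "workflow": workflow,
    }
    with LEDGER.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

def specimen_log_profile(specimen_pid: str) -> list[dict]:
    """Reconstruct the full processing history of one specimen."""
    with LEDGER.open(encoding="utf-8") as fh:
        return [e for line in fh if (e := json.loads(line))["specimen"] == specimen_pid]
```

In a Toil-based setup, a run launched with e.g. `toil-cwl-runner workflow.cwl job.yml` would append entries like these as it completes, and the per-specimen profile then feeds the analytics mentioned above.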

All services and workflows will be systematically scrutinised based on existing maturity models, automated monitoring and performance testing to ensure their fitness. The SDR will deploy containerised services on High Performance Computing infrastructures provided by the European Open Science Cloud (EOSC) and the European data infrastructure project (EUDAT). The Refinery working data store will be hosted on EOSC or AWS infrastructure.