DataBiosphere / data-platforms

Components of the Commons Alliance
https://databiosphere.github.io/data-platforms/

Data Platforms

This repository gathers components that take part in the Commons Alliance. These components describe interfaces and features that can be assembled to create "data platforms" useful for storing data and metadata and for performing reproducible analyses on them.

This is a living document.

Note: if you are viewing this on GitHub, the images may be cached. For the current version, please visit:

https://databiosphere.github.io/data-platforms/

For more background read the Data Biosphere post.

Visit the DataBiosphere GitHub organization.

Metadata Serialization

Communication between data platforms requires that metadata are serialized in a useful and predictable manner. This document describes approaches and case studies in use by some components.
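As a minimal sketch of what "predictable" could mean in practice, the snippet below serializes a metadata record to JSON with sorted keys and fixed separators, so that two platforms holding the same record emit identical text. The `serialize_metadata` helper and the record fields are hypothetical and are not taken from the Metadata Serialization document.

```python
import json


def serialize_metadata(record: dict) -> str:
    """Serialize a metadata record deterministically (hypothetical helper)."""
    # sort_keys gives a stable key order; fixed separators avoid whitespace drift.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


# Illustrative records only; the field names are assumptions.
first = serialize_metadata({"id": "example-1", "size": 1024})
second = serialize_metadata({"size": 1024, "id": "example-1"})
print(first == second)
```

Because key order no longer depends on how the record was built, the serialized form can also be hashed or compared directly.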

View the Metadata Serialization document.

Identifier Interoperability

When the same metadata are present in multiple locations, it is critical to provide guarantees of identity that are useful and portable. This document describes approaches to presenting and using interfaces that allow identifiers to be usefully exchanged.

View the Identifier Interoperability document.

Prototype

The prototype components of a Commons member

The various components coordinate to create a platform useful for data analysis.

Digital Object Catalog

Provides clients and services with access to resources available in object stores. Digital objects can be files; the catalog maintains a registry of locations where the files can be found, as well as minimal metadata.
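One way to picture such a registry is the in-memory sketch below. It is not the interface of any Commons Alliance implementation: the `DigitalObject` fields, the `DigitalObjectCatalog` class, and the example URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DigitalObject:
    """Minimal metadata for one object; all field names are illustrative."""
    object_id: str
    size: int
    checksum: str
    urls: List[str] = field(default_factory=list)


class DigitalObjectCatalog:
    """In-memory registry mapping object ids to their known locations."""

    def __init__(self) -> None:
        self._objects: Dict[str, DigitalObject] = {}

    def register(self, obj: DigitalObject) -> None:
        # Re-registering an id merges any new locations into the existing record.
        existing = self._objects.get(obj.object_id)
        if existing is None:
            self._objects[obj.object_id] = obj
            return
        for url in obj.urls:
            if url not in existing.urls:
                existing.urls.append(url)

    def locate(self, object_id: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry.
        return list(self._objects[object_id].urls)


# Hypothetical object stored in two different object stores.
catalog = DigitalObjectCatalog()
catalog.register(DigitalObject("obj-1", 10, "abc", ["s3://bucket-a/file"]))
catalog.register(DigitalObject("obj-1", 10, "abc", ["gs://bucket-b/file"]))
print(catalog.locate("obj-1"))
```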

GUID Resolver

Allows globally unique identifiers to be "resolved" to digital objects. For more information please refer to Identifier Interoperability.

Namespace Service

Identifiers can be given different namespaces or "prefixes". The namespace service allows commons members to easily manage GUIDs across projects and domains. For more information please refer to Identifier Interoperability.
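The interplay between prefixes and resolution can be sketched as follows. This is only an illustration under stated assumptions: the `NamespaceService` class, the `prefix:local-id` identifier shape, and the example prefix and location are hypothetical, and real resolvers would typically be network services, not in-process callables.

```python
from typing import Callable, Dict


class NamespaceService:
    """Maps identifier prefixes to resolver callables (illustrative only)."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Callable[[str], str]] = {}

    def register_prefix(self, prefix: str, resolver: Callable[[str], str]) -> None:
        self._resolvers[prefix] = resolver

    def resolve(self, guid: str) -> str:
        # Assumed GUID shape: "<prefix>:<local id>".
        prefix, sep, local_id = guid.partition(":")
        if not sep or prefix not in self._resolvers:
            raise KeyError(f"no resolver registered for {guid!r}")
        return self._resolvers[prefix](local_id)


# Hypothetical prefix and object store location.
locations = {"1234": "s3://example-bucket/sample.bam"}
service = NamespaceService()
service.register_prefix("projA", locations.__getitem__)
print(service.resolve("projA:1234"))
```

Keeping the prefix table separate from the resolvers means a project can move its objects without changing the identifiers that clients hold.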

Data Access

Once data have been discovered they must be localized, which requires interacting with object stores and performing authentication, authorization, downloading, and transfer as necessary.

Access Control

To guarantee authority and authenticity of requests, some access control services are provided. These services will at least be able to identify a client and delegate authority to the access control system of choice.
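A rough sketch of the "identify, then delegate" pattern is shown below. The `AccessControlService` class, the token table, and the backend callable are assumptions for illustration; they do not describe how any listed implementation works.

```python
from typing import Callable, Dict, Optional


class AccessControlService:
    """Identifies a client, then delegates the decision to a chosen backend."""

    def __init__(self, tokens: Dict[str, str],
                 backend: Callable[[str, str, str], bool]) -> None:
        self._tokens = tokens    # token -> client id (stand-in for real authentication)
        self._backend = backend  # the access control system of choice

    def identify(self, token: str) -> Optional[str]:
        return self._tokens.get(token)

    def is_allowed(self, token: str, action: str, resource: str) -> bool:
        client = self.identify(token)
        if client is None:
            return False  # unidentified clients are never authorized
        return self._backend(client, action, resource)


# Hypothetical grants held by the delegated access control system.
grants = {("alice", "read", "dataset-1")}
acl = AccessControlService(
    tokens={"token-abc": "alice"},
    backend=lambda client, action, resource: (client, action, resource) in grants,
)
print(acl.is_allowed("token-abc", "read", "dataset-1"))
```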

Analytical Engine

Software which can orchestrate and execute computational tasks in heterogeneous computing environments.

Tool Repository

A resource which contains templates of reusable computational tasks that can be directed at new data, and then executed by the Analytical Engine.
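To make the idea of a reusable task template concrete, the sketch below stores command templates and binds them to new inputs, producing an argument list that an engine could then run. The `ToolRepository` class and the `count_lines` tool are hypothetical; real tool repositories describe tasks in workflow languages, not shell strings.

```python
import shlex
from string import Template
from typing import Dict, List


class ToolRepository:
    """Stores command templates with $placeholders (illustrative only)."""

    def __init__(self) -> None:
        self._templates: Dict[str, Template] = {}

    def add(self, name: str, command_template: str) -> None:
        self._templates[name] = Template(command_template)

    def render(self, name: str, **inputs: object) -> List[str]:
        # Quote each input so that values containing spaces stay one argument.
        quoted = {key: shlex.quote(str(value)) for key, value in inputs.items()}
        # substitute() raises KeyError if a placeholder is left unfilled.
        return shlex.split(self._templates[name].substitute(quoted))


# Hypothetical tool: the same template is directed at a new input file.
repo = ToolRepository()
repo.add("count_lines", "wc -l $input")
print(repo.render("count_lines", input="reads 1.txt"))
```

The sketch stops at rendering the argument list; scheduling and executing it is the Analytical Engine's job.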

Workspaces

Clients accessing a commons infrastructure should be able to manage data for secondary and tertiary data analysis.

Indexing and Search

Data in commons infrastructure should be findable using search mechanisms. Indexing makes data available for search.

Ontology

A controlled vocabulary informs indexers and/or querying applications to make metadata usable.

Metadata Indexer

Metadata made available by a platform is indexed into a store. Indexers allow data to be made findable using a structured document scheme.

Metadata Querying

Once metadata have been indexed into a platform, these indices are made available by services that allow queries to be formed against the metadata.
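The indexing and querying steps can be illustrated with a tiny inverted index. This is a sketch only: the `MetadataIndex` class and the example documents are assumptions, and production indexers would normally rely on a dedicated search store.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Tiny inverted index over (field, value) pairs; illustrative only."""

    def __init__(self) -> None:
        self._postings: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def index(self, doc_id: str, document: Dict[str, str]) -> None:
        # Each field=value pair points back to the documents that contain it.
        for field_name, value in document.items():
            self._postings[(field_name, value)].add(doc_id)

    def query(self, **criteria: str) -> List[str]:
        # A document matches only if it satisfies every field=value criterion.
        if not criteria:
            return []
        matches = [self._postings.get(pair, set()) for pair in criteria.items()]
        return sorted(set.intersection(*matches))


# Hypothetical structured documents.
idx = MetadataIndex()
idx.index("doc-1", {"species": "human", "assay": "rna-seq"})
idx.index("doc-2", {"species": "human", "assay": "wgs"})
print(idx.query(species="human", assay="wgs"))
```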

Portal

Commons infrastructure should provide interfaces to make data easily findable. Once data have been found in a portal, they can be added to a workspace.

Application

Applications combine a variety of Commons components to carry out specific tasks.

Commons Alliance Components

Source Code Repository Table

Links to source code repositories for implementations are provided below:

| Component | Broad | UChicago CDIS | UCSC CGP |
| --- | --- | --- | --- |
| Digital Object Catalog | | | |
| GUID Resolver | | indexd* | dos-azul-lambda* |
| Namespace Service | | indexd* | |
| Data Access | | fence | cgp-data-store |
| Access Control | | | |
| Authorization | sam, bond | fence | |
| Authentication | sam, bond | fence | |
| Analytical Engine | Cromwell**, Leonardo | | toil |
| Tool Repository | Agora* | | Dockstore* |
| Workspaces | Rawls | jupyterhub | |
| Indexing and Search | Orchestration | | |
| Ontology | | datadictionary | |
| Metadata Indexer | Orchestration | sheepdog | azul-indexer |
| Metadata Querying | Orchestration | peregrine | azul-webservice |
| Portal | Firecloud | windmill | boardwalk |
| Application | | | xena |

Applications marked with a * implement a standard interface being developed with the GA4GH. Clients can interact with these applications using an open protocol.

UChicago CDIS

The University of Chicago CDIS group presents software for easily managing the submission and access control of bioinformatics and medical informatics data in cloud environments.

An image of the UC CDIS commons services

UC Santa Cruz Computational Genomics Platform

An image of the UCSC commons services

Broad Institute

This section is in progress

An image of the Broad commons services

Development

This document is under active development. If you feel misrepresented or something has been miscommunicated, please open an issue or make a Pull Request!

Editing diagrams

The program used to edit the "dia" files is dia.

GitHub caches images when it displays READMEs, so be sure to check the actual file if an image seems out of date!