SIG: Statistical Analysis and Comprehension of the Human Cell Atlas in R/Bioconductor

stephaniehicks commented 6 years ago

Introduction of yourself: Bioconductor developers involved in the Chan Zuckerberg Initiative (CZI) to develop collaborative computational tools for the Human Cell Atlas (HCA).

Should it be held during Developer Day? Preferably, yes.

Desired outputs:

Want to provide an update to the BioC community on our ongoing project with the CZI-HCA
Want to have dedicated time for BioC developers involved in this project to meet at the Bioconductor 2018 conference to discuss progress and work on packages
Want to get feedback from the larger BioC community on the project and discuss ideas for long(er)-term funding for this work

Description of the topic: International projects generating large amounts of single-cell data, such as the Human Cell Atlas (HCA), have created great demand among researchers for fast, scalable, and efficient infrastructure and tools to analyze and effectively extract knowledge from billions of single cells. This led to a call for funding applications from the Chan Zuckerberg Initiative (CZI) to develop collaborative computational tools to access, analyze, and understand data from the HCA. The Bioconductor community submitted a joint proposal in August 2017 titled "Statistical Analysis and Comprehension of the Human Cell Atlas in R/Bioconductor", and we were recently awarded funding for one year to (1) provide a coherent programmatic interface to the HCA, and (2) enable scalable interactive statistical analysis of large single-cell data. This birds-of-a-feather session will provide a summary of what was done in the past year and what we plan to do in the next year. Our project aims are:

Enable HCA Data Coordination Platform (DCP) access through R / Bioconductor.
Develop standard representations of large single-cell data in semantically rich R / Bioconductor objects using established design principles.
Develop scalable data preprocessing and normalization pipelines that account for systematic bias and unwanted variability.
Implement fast and efficient algorithms scalable to billions of cells.
Facilitate finding and working with HCA data through ontology bindings.

A description of the principal investigators and their roles in the project is provided here:

Dr. Martin Morgan (Roswell Park Alliance Foundation): Access and Scalable Infrastructure for R / Bioconductor.
Dr. John Marioni / Dr. Aaron Lun (EMBL European Bioinformatics Institute): Enhancing the Accessibility of Large Single-Cell Data Sets in R / Bioconductor.
Dr. Wolfgang Huber / Dr. Michael Smith (EMBL Heidelberg): Enhancing the R / HDF5 Interface.
Dr. Raphael Gottardo (Fred Hutchinson Cancer Research Center): A Computational Infrastructure for Efficient Single-Cell Data Management, Visualization and Analysis.
Dr. Rafael Irizarry (Harvard T. H. Chan School of Public Health) / Dr. Christina Kendziorski (Department of Biostatistics & Medical Informatics, University of Wisconsin): Collaborative and Open Normalization, Inference, and Discovery Data Analysis Pipelines for the Human Cell Atlas.
Dr. Kasper D. Hansen (Johns Hopkins University): Scalable Computations with Large Single-Cell Datasets. Personnel includes Dr. Peter Hickey (postdoctoral fellow with Hansen), the lead developer of DelayedMatrixStats.
Dr. Davide Risso (Weill Cornell Medicine) / Dr. Stephanie Hicks (Johns Hopkins University): Fast and Efficient Implementations of Common Clustering Algorithms for Large Single-cell Data.
Dr. Vincent J. Carey (Harvard Medical School) / Dr. Aedin Culhane (Dana-Farber Cancer Institute / Harvard T. H. Chan School of Public Health): Ontology-Driven Interfaces and Inter-systems Architecture for Building and Using the Human Cell Atlas.

Finally, in the birds-of-a-feather session we will discuss and highlight existing and proposed Bioconductor software for the analysis of single-cell data that helps accomplish the aims of this project. For example, we have developed a unified representation for single-cell data with the SingleCellExperiment S4 class, which is an extension of the popular SummarizedExperiment class. In the past year, this class has been incorporated into many popular Bioconductor single-cell packages (e.g., scater, MAST, scDD, scPipe, scran, splatter, zinbwave, DropletUtils, clusterExperiment, SC3, destiny, and BASiCS), enabling improved interoperability between packages. To make tools and analyses scalable to millions or billions of cells, we have proposed Bioconductor infrastructure and efficient data representations for large single-cell data. This infrastructure is primarily based on out-of-memory computations with Bioconductor packages such as HDF5Array (implements HDF5-based on-disk representation), DelayedArray (implements lazy manipulation for efficient interactive analyses), rhdf5client (facilitates use of HDF Server or HDF Cloud for remote array data), and BiocParallel (standardizes parallel processing throughout the Bioconductor ecosystem).
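
To make the idea of lazy, block-wise computation on data that does not fit in memory a bit more concrete, here is a minimal, language-agnostic sketch in plain Python. It is not the DelayedArray or HDF5Array API: the `LazyMatrix` class, its `map` and `col_sums` methods, and the `read_block` callback are all hypothetical names invented for illustration, and the small in-memory `counts` list only stands in for an on-disk store. The sketch assumes rows are genes and columns are cells.

```python
import math
from typing import Callable, Iterator, List

Matrix = List[List[float]]  # rows = genes, columns = cells (assumed layout)


class LazyMatrix:
    """Toy delayed matrix: operations are queued and applied one block at a time.

    Hypothetical illustration only; this is not the DelayedArray API.
    """

    def __init__(self, read_block: Callable[[int, int], Matrix], n_cols: int,
                 ops: tuple = ()):
        self._read_block = read_block  # loads columns [start, stop) on demand
        self.n_cols = n_cols
        self._ops = ops                # queued element-wise functions

    def map(self, fn: Callable[[float], float]) -> "LazyMatrix":
        # Only record the operation; nothing is read or transformed yet.
        return LazyMatrix(self._read_block, self.n_cols, self._ops + (fn,))

    def blocks(self, block_size: int) -> Iterator[Matrix]:
        # Realize the matrix one column block at a time, applying queued ops.
        for start in range(0, self.n_cols, block_size):
            stop = min(start + block_size, self.n_cols)
            block = self._read_block(start, stop)
            for fn in self._ops:
                block = [[fn(x) for x in row] for row in block]
            yield block

    def col_sums(self, block_size: int = 2) -> List[float]:
        # Per-cell totals, holding at most one block in memory at a time.
        sums: List[float] = []
        for block in self.blocks(block_size):
            sums.extend(sum(col) for col in zip(*block))
        return sums


# Stand-in for an on-disk count matrix: 3 genes x 5 cells.
counts = [
    [0, 3, 1, 0, 2],
    [5, 0, 0, 1, 0],
    [2, 2, 4, 0, 1],
]
reads = []  # records which column ranges were actually loaded


def read_block(start: int, stop: int) -> Matrix:
    reads.append((start, stop))
    return [row[start:stop] for row in counts]


lazy = LazyMatrix(read_block, n_cols=5)
logged = lazy.map(math.log1p)                  # queued, no data touched yet
total_per_cell = lazy.col_sums(block_size=2)   # reads three column blocks
print(total_per_cell)
```

The design choice being illustrated is that a transformation such as `log1p` costs nothing until a summary is requested, and the summary itself only ever needs one block of cells in memory, which is what makes interactive work on very large matrices feasible.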

miaozhun commented 6 years ago

Hi Stephanie, thank you for the excellent proposal. I'm Zhun Miao from Tsinghua University, China, and I'm very interested in participating in the group. Thank you! See you then!

stephaniehicks commented 3 years ago

Closing this issue.
