SIG: Extending gene set and signature representations

llrs commented 5 years ago

Introduction of yourself: We are three developers involved in this proposal:

Lluís Revilla Sancho, a PhD student interested in gene sets and in how to store and analyse cell signatures, pathways and sets related to diseases, processes and cell states, as a way to gain more insight into the biology of cells. Developer of BioCor, and contributor to other Bioconductor packages such as GOSemSim and fgsea.

Kevin Rue-Albrecht, a postdoctoral researcher in computational biology at the University of Oxford. My main interests are in immunology and (single-cell) transcriptomics. My background includes an MSc in Bioengineering, Bioinformatics and Modelling for Biology, and a PhD in Computational Infection Biology. Contributions to Bioconductor include GOexpress, TVTB, and iSEE.

Kayla Morrell is a developer at Bioconductor.

Expected attendees: Users of gene set containers (e.g. GSEABase classes) wishing to discuss the limitations of existing classes for their use cases, and developers interested in collaboratively developing efficient representations of gene sets and signatures for the community.

Should it be held during Developer Day? Preferably, yes.

Description of the topic: The GSEABase package provides classes and methods to store and manipulate sets of genes such as KEGG pathways, Gene Ontology terms, Broad collections and other ontologies (using derivatives of the GeneSet and GeneSetCollection classes). The GSEABase implementation of these classes makes it difficult to use set operations such as subtraction, intersection, union and complement, and it becomes slow with more than a few hundred gene sets. In addition, the existing containers lack infrastructure to store information associated with individual elements, sets, and relationships between elements and sets.

The recent development of high-throughput genomics technologies has facilitated the generation of several data types: single-cell expression, methylation, chromatin accessibility, microorganism presence and transcriptome, proteomics, metabolomics, T-cell receptors, B-cell receptors, etc. Current research uses several of these features together to describe phenotypes accurately. For instance, single-cell expression is often used to define new cell lines from a group of expressed genes. Supporting other types of data would require creating new classes in the GSEABase package. However, these new classes would neither overcome the slowness of the existing structure nor make it easier for new users.

Over the last months, a coordinated effort among members of the Bioconductor community has explored possibilities for novel gene set containers. We have discussed how a new class could solve these problems (a public record can be found here). We developed new containers with the GSEABase functionality while simplifying the internal structure into three tables (one for elements, one for sets and one for the relationships between them) and allowing non-quoted evaluation. Three packages were developed to explore different implementations:

This special interest group session will summarise what each developer has done in the past months, hear their feedback, and plan accordingly, taking into account the project aims:

The container should be capable of storing relationships between elements and sets, along with associated information and methods
The container should be fast and efficient at storing a large number of sets (hundreds or thousands)
The container should follow tidyverse principles to ease usage, allowing non-quoted evaluation in its methods
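
To make the three-table idea above more concrete, here is a minimal sketch of what such a container could look like. It is written in Python only for brevity and does not correspond to any of the packages discussed: the `SetCollection` class, its method names, and the toy gene/set names and `evidence` field are all hypothetical and purely illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class SetCollection:
    """Hypothetical three-table container: elements, sets, and relations."""

    # element id -> information about the element
    elements: dict = field(default_factory=dict)
    # set id -> information about the set
    sets: dict = field(default_factory=dict)
    # (element id, set id) -> information about that relationship
    relations: dict = field(default_factory=dict)

    def add_relation(self, element, set_name, **info):
        # Register the element and the set if unseen, then store the link.
        self.elements.setdefault(element, {})
        self.sets.setdefault(set_name, {})
        self.relations[(element, set_name)] = info

    def members(self, set_name):
        # Elements of one set, read from the relations table.
        return {e for (e, s) in self.relations if s == set_name}

    def union(self, a, b):
        return self.members(a) | self.members(b)

    def intersection(self, a, b):
        return self.members(a) & self.members(b)

    def subtraction(self, a, b):
        return self.members(a) - self.members(b)


# Toy data, for illustration only.
collection = SetCollection()
collection.add_relation("TP53", "apoptosis", evidence="curated")
collection.add_relation("BAX", "apoptosis", evidence="inferred")
collection.add_relation("TP53", "cell_cycle", evidence="curated")
collection.add_relation("CDK1", "cell_cycle", evidence="curated")

shared = collection.intersection("apoptosis", "cell_cycle")
only_apoptosis = collection.subtraction("apoptosis", "cell_cycle")
```

Because every relationship is a row keyed by an (element, set) pair, set operations reduce to operations on the members of each set, and information can be attached at the element, set, or relationship level independently.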

In the birds-of-a-feather session we will discuss the proposed software aimed at the analysis of sets. We will also explore the needs and wishes of other developers currently using GSEABase classes, and what users and developers might need.

Desired outcome: Primarily, a general discussion and brainstorming of the features that prospective users and developers expect. It would be particularly useful to identify relevant use cases, for example in single-cell RNA-seq workflows intended to generate and apply cell type signatures. We would also like to decide which package should be the focus of development and be used by other packages and developers.

Secondary outputs could include code, notebooks, gists, documentation, and opening issues on the existing repositories.

mtmorgan commented 5 years ago

@llrs @kevinrue @Kayla-Morrell I scheduled this for the Farkas Auditorium, 3:30 - 4:30 today.

llrs commented 5 years ago

Great! I'll be there

kevinrue commented 5 years ago

Noted. I'll be there too.

llrs commented 5 years ago

Slides are here: https://bit.ly/2xaXbDt

Bioconductor / BioC2019

SIG: Extending gene set and signature representations #25