Bioconductor / BioC2019

BioC2019: Where Software and Biology Connect

SIG: Bioconductor Infrastructure for Base Modifications #35

Open Shians opened 5 years ago

Shians commented 5 years ago

Introduction

I am a new PhD student at the Walter and Eliza Hall Institute in Melbourne, Australia. My project is based around methods and tools for the analysis of DNA methylation in long reads from Oxford Nanopore sequencers. My formal background is in statistics, but I mainly work on developing software and have a keen interest in efficient, user-friendly computational methods and visualisation.

Expected attendees

Researchers who are interested in base modifications of all kinds. I am interested in DNA, but the structure we develop should equally support RNA modifications.

Should it be held during Developer Day?

Probably

Description of the topic

(Will update this section after I do some more research and take suggestions)

I think there are a few things to keep in mind for this:

  • As far as I'm aware, there is no specialised, widely supported Bioconductor structure for storing base modification information that also facilitates straightforward querying for common use cases.
  • The basic query would be for the methylation proportions in a specific region. Objects should carry metadata that separates samples into groups for which this can be asked, and should report coverage at each locus.
  • It would also be useful to query within-read methylation patterns, to inspect correlation between methylation sites within molecules.
  • Compactness of representation is also going to be important, so sparse or on-disk representations would be useful to consider. Features and query performance probably take second place to storage size.
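
To make the queries above concrete, here is a minimal sketch in Python of a sparse, read-level store that answers them. It is not an existing Bioconductor structure: the `ModCalls` class, its method names, and the example read and group labels are all hypothetical, and a real implementation would more likely build on GRanges-style objects with an on-disk backend.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ModCalls:
    """Hypothetical sparse store: only covered sites are kept, per read."""

    # read_id -> (group label, chromosome, sorted positions, 0/1 calls)
    reads: dict = field(default_factory=dict)

    def add_read(self, read_id, group, chrom, calls):
        # calls: iterable of (position, is_methylated) pairs for one read
        ordered = sorted(calls)
        positions = [pos for pos, _ in ordered]
        states = [int(bool(state)) for _, state in ordered]
        self.reads[read_id] = (group, chrom, positions, states)

    def region_summary(self, chrom, start, end):
        """Per group: site -> (methylated proportion, coverage) in [start, end]."""
        counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
        for group, read_chrom, positions, states in self.reads.values():
            if read_chrom != chrom:
                continue
            lo = bisect_left(positions, start)
            hi = bisect_right(positions, end)
            for pos, state in zip(positions[lo:hi], states[lo:hi]):
                counts[group][pos][0] += state  # methylated calls
                counts[group][pos][1] += 1      # coverage
        return {
            group: {pos: (meth / cov, cov) for pos, (meth, cov) in sorted(sites.items())}
            for group, sites in counts.items()
        }

    def co_methylation(self, chrom, pos_a, pos_b):
        """2x2 table of joint states over reads that cover both sites."""
        table = [[0, 0], [0, 0]]
        for _, read_chrom, positions, states in self.reads.values():
            if read_chrom != chrom:
                continue
            lookup = dict(zip(positions, states))
            if pos_a in lookup and pos_b in lookup:
                table[lookup[pos_a]][lookup[pos_b]] += 1
        return table


# Made-up example reads, for illustration only
store = ModCalls()
store.add_read("r1", "tumour", "chr1", [(100, 1), (150, 1), (220, 0)])
store.add_read("r2", "tumour", "chr1", [(100, 0), (150, 0)])
store.add_read("r3", "normal", "chr1", [(150, 1), (220, 1)])
summary = store.region_summary("chr1", 100, 200)
joint = store.co_methylation("chr1", 100, 150)
```

Keeping calls per read, rather than pre-aggregated per site, is what makes the within-read query possible. The cost is storage size, which is where a sparse or on-disk layout would matter.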

Desired outcome

I'd like to establish a set of queries of interest and a general abstract idea of what data structure(s) might be appropriate.

Shians commented 5 years ago

Updated description with

  • Interoperability with genomic data structures. Down the line it's very likely that methylation and mRNA expression will be analysed together, so facilitating this kind of analysis is of great interest.

mtmorgan commented 5 years ago

@Shians I scheduled this for the Smillow Seminar Room 3:30 - 4:30 today

PeteHaitch commented 5 years ago

I'm taking notes, which I'll post here.

omsai commented 5 years ago

From the discussion it sounds like GPos or GRanges would work for the positions, together with some implementation of HDF5Array for the Oxford Nanopore FAST5 reads.

For a general description of modifications, if we want to look toward other projects for consistency, the NCBI C++ toolkit has an analogous SeqFeatData data structure for single base modifications:

site: A Defined Site

The site feature annotates a known site from the following specified list. If the site is “other” then Seq-feat.comment should be used to explain the site.

  • active (1),
  • binding (2),
  • cleavage (3),
  • inhibit (4),
  • modified (5),
  • glycosylation (6),
  • myristoylation (7),
  • mutagenized (8),
  • metal-binding (9),
  • phosphorylation (10),
  • acetylation (11),
  • amidation (12),
  • methylation (13),
  • hydroxylation (14),
  • sulfatation (15),
  • oxidative-deamination (16),
  • pyrrolidone-carboxylic-acid (17),
  • gamma-carboxyglutamic-acid (18),
  • blocked (19),
  • lipid-binding (20),
  • np-binding (21),
  • dna-binding (22),
  • other (255)
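
As a rough illustration of how such a controlled vocabulary could be carried alongside positions, the list above can be written as an enumeration. This is a sketch in Python, not the NCBI toolkit's API: the `SiteType` class and the `annotate_site` helper are hypothetical names, and only the labels and integer codes come from the list.

```python
from enum import IntEnum


class SiteType(IntEnum):
    """Site codes copied from the list above; the class itself is hypothetical."""

    ACTIVE = 1
    BINDING = 2
    CLEAVAGE = 3
    INHIBIT = 4
    MODIFIED = 5
    GLYCOSYLATION = 6
    MYRISTOYLATION = 7
    MUTAGENIZED = 8
    METAL_BINDING = 9
    PHOSPHORYLATION = 10
    ACETYLATION = 11
    AMIDATION = 12
    METHYLATION = 13
    HYDROXYLATION = 14
    SULFATATION = 15
    OXIDATIVE_DEAMINATION = 16
    PYRROLIDONE_CARBOXYLIC_ACID = 17
    GAMMA_CARBOXYGLUTAMIC_ACID = 18
    BLOCKED = 19
    LIPID_BINDING = 20
    NP_BINDING = 21
    DNA_BINDING = 22
    OTHER = 255


def annotate_site(position, site, comment=None):
    """Build a small annotation record; 'other' must come with a comment."""
    site = SiteType(site)
    if site is SiteType.OTHER and not comment:
        raise ValueError("site 'other' needs an explanatory comment")
    label = site.name.lower().replace("_", "-")
    return {"position": position, "site": label, "comment": comment}


record = annotate_site(42, 13)
```

Storing the integer code and looking up the label on demand keeps the per-site cost small, which fits the compactness concern raised in the description.
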

FelixErnst commented 5 years ago

For a dictionary of DNA or RNA nucleotide modifications, have a look at the Modstrings package. It also contains functions for turning a sequence that contains modified nucleotides into a GRanges object holding their coordinates (see the functions separate and combineIntoModstrings).

https://bioconductor.org/packages/release/bioc/html/Modstrings.html
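
The general idea of splitting a modified sequence into a plain sequence plus coordinates, and putting it back together, can be sketched in a few lines of Python. This is not the Modstrings API: the one-letter codes in `MOD_CODES` are invented for the example and do not match the Modstrings alphabet, and `split_modifications` and `merge_modifications` are hypothetical names.

```python
# Hypothetical one-letter codes for modified bases; NOT the Modstrings alphabet.
MOD_CODES = {
    "m": ("C", "5mC"),   # 5-methylcytosine
    "h": ("C", "5hmC"),  # 5-hydroxymethylcytosine
    "a": ("A", "6mA"),   # N6-methyladenine
}


def split_modifications(mod_seq):
    """Return the unmodified sequence and 1-based (start, end, name) sites."""
    plain = []
    sites = []
    for index, symbol in enumerate(mod_seq, start=1):
        if symbol in MOD_CODES:
            base, name = MOD_CODES[symbol]
            plain.append(base)
            sites.append((index, index, name))
        else:
            plain.append(symbol)
    return "".join(plain), sites


def merge_modifications(plain_seq, sites):
    """Inverse of split_modifications: write the codes back into the sequence."""
    code_for = {name: code for code, (_, name) in MOD_CODES.items()}
    symbols = list(plain_seq)
    for start, _end, name in sites:
        symbols[start - 1] = code_for[name]
    return "".join(symbols)


plain, sites = split_modifications("ACmGTaC")
restored = merge_modifications(plain, sites)
```

The coordinate form is what would map onto ranges with a metadata column for the modification name, while the string form is the compact one.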

edit: my2cents on the data structures:

PeteHaitch commented 5 years ago

Thanks, Felix! Your work on RNAmodR came up during discussions, but I forgot about Modstrings!

PeteHaitch commented 5 years ago

Here's my notes from the day (copied below):

FelixErnst commented 5 years ago

RNAmodR is basically structured like this:

General ideas and experiences:

FelixErnst commented 5 years ago

@Shians fyi: It is probably quite easy to turn RNAmodR into a general modR package and have separate RNAmodR and DNAmodR packages extend it. Let me know if that might be of interest to you.

Shians commented 5 years ago

Thanks Felix, I’m on vacation and will have a look at this next week to properly digest it. If memory may be a problem, I’ve been itching to learn some on-disk methods.

FelixErnst commented 5 years ago

@Shians Due to the recent changes to the DataFrame class, I changed the structure of the SequenceDataFrame class. In theory, it now supports different backends, as outlined by @hpages in the recent Devtalk (see Slack). However, I haven't had an opportunity to test it yet, because it still requires some changes in the S4Vectors package. You might want to stay tuned to changes coming up there if you are still working on this. Once the listData slot moves to the DFrame class, different backends can be supported and combined with the SequenceDataFrame class.

At the same time, I introduced the RNAModifier and DNAModifier classes to distinguish RNA and DNA detection strategies. Also, the SequenceDataFrame class now has a seqtype() getter and setter to change between RNA and DNA sequences. For more details, have a look at the recent changes in the RNAmodR package.