SIG: Bioconductor Infrastructure for Base Modifications

Bioconductor / BioC2019

BioC2019: Where Software and Biology Connect

6 stars 9 forks source link

SIG: Bioconductor Infrastructure for Base Modifications #35

Open Shians opened 5 years ago

Shians commented 5 years ago

Introduction

I am a new PhD Student at the Walter and Eliza Hall institute in Melbourne, Australia. My project is based around methods and tools for the analysis of DNA methylation in long reads using Oxford Nanopore sequencers. My formal background is in statistics but I mainly work on developing software and have a keen interest in efficient and user-friendly computational methods and visualisation.

Expected attendees

Researchers who are interested in base modifications of all kinds, I am interested in DNA but the developed structure should equally support RNA modifications.

Should it be held during Developer Day

Probably

Description of the topic

(Will update this section after I do some more research and take suggestions)

I think there are things to keep in mind for this:

Support for long reads, I don't think this is an big issue, I'm not aware of GenomicRanges having any limitations with length of reads, but since I'm interested in Nanopore sequencing, it's vitally important to have this support.
Read-based tracking, since I'm thinking about long reads, I can potentially detect when two sites along a read have correlated or anti-correlated methylation patterns on the same molecule. So I want to not only keep track of this information but efficiently make queries based on it.
Support for RNA modifications, there are over a hundred of these, I think extended alphabets are sometimes used for representing DNA modifications but that's likely not feasible without creating FASTQ qual-string-like monstrosities.
Interoperability with genomic data structures, down the line it's very likely that methylation and mRNA expression will be analysed together, facilitating this kind of analysis is of great interest.

As far as I'm aware there's not a specialised widely supported Bioconductor structure for storing base modification information that also facilitates straightforward querying of common issues. The basics would be to ask for the methylation proportions in a specific region, there should be metadata within objects to separate groups for which this can be asked as well as reporting of coverage at the loci. Additionally it would be useful to query within-read methylation patterns, to inspect correlation between methylation sites within molecules. Compactness of representation is also going to be important, sparse or on-disk representations would be useful to consider, features and query performance probably take second place to storage size.

Desired outcome

I'd like to establish a set of queries of interest and a general abstract idea of what data structure(s) might be appropriate.

Shians commented 5 years ago

Updated description with

Interoperability with genomic data structures, down the line it's very likely that methylation and mRNA expression will be analysed together, facilitating this kind of analysis is of great interest.

mtmorgan commented 5 years ago

@Shians I scheduled this for the Smillow Seminar Room 3:30 - 4:30 today

PeteHaitch commented 5 years ago

I'm taking notes which I'll post here

omsai commented 5 years ago

From the discussion it sounds like GPos or GRanges would work, some implementation of HDF5Array for the Oxford Nanopore FAST5 reads.

For a general description of modifications, if we want to look toward other projects for consistency, the NCBI C++ toolkit has an analogous SeqFeatData data structure for single base modifications:

site: A Defined Site

The site feature annotates a know site from the following specified list. If the site is “other” then Seq-feat.comment should be used to explain the site.

active (1) ,

binding (2) ,

cleavage (3) ,

inhibit (4) ,

modified (5),

glycosylation (6) ,

myristoylation (7) ,

mutagenized (8) ,

metal-binding (9) ,

phosphorylation (10) ,

acetylation (11) ,

amidation (12) ,

methylation (13) ,

hydroxylation (14) ,

sulfatation (15) ,

oxidative-deamination (16) ,

pyrrolidone-carboxylic-acid (17) ,

gamma-carboxyglutamic-acid (18) ,

blocked (19) ,

lipid-binding (20) ,

np-binding (21) ,

dna-binding (22) ,

other (255)

FelixErnst commented 5 years ago

For a dictionary of DNA or RNA nucleotide modifications have a look at the Modstrings package. It also contains functions for turning a sequence of nucleotides containing modified nucleotides into GRanges object containing the coordinates (see functions separate and combineIntoModstrings)

https://bioconductor.org/packages/release/bioc/html/Modstrings.html

edit: my2cents on the data structures:

alphabet: I used the alphabet from the modomics project (http://modomics.genesilico.pl) and the publication/website from the Hoffman lab (https://dnamod.hoffmanlab.org, now published here https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0349-4)
for a linear format I reused the structure and functions from the Biostrings package. The functions for comparison of objects does work out of the box. Saving these objects: FASTQ files can contain special characters. The current set of modifications in the Modstrings package is compatible with the fastq format, so that a single score per nucleotide can be stored as well (see QualityScaledModDNAString). Fasta files work as well. Files of any format need to be UTF-8 endcoded. So I think this covers a linear format. Letter, e.g. nucleotides, are stored internally as integer values using the Biostrings backend. The section NCBI8na: An Eight Bit Sequential Encoding for Modified Nucleic Acids from @omsai mentions this as well, to I guess it is compatible to that idea. The conversion is fixed using the Mod_DNA_codes.txt and Mod_RNA_codes.txt in the inst/extdata directory of the package and can only by extended from this point onward to keep compatibility with older objects.
For a tabular format it gets more tricky: By definition a GRanges object refers to genomic coordinates, which is fine for DNA, but not ideal for RNA. I don't see an alternative, since a transcript is always tied to the genome and therefore genomic coordinates. Maybe a class of TRanges (transcript ranges) can be envisioned, but I think it currently wouldn't have any benefit. I toyed with the idea of extending a ModRanges class from the GRanges class to fix a column for the modification identifier and score. Due to time and no clear necessity, I didn't implement it.
certainty versus frequency: I don't how one should/could/would specify the distinction. Sometimes calculated scores represent the frequency of nucleotide being modified, sometimes it is just about certainty. This is mostly based on the capability of the experimental method. Maybe a distinction could be fixed in a ModRanges class along with a definition for a FASTQ derived format.
regarding the SeqFeatData data structure from @omsai : The modomics project contains a new nomenclature and also contains a brief description of the logic behind it (https://doi.org/10.1093/nar/gkx1030). However, I was able to use it only partially, since some inconsistencies exist (see R/Modstrings-separate.R#L178). Nonetheless, the new nomenclature approach has some logic behind it, which is very similar to the idea outlined on NCBI github page and it could be harnessed to describe the building blocks of modification rather then the modification itself. The same approach could be applied to the DNA alphabet of the Hoffman lab.
Meta information: think somethings can be repurposed from other packages, but in general I think a new package could be the right place to convert genomic positions to metagene or metatranscript positions for in depth analysis strategies. For this a class like MetaRanges could be implemented, holding the information in a streamlined aggregated format.

PeteHaitch commented 5 years ago

Thanks, Felix! Your work on RNAmodR came up during discussions, but I forgot about ModStrings!

PeteHaitch commented 5 years ago

Here's my notes from the day (copied below):

Some options for storing the sequence
- E.g., ...ATmCGA..., where mC denotes methylated cytosine
  - Extended alphabet
  - Rle encoding of modification vector
  - Parallel vectors
    - DNA bases (A, C, G, T)
    - DNA modifications (m, ca, f, etc.)
  - https://hoffmanlab.org/proj/dnamod/ is a database of known/suspected modifications
The HTSspecs are currently hashing out ideas for methylation tags https://github.com/samtools/hts-specs/pull/418 and https://github.com/samtools/hts-specs/issues/362
- Seems to be methylation-specific, what about other modifications?
Need a readGAlignments() equivalent for Nanopore data?
- For reading squiggles (FAST5): poRe (sourceforge),
- For processed data: it's basically a CSV file so any old CSV reader will work
Challenges of FAST5 (squiggle format)
- Format is a bit hard to understand, but it's 'just' an HDF5 file
- More tools available in Python than R (for now)
Mostly talked about Nanopore but there's also PacBio
- Is there any support for PacBio in R/Bioconductor?
Q: Could there be common interests with researchers who study RNA post-translational modification
- E.g., the recent RNAmodR package
- Need to take a look at the low-level functionality of this package
Common operations of methylation data
- Find overlaps of features (e.g., promoters)
- Read-level / haplotype-level analysis of methylation
  - As metadata on a GRanges
  - Adding a 'haplotype' or 'allele' to each locus (a la strand)
Also some shared features of DNA genotyping and linkage disequilibrium
- Getting methylation value at a locus is like a pileup
- Looking at methylation along a read is like LD / haplotyping
GPos/GRanges + metadata for storing locus-level data
Perhaps GPosList/GRangesList for read-level data with GPos/GRanges for each modified locus

FelixErnst commented 5 years ago

RNAmodR is basically structured like this:

virtual SequenceData class, which are derived from CompressedSplitDataFrameList, and add a sequences and ranges slot. (SequenceDataFrame is used for the unlistData slot. unlisting and relisting is supported)
specific SequenceData classes contain one type of information, e.g. 5'- or 3'-end of reads (relies on readGAlignment to load data. Each type can be reused for different detection strategies).
Modifier objects wrap SequenceData objects (or SequenceDataSet/SequenceDataList).
each SequenceData class "knows" how to sum up data from the two conditions (treated/control) using the aggregate function.
Modifier objects work on the aggregate data from SequenceData object to calculate experiment specific values per position and store the result in the Modifier's aggregate slot.
based on some identifying marks, e.g. score thresholds, Modifier objects subsets to specific positions and reports these as modified nucleotide positions (the result is a GRanges object compatible with the combineIntoModstrings function from the Modstrings package. width() == 1L).
ModifierSet objects contain Modifier objects from different experiments.
subsetting and comparison functions are implemented for SequenceData/ Modifier/ ModifierSet objects. Coordinates for subsetting are provided as GRanges objects.
visualization is currently done via Gviz. trackViewer might be an option, since it offers a bit more capability to mark special nucleotides.
adapter function for machine learning are also available since last week. Feature engineering can be done through the aggregate function of Modifier class and the dimension of a tensor can be controlled using the trainingData function, which is just a special subsetting function. (see RNAmodR.ML for example)
See RNAmodR.AlkAnilineSeq for an example of implementing a new strategy. It is very straight forward and requires only 3 files.

General ideas and experiences:

Most challanging situation is a full transcriptome wide modification search on human data, since available memory becomes an issue. Example/Experience so far: Full human transcriptome requires 4-5 Gb per Bamfile input. So with two replicates thats 20Gb. After subsetting to relevant transcripts (coverage > 50) ~6 Gb remain (> 50 mil reads per bam file as input). The HDF5 backend was therefore on my radar.
For a genomic search, which includes intergenic regions, memory requirements will go through the roof.

FelixErnst commented 5 years ago

@Shians fyi: It is probably quite easy to turn RNAmodR into a modR package and extend a separate RNAmodR and DNAmodR package. Let me know, if that might be of interest to you.

Shians commented 5 years ago

Thanks Felix, I’m on vacation and will have a look at this next week to properly digest it. If memory may be a problem, I’ve been itching to learn some on-disk methods.

FelixErnst commented 5 years ago

@Shians Due to the recent changes in the DataFrame package, I changed the structure of the SequenceDataFrame class. In theory, It supports now different backend as outligned @hpages in the recent Devtalk (see Slack). However, I didn't have any opportunity to test it, yet, because it still requires some change in the S4Vectors package. You might want to stay tuned to changes coming up there, if you are still working on this. Once the listData slot moves to the DFrame class, different backends can be supported and combined with the SequenceDataFrame class.

At the same time I introduced the RNAModifier and DNAModifier class to distinguish RNA and DNA detection strategies. Also, the SequenceDataFrame class now has a seqtype() getter and setter to change between RNA and DNA sequences. For more details have a look at the recent changes in RNAmodR package.