Open Shians opened 5 years ago
Updated description with
- Interoperability with genomic data structures: down the line it is very likely that methylation and mRNA expression will be analysed together, so facilitating this kind of analysis is of great interest.
@Shians I scheduled this for the Smillow Seminar Room 3:30 - 4:30 today
I'm taking notes which I'll post here
From the discussion it sounds like `GPos` or `GRanges` would work, with some implementation of `HDF5Array` for the Oxford Nanopore FAST5 reads.
For a general description of modifications, if we want to look toward other projects for consistency, the NCBI C++ toolkit has an analogous `SeqFeatData` data structure for single base modifications:
site: A Defined Site
The site feature annotates a known site from the following specified list. If the site is “other” then Seq-feat.comment should be used to explain the site.
- active (1) ,
- binding (2) ,
- cleavage (3) ,
- inhibit (4) ,
- modified (5),
- glycosylation (6) ,
- myristoylation (7) ,
- mutagenized (8) ,
- metal-binding (9) ,
- phosphorylation (10) ,
- acetylation (11) ,
- amidation (12) ,
- methylation (13) ,
- hydroxylation (14) ,
- sulfatation (15) ,
- oxidative-deamination (16) ,
- pyrrolidone-carboxylic-acid (17) ,
- gamma-carboxyglutamic-acid (18) ,
- blocked (19) ,
- lipid-binding (20) ,
- np-binding (21) ,
- dna-binding (22) ,
- other (255)
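To make the shape of that enumeration concrete, here is a minimal, language-neutral sketch in Python. It is not the NCBI toolkit API: the names `SiteType` and `SiteFeature` and the `position` field are hypothetical, and the hyphenated site names are rewritten with underscores. The numeric codes are copied from the list above, and the constructor check mirrors the rule that an "other" site needs an explanatory comment.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class SiteType(IntEnum):
    """Site codes copied from the list above (names are an illustrative spelling)."""
    ACTIVE = 1
    BINDING = 2
    CLEAVAGE = 3
    INHIBIT = 4
    MODIFIED = 5
    GLYCOSYLATION = 6
    MYRISTOYLATION = 7
    MUTAGENIZED = 8
    METAL_BINDING = 9
    PHOSPHORYLATION = 10
    ACETYLATION = 11
    AMIDATION = 12
    METHYLATION = 13
    HYDROXYLATION = 14
    SULFATATION = 15
    OXIDATIVE_DEAMINATION = 16
    PYRROLIDONE_CARBOXYLIC_ACID = 17
    GAMMA_CARBOXYGLUTAMIC_ACID = 18
    BLOCKED = 19
    LIPID_BINDING = 20
    NP_BINDING = 21
    DNA_BINDING = 22
    OTHER = 255


@dataclass(frozen=True)
class SiteFeature:
    """Hypothetical record: one annotated position plus its site type."""
    position: int
    site: SiteType
    comment: Optional[str] = None

    def __post_init__(self):
        # Mirrors the rule quoted above: "other" must be explained in a comment.
        if self.site is SiteType.OTHER and not self.comment:
            raise ValueError("site 'other' needs an explanatory comment")


feature = SiteFeature(position=1042, site=SiteType.METHYLATION)
```

A fixed integer code per site type keeps the stored record compact, at the cost of needing the lookup table to interpret it.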
For a dictionary of DNA or RNA nucleotide modifications have a look at the `Modstrings` package. It also contains functions for turning a sequence of nucleotides containing modified nucleotides into a `GRanges` object containing the coordinates (see functions `separate` and `combineIntoModstrings`).
https://bioconductor.org/packages/release/bioc/html/Modstrings.html
edit: my2cents on the data structures:
- `Biostrings` package. The functions for comparing objects work out of the box. Saving these objects: FASTQ files can contain special characters. The current set of modifications in the `Modstrings` package is compatible with the FASTQ format, so that a single score per nucleotide can be stored as well (see `QualityScaledModDNAString`). FASTA files work as well. Files of any format need to be UTF-8 encoded. So I think this covers a linear format.
- Letters, e.g. nucleotides, are stored internally as integer values using the Biostrings backend. The section "NCBI8na: An Eight Bit Sequential Encoding for Modified Nucleic Acids" from @omsai mentions this as well, so I guess it is compatible with that idea. The conversion is fixed using the `Mod_DNA_codes.txt` and `Mod_RNA_codes.txt` files in the `inst/extdata` directory of the package and can only be extended from this point onward to keep compatibility with older objects.
- `GRanges` object refers to genomic coordinates, which is fine for DNA, but not ideal for RNA. I don't see an alternative, since a transcript is always tied to the genome and therefore to genomic coordinates. Maybe a class of `TRanges` (transcript ranges) can be envisioned, but I think it currently wouldn't have any benefit.
- I toyed with the idea of extending a `ModRanges` class from the `GRanges` class to fix a column for the modification identifier and score. Due to time and no clear necessity, I didn't implement it.
- `ModRanges` class along with a definition for a FASTQ-derived format.
- `SeqFeatData` data structure from @omsai: The modomics project contains a new nomenclature and also a brief description of the logic behind it (https://doi.org/10.1093/nar/gkx1030). However, I was able to use it only partially, since some inconsistencies exist (see R/Modstrings-separate.R#L178). Nonetheless, the new nomenclature approach has some logic behind it, which is very similar to the idea outlined on the NCBI GitHub page, and it could be harnessed to describe the building blocks of a modification rather than the modification itself. The same approach could be applied to the DNA alphabet of the Hoffman lab.
- `MetaRanges` could be implemented, holding the information in a streamlined aggregated format.

Thanks, Felix! Your work on RNAmodR came up during discussions, but I forgot about ModStrings!
Here's my notes from the day (copied below):
RNAmodR is basically structured like this:
- `SequenceData` classes, which are derived from `CompressedSplitDataFrameList` and add a sequences and a ranges slot. (`SequenceDataFrame` is used for the `unlistData` slot. Unlisting and relisting are supported.)
- `SequenceData` classes contain one type of information, e.g. 5'- or 3'-end of reads (relies on `readGAlignment` to load data. Each type can be reused for different detection strategies).
- `Modifier` objects wrap `SequenceData` objects (or `SequenceDataSet`/`SequenceDataList`).
- `SequenceData` class "knows" how to sum up data from the two conditions (`treated`/`control`) using the `aggregate` function.
- `Modifier` objects work on the aggregate data from the `SequenceData` object to calculate experiment-specific values per position and store the result in the `Modifier`'s `aggregate` slot.
- `Modifier` objects subset to specific positions and report these as modified nucleotide positions (the result is a `GRanges` object compatible with the `combineIntoModstrings` function from the `Modstrings` package. `width() == 1L`).
- `ModifierSet` objects contain `Modifier` objects from different experiments.
- `SequenceData` / `Modifier` / `ModifierSet` objects. Coordinates for subsetting are provided as `GRanges` objects.
- `Gviz`. `trackViewer` might be an option, since it offers a bit more capability to mark special nucleotides.
- `aggregate` function of the `Modifier` class, and the dimension of a tensor can be controlled using the `trainingData` function, which is just a special subsetting function. (See RNAmodR.ML for an example.)

General ideas and experiences:
@Shians fyi: It is probably quite easy to turn RNAmodR into a modR package and extend a separate RNAmodR and DNAmodR package. Let me know, if that might be of interest to you.
Thanks Felix, I’m on vacation and will have a look at this next week to properly digest it. If memory may be a problem, I’ve been itching to learn some on-disk methods.
@Shians Due to the recent changes in the `DataFrame` package, I changed the structure of the `SequenceDataFrame` class. In theory, it now supports different backends, as outlined by @hpages in the recent Devtalk (see Slack). However, I haven't had an opportunity to test it yet, because it still requires some changes in the `S4Vectors` package. You might want to stay tuned to changes coming up there, if you are still working on this. Once the `listData` slot moves to the `DFrame` class, different backends can be supported and combined with the `SequenceDataFrame` class.
At the same time I introduced the `RNAModifier` and `DNAModifier` classes to distinguish RNA and DNA detection strategies. Also, the `SequenceDataFrame` class now has a `seqtype()` getter and setter to change between RNA and DNA sequences. For more details have a look at the recent changes in the `RNAmodR` package.
Introduction
I am a new PhD student at the Walter and Eliza Hall Institute in Melbourne, Australia. My project is based around methods and tools for the analysis of DNA methylation in long reads using Oxford Nanopore sequencers. My formal background is in statistics, but I mainly work on developing software and have a keen interest in efficient and user-friendly computational methods and visualisation.
Expected attendees
Researchers who are interested in base modifications of all kinds. I am interested in DNA, but the developed structure should equally support RNA modifications.
Should it be held during Developer Day
Probably
Description of the topic
(Will update this section after I do some more research and take suggestions)
I think there are things to keep in mind for this:
As far as I'm aware, there is no specialised, widely supported Bioconductor structure for storing base modification information that also makes common queries straightforward. The most basic query is the methylation proportion in a specific region. Objects should carry metadata that separates samples into groups, so that this query can be asked per group, and should report coverage at each locus. It would also be useful to query within-read methylation patterns, to inspect the correlation between methylation sites within molecules. Compactness of representation is also going to be important, so sparse or on-disk representations are worth considering; features and query performance probably take second place to storage size.
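As a starting point for discussing the query set, the queries described above can be sketched in plain Python. This is only an illustration of the intended semantics, not a proposed Bioconductor design: the `Call` record, its fields, the function names `site_summary` and `within_read_patterns`, and the example data are all hypothetical, and positions are assumed to be 1-based and inclusive. It uses an in-memory list, so it ignores the storage-size concern entirely.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """Hypothetical record: one methylation call on one read at one position."""
    read_id: str
    group: str       # sample group taken from the object's metadata
    chrom: str
    pos: int         # assumed 1-based
    methylated: bool


def _in_region(call, chrom, start, end):
    return call.chrom == chrom and start <= call.pos <= end


def site_summary(calls, chrom, start, end):
    """Per group and per locus: coverage and methylation proportion."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for c in calls:
        if _in_region(c, chrom, start, end):
            tally = counts[c.group][c.pos]
            tally[0] += int(c.methylated)  # methylated calls
            tally[1] += 1                  # coverage at this locus
    return {
        group: {
            pos: {"coverage": n, "proportion": m / n}
            for pos, (m, n) in sorted(sites.items())
        }
        for group, sites in counts.items()
    }


def within_read_patterns(calls, chrom, start, end):
    """Per read: methylation state at each covered locus in the region."""
    patterns = defaultdict(dict)
    for c in calls:
        if _in_region(c, chrom, start, end):
            patterns[c.read_id][c.pos] = c.methylated
    return dict(patterns)


# Made-up example data, for illustration only.
calls = [
    Call("r1", "treated", "chr1", 100, True),
    Call("r2", "treated", "chr1", 100, False),
    Call("r1", "treated", "chr1", 105, True),
    Call("r3", "control", "chr1", 100, False),
    Call("r3", "control", "chr1", 500, True),
]
summary = site_summary(calls, "chr1", 90, 110)
patterns = within_read_patterns(calls, "chr1", 90, 110)
```

Keeping one record per read and position is what makes the within-read query possible; a structure that stores only per-locus aggregates would be smaller but could not answer it.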
Desired outcome
I'd like to establish a set of queries of interest and a general abstract idea of what data structure(s) might be appropriate.