PeteHaitch / cometh

An R package with tools for analysing, managing and visualising co-methylation data. Loosely speaking, co-methylation is the correlation structure of DNA methylation.

Design of `CoMeth` class #3

Open PeteHaitch opened 10 years ago

PeteHaitch commented 10 years ago

There is a useful thread on the Bioc-Devel mailing list where Martin Morgan describes the design of a class that extends the SummarizedExperiment class. He also discusses the concept of reference classes, and how and when these are useful, and the creation of generators and generics for the new class.

I need to read and understand this material. My current implementation of the CoMeth class, which builds upon SummarizedExperiment, does some non-standard things. One idea would be to move the pos fields out of the assays slot and into a separate (reference class?) field in the CoMeth object.

PeteHaitch commented 10 years ago

Another discussion of SummarizedExperiment on Bioc-Devel

PeteHaitch commented 10 years ago

Tim Triche Jr stuffs a dgCMatrix inside a SummarizedExperiment in this Bioc-Devel post. The initial post suggests there are problems with this approach; however, [this follow-up post](https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120905/6f26ee8d/attachment.pl) suggests it does in fact work. The entire thread is very informative.

PeteHaitch commented 10 years ago

Tim suggests addSeqInfo() in this post

PeteHaitch commented 10 years ago

Kasper Hansen asks about a data structure that is a matrix with Rle columns in this thread. I've wondered about this myself.

PeteHaitch commented 10 years ago

Some discussion of cbind-ing SummarizedExperiment objects in this thread.

PeteHaitch commented 10 years ago

A quick test suggests that an Rle-based encoding of the counts data will be most efficient. This will probably depend on the m in m-tuples, as larger values of m result in sparser counts data.

    library(Matrix)
    library(GenomicRanges) # its dependencies provide Rle() and RleList()
    # tmp.tsv.gz contains ~3.8M CG 3-tuples
    a <- gzfile('tmp.tsv.gz')
    b <- read.table(a, header = TRUE, sep = '\t',
                    colClasses = c('character', rep('integer', 11)))
    # Columns 5-12 hold the counts data
    d <- as.matrix(b[, 5:12]) # About 115 Mb
    d_M <- Matrix(d) # About 46 Mb
    d_l <- RleList(lapply(seq_len(ncol(d)), function(i) Rle(d[, i]))) # About 24 Mb

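
As a rough illustration of why run-length encoding helps with sparse counts, here is a minimal sketch on simulated data. It is not the test above: the counts are made up (the sparsity level is an assumption), and it uses base R's `rle()` as a stand-in for Bioconductor's `Rle()`, so the sizes will not match the figures quoted for `tmp.tsv.gz`. The printed sizes show how much smaller the encoded form is when most counts are zero.

```r
# Toy illustration only: simulated sparse counts, not the data in tmp.tsv.gz
set.seed(1)
n <- 100000L
# Mostly-zero integer counts (assumed sparsity, chosen for illustration)
counts <- rbinom(n, size = 5L, prob = 0.02)

# Base R run-length encoding: each run's value and length is stored once
enc <- rle(counts)

# The encoding is lossless: decoding recovers the original vector
stopifnot(identical(inverse.rle(enc), counts))

# Compare the memory footprint of the dense and run-length encoded forms
dense_size <- object.size(counts)
rle_size <- object.size(enc)
print(dense_size, units = "Kb")
print(rle_size, units = "Kb")
```
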
PeteHaitch commented 10 years ago

Emailed Bioc-Devel to ask for advice on the design of the CoMeth class.

PeteHaitch commented 10 years ago

Martin Morgan and Tim Triche Jr. replied with suggestions. I need to do some basic benchmarking of these ideas before proceeding.

PeteHaitch commented 10 years ago

Some notes comparing the classes CoMeth and Tuples.

CoMeth

Tuples class suggested by Martin Morgan

PeteHaitch commented 10 years ago

Have decided to base the CoMeth class on SummarizedExperiment. Basically, all the pos data get stored in the rowData and all the counts data get stored in the assays.
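
A minimal sketch of what this layout could look like, using toy data. The assay names (`MMM`, `UUU`), the sample names and the choice to keep the interior position as a metadata column are all assumptions made for illustration, not the package's actual design. Note that in current releases of SummarizedExperiment the range-based row annotation is supplied and accessed as `rowRanges`.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges

# Hypothetical toy data: four 3-tuples on chr1
pos <- cbind(pos1 = c(10L, 20L, 35L, 50L),
             pos2 = c(15L, 28L, 41L, 57L),
             pos3 = c(22L, 33L, 49L, 60L))

# Row data: one range per tuple spanning pos1..pos3; the interior
# position is kept as a metadata column (an assumption of this sketch)
tuples <- GRanges(seqnames = "chr1",
                  ranges = IRanges(start = pos[, "pos1"], end = pos[, "pos3"]),
                  strand = "+",
                  pos2 = pos[, "pos2"])

# Assays: one tuples-by-samples count matrix per methylation pattern
# (only two of the eight 3-tuple patterns shown; names are illustrative)
set.seed(1)
mmm <- matrix(rpois(8, 3), nrow = 4, dimnames = list(NULL, c("s1", "s2")))
uuu <- matrix(rpois(8, 3), nrow = 4, dimnames = list(NULL, c("s1", "s2")))

cometh <- SummarizedExperiment(assays = list(MMM = mmm, UUU = uuu),
                               rowRanges = tuples)

# Positions come back from the row data, counts from the assays
rowRanges(cometh)
assay(cometh, "MMM")
```
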

PeteHaitch commented 10 years ago

There are several complications introduced by using a SummarizedExperiment-based CoMeth class (see https://mailman.stat.ethz.ch/pipermail/bioconductor/2014-March/058487.html for details).

Instead, I will design a Tuples class that stores pos (chr, strand, pos1, ..., posm) and then design an MTuples class that includes the Tuples class along with the count data as matrices.
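
A rough sketch of how the two classes might fit together, written with plain S4 from the `methods` package. The slot names, the validity rules and the use of composition (a `tuples` slot rather than inheritance) are assumptions for illustration only; the eventual design may well differ.

```r
library(methods)

# Hypothetical sketch: slot names and validity rules are assumptions
setClass("Tuples",
         slots = c(chr = "character",
                   strand = "character",
                   pos = "matrix"),  # one row per tuple, columns pos1..posm
         validity = function(object) {
           n <- nrow(object@pos)
           m <- ncol(object@pos)
           if (length(object@chr) != n || length(object@strand) != n)
             return("chr, strand and pos must describe the same number of tuples")
           # Positions within a tuple must be strictly increasing
           if (m > 1L &&
               any(object@pos[, -1L, drop = FALSE] <=
                   object@pos[, -m, drop = FALSE]))
             return("positions must be strictly increasing within each tuple")
           TRUE
         })

setClass("MTuples",
         slots = c(tuples = "Tuples",
                   counts = "list"),  # named list of tuples-by-samples matrices
         validity = function(object) {
           n <- nrow(object@tuples@pos)
           ok <- vapply(object@counts,
                        function(x) is.matrix(x) && nrow(x) == n,
                        logical(1))
           if (!all(ok))
             return("each counts matrix needs one row per tuple")
           TRUE
         })

# Toy example: two 3-tuples, two samples, two count matrices
tuples <- new("Tuples",
              chr = c("chr1", "chr1"),
              strand = c("+", "+"),
              pos = cbind(pos1 = c(10L, 20L),
                          pos2 = c(15L, 28L),
                          pos3 = c(22L, 33L)))

mt <- new("MTuples",
          tuples = tuples,
          counts = list(MMM = matrix(c(3L, 0L, 1L, 2L), nrow = 2),
                        UUU = matrix(c(5L, 1L, 0L, 4L), nrow = 2)))
```

Keeping the positions in their own class means the coordinate logic (ordering, validity) can be tested separately from the counts, which is one way to sidestep the complications mentioned above.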