Open PeteHaitch opened 10 years ago
Tim Triche Jr stuffs a `dgCMatrix` inside a `SummarizedExperiment` in this Bioc-Devel post. The initial post suggests there are problems with this approach; however, [this follow-up post](https://stat.ethz.ch/pipermail/bioc-devel/attachments/20120905/6f26ee8d/attachment.pl) suggests it does in fact work. The entire thread is very informative.
Tim suggests `addSeqInfo()` in this post.
Kasper Hansen asks about a data structure that is a matrix with `Rle` columns in this thread. I've wondered about this myself.
Some discussion of `cbind`-ing `SummarizedExperiment` objects in this thread.
A quick test suggests that an `Rle`-based encoding of the `counts` data will be most efficient. This will probably depend on the `m` in m-tuples, as larger values of `m` result in sparser `counts` data.
```r
library(Matrix)
library(GenomicRanges)
a <- gzfile('tmp.tsv.gz') # tmp.tsv.gz contains ~3.8M CG 3-tuples
b <- read.table(a, header = TRUE, sep = '\t',
                colClasses = c('character', rep('integer', 11)))
d <- as.matrix(b[, 5:12]) # Dense matrix of the 8 count columns; about 115 Mb
d_M <- Matrix(d) # Matrix package representation; about 46 Mb
d_l <- RleList(lapply(seq_len(ncol(d)), function(i) Rle(d[, i]))) # One Rle per column; about 24 Mb
```
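The comparison above depends on a file that is not available here, so the following is a minimal sketch of the same measurement on simulated counts. The Poisson rate, the number of rows, and the variable `sizes` are assumptions chosen for illustration; real m-tuple counts will have different run structure, so the relative sizes may not match the figures quoted above.

```r
library(Matrix)
library(IRanges)  # attaches S4Vectors, which provides Rle(); RleList() is from IRanges

set.seed(1)
n <- 100000L
# Simulated counts dominated by zeros (assumed rate, not taken from real data)
d <- matrix(rpois(n * 8L, lambda = 0.05), ncol = 8L)

d_M <- Matrix(d, sparse = TRUE)                                     # sparse column-compressed matrix
d_l <- RleList(lapply(seq_len(ncol(d)), function(i) Rle(d[, i])))   # one Rle per column

# Memory footprint of each representation, in bytes
sizes <- sapply(list(matrix = d, Matrix = d_M, RleList = d_l),
                function(x) as.numeric(object.size(x)))
print(round(sizes / 1024^2, 2))  # in MiB
```

Running this with different values of `lambda` gives a rough feel for how sparsity changes the ranking of the three encodings.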
Emailed `Bioc-Devel` to ask for advice on the design of the `CoMeth` class.
Martin Morgan and Tim Triche Jr. replied with suggestions. I need to do some basic benchmarking of these ideas before proceeding.
Reply on `Bioc-Devel` thanking them for their responses.
Some notes comparing the classes `CoMeth` and `Tuples`.
`CoMeth`: `CoMeth` object
`Tuples` class suggested by Martin Morgan: `Tuples` object, `TuplesList`
Have decided to base the `CoMeth` class on `SummarizedExperiment`. Basically, all the `pos` data get stored in the `rowData` and all the `counts` data get stored in the `assays`.
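As a rough illustration of that layout, here is a minimal sketch using the current `SummarizedExperiment` package, where genomic row information is passed as `rowRanges` (older releases used `rowData` for this). The positions, counts, assay names `MMM` and `UUU`, and the choice to keep the internal position as a `pos2` metadata column are all assumptions made for the example, not the actual `CoMeth` design.

```r
library(SummarizedExperiment)

# Three made-up CG 3-tuples on chr1
pos <- matrix(c(10L, 20L, 35L,
                20L, 35L, 50L,
                35L, 50L, 80L),
              ncol = 3, byrow = TRUE,
              dimnames = list(NULL, c("pos1", "pos2", "pos3")))

# Each range spans pos1..pos3; the internal position goes in a metadata column
rr <- GRanges("chr1",
              IRanges(start = pos[, "pos1"], end = pos[, "pos3"]),
              strand = "+",
              pos2 = pos[, "pos2"])

# One assay per (hypothetical) methylation pattern; rows are tuples, columns are samples
counts <- list(
  MMM = matrix(c(5L, 0L, 2L), ncol = 1, dimnames = list(NULL, "sample1")),
  UUU = matrix(c(1L, 7L, 0L), ncol = 1, dimnames = list(NULL, "sample1"))
)

se <- SummarizedExperiment(assays = counts, rowRanges = rr)
assay(se, "MMM")   # counts for one pattern
rowRanges(se)      # positions of each tuple
```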
There are several complications introduced by using a `SummarizedExperiment`-based `CoMeth` class (see https://mailman.stat.ethz.ch/pipermail/bioconductor/2014-March/058487.html for details).
Instead, I will design a `Tuples` class that stores `pos` (`chr`, `strand`, `pos1`, `...`, `posm`) and then design a class `MTuples` that includes the `Tuples` class along with the `count` data as matrices.
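One possible reading of this plan, sketched with plain S4 classes from the `methods` package. The slot names (`chr`, `strand`, `pos`, `counts`), the use of `contains` for "includes", and the validity rules are assumptions for illustration only; the notes above fix what is stored, not how.

```r
library(methods)

# Tuples: one row of 'pos' per m-tuple, with m columns (pos1, ..., posm)
setClass("Tuples",
  representation(chr = "character", strand = "character", pos = "matrix"),
  validity = function(object) {
    n <- nrow(object@pos)
    if (length(object@chr) != n || length(object@strand) != n)
      return("'chr' and 'strand' need one element per row of 'pos'")
    TRUE
  })

# MTuples: a Tuples object plus a list of count matrices
setClass("MTuples",
  contains = "Tuples",
  representation(counts = "list"),
  validity = function(object) {
    ok <- vapply(object@counts,
                 function(x) is.matrix(x) && nrow(x) == nrow(object@pos),
                 logical(1))
    if (!all(ok))
      return("each element of 'counts' must be a matrix with one row per tuple")
    TRUE
  })

# Two made-up 3-tuples: (10, 20, 35) and (20, 35, 50)
tuples <- new("Tuples",
              chr = c("chr1", "chr1"),
              strand = c("+", "+"),
              pos = matrix(c(10L, 20L, 20L, 35L, 35L, 50L), ncol = 3))

mt <- new("MTuples", tuples,
          counts = list(MMM = matrix(c(5L, 0L), ncol = 1)))
```

The validity functions reject count matrices whose row count does not match the number of tuples, which keeps positions and counts aligned.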
There is a useful thread on the Bioc-Devel mailing list where Martin Morgan describes the design of a class that extends the `SummarizedExperiment` class. He also discusses the concept of reference classes, and how and when these are useful, and the creation of `generators` and `generics` for the new class.

I need to read and understand this material. My current implementation of the `CoMeth` class, which builds upon `SummarizedExperiment`, does some non-standard things. One idea would be to move the `pos` fields from the `assays` slot and into a separate (reference class?) field in the `CoMeth` object.
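For orientation, a minimal sketch of the generator and generic pattern for a class that extends `SummarizedExperiment`. This is not taken from the thread: the class name `CoMethSketch`, the generic `tupleStarts`, and the toy data are hypothetical, and the reference-class idea is not covered here.

```r
library(SummarizedExperiment)

# setClass() returns a generator function; the leading dot marks it as internal
.CoMethSketch <- setClass("CoMethSketch",
                          contains = "RangedSummarizedExperiment")

# User-facing constructor: build the parent object, then wrap it in the subclass
CoMethSketch <- function(counts, rowRanges) {
  se <- SummarizedExperiment(assays = counts, rowRanges = rowRanges)
  .CoMethSketch(se)
}

# A generic and a method for the new class (hypothetical accessor)
setGeneric("tupleStarts", function(x) standardGeneric("tupleStarts"))
setMethod("tupleStarts", "CoMethSketch", function(x) start(rowRanges(x)))

# Toy data: two tuples, one sample
rr <- GRanges("chr1", IRanges(start = c(10L, 20L), end = c(35L, 50L)))
cm <- CoMethSketch(list(MMM = matrix(c(5L, 0L), ncol = 1)), rr)
tupleStarts(cm)
```

Because `CoMethSketch` inherits from `RangedSummarizedExperiment`, existing accessors such as `assay()` and `rowRanges()` keep working on it without extra code.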