Open m-swirski opened 4 years ago
Hey, thanks for these tests, I just got married, so on honeymoon until 15th of August. Will check it out then.
Congratulations! Sure, I'll just keep opening new threads and resolving problems on my own, and you'll see to it then.
I think any subsetting on names should be filtered before you tile the grl, And since you get the gene index, you can easily add it. Do you have a specific use case where it makes sense to keep the names in the tiled table ?
Hmm, my rationale was just that keeping names, instead of 'rank' numbers would make it easier to spot mistakes or the fact that genes got shuffled along the way - it can happen pretty easily while having GTF and TxDb based on it, because TxDb sorts by seqnames and strand (I don' really know why):
#pseudocode, just to show how it usually looks
strand(unlist(exons))
factor-Rle of length 5698 with 17 runs
Lengths: 470 468 476 442 410 387 372 353 340 372 306 299 227 272 223 240 41
Values : + - + - + - + - + - + - + - + - +
Levels(3): + - *
strand(gtf_gr[gtf_gr$type == "exon"])
factor-Rle of length 5698 with 2975 runs
Lengths: 4 2 2 3 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 ... 1 1 3 1 1 2 2 1 2 5 1 4 2 2 1 1 1 6 3 41
Values : + - + - + - + - + - + - + - + - + - + - ... - + - + - + - + - + - + - + - + - + - +
Levels(3): + - *
And I think that in some cases subsetting by after using coveragePerTiling may be helpful, i.e. without as.data.table = TRUE, you just subset after tiling:
#pseudocode again
covs <- coveragePerTiling(grl, reads)
covs["mygene"]
and with data table it would be very similar if names were present:
covs <- coveragePerTiling(grl, reads, as.data.table = TRUE)
covs[genes == "mygene",]
It's a minor issue, of course, I just thought that having the names would be beneficial, given that a factor column is not significantly heavier than integer column.
Will keep this open, but right now I think it's not worth implementing. Since it might lead to bugs downstream
I've seen some more compelling reasons to have this as an option, will do some tests on how it affects size & speed
I observed no significant difference in performance when having gene names as factor (as in the function in the first post)
We now allow names as long as drop.zero.dt is TRUE, and this is documented in the function. Still considering allowing it when drop.zero.dt is TRUE.
Hi, is it a desired behavior? Size of resulting data.table is very similar if you keep names as factor.
library(ORFik) attach(loadNamespace(ORFik)
only change is the if statement for keep.names within as.data.table if statement.
cpt_nm <- function (grl, reads, is.sorted = FALSE, keep.names = TRUE, as.data.table = FALSE, withFrames = FALSE, weight = "score") { if (!is.sorted) grl <- sortPerGroup(grl) if (is.numeric(weight) | (weight[1] %in% colnames(mcols(reads)))) { coverage <- coverageByTranscriptW(reads, grl, weight = weight) } else coverage <- coverageByTranscript(reads, grl) if (!keep.names) names(coverage) <- NULL if (as.data.table) { window_size <- unique(widthPerGroup(grl, FALSE)) count <- data.table(count = unlist(IntegerList(coverage), use.names = FALSE)) if (keep.names){ count[,
:=
(genes,factor(rep(names(grl), widthPerGroup(grl))))] } else count[,:=
(genes, groupings(coverage))] if (length(window_size) != 1) { count[,:=
(ones, rep.int(1L, length(genes)))] count[,:=
(position, cumsum(ones)), by = genes] count$ones <- NULL } else { count[,:=
(position, rep.int(seq.int(window_size), length(coverage)))] } if (withFrames) { count[,:=
(frame, (position - 1)%%3)] } count[] return(count) } return(coverage) }bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik") footprints <- readBam(bam_file) gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik") txdb <- loadTxdb(gtf_file) tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
result1 <- coveragePerTiling(tx, footprints, as.data.table = TRUE) result2 <- cpt_nm(tx, footprints, as.data.table = TRUE, keep.names = FALSE) result3 <- cpt_nm(tx, footprints, as.data.table = TRUE, keep.names = TRUE) identical(result1,result2) object.size(result1) object.size(result3)