Roleren / ORFik

MIT License
32 stars 9 forks source link

coveragePerTiling(..., as.data.table = TRUE, keep.names = TRUE) strips names #100

Open m-swirski opened 4 years ago

m-swirski commented 4 years ago

Hi, is it a desired behavior? Size of resulting data.table is very similar if you keep names as factor.

library(ORFik) attach(loadNamespace(ORFik)

only change is the if statement for keep.names within as.data.table if statement.

cpt_nm <- function (grl, reads, is.sorted = FALSE, keep.names = TRUE, as.data.table = FALSE, withFrames = FALSE, weight = "score") { if (!is.sorted) grl <- sortPerGroup(grl) if (is.numeric(weight) | (weight[1] %in% colnames(mcols(reads)))) { coverage <- coverageByTranscriptW(reads, grl, weight = weight) } else coverage <- coverageByTranscript(reads, grl) if (!keep.names) names(coverage) <- NULL if (as.data.table) { window_size <- unique(widthPerGroup(grl, FALSE)) count <- data.table(count = unlist(IntegerList(coverage), use.names = FALSE)) if (keep.names){ count[, :=(genes,factor(rep(names(grl), widthPerGroup(grl))))] } else count[, :=(genes, groupings(coverage))] if (length(window_size) != 1) { count[, :=(ones, rep.int(1L, length(genes)))] count[, :=(position, cumsum(ones)), by = genes] count$ones <- NULL } else { count[, :=(position, rep.int(seq.int(window_size), length(coverage)))] } if (withFrames) { count[, :=(frame, (position - 1)%%3)] } count[] return(count) } return(coverage) }

bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik") footprints <- readBam(bam_file) gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik") txdb <- loadTxdb(gtf_file) tx <- exonsBy(txdb, by = "tx", use.names = TRUE)

result1 <- coveragePerTiling(tx, footprints, as.data.table = TRUE) result2 <- cpt_nm(tx, footprints, as.data.table = TRUE, keep.names = FALSE) result3 <- cpt_nm(tx, footprints, as.data.table = TRUE, keep.names = TRUE) identical(result1,result2) object.size(result1) object.size(result3)

Roleren commented 4 years ago

Hey, thanks for these tests, I just got married, so on honeymoon until 15th of August. Will check it out then.

m-swirski commented 4 years ago

Congratulations! Sure, I'll just keep opening new threads and resolving problems on my own, and you'll see to it then.

Roleren commented 4 years ago

I think any subsetting on names should be filtered before you tile the grl, And since you get the gene index, you can easily add it. Do you have a specific use case where it makes sense to keep the names in the tiled table ?

m-swirski commented 4 years ago

Hmm, my rationale was just that keeping names, instead of 'rank' numbers would make it easier to spot mistakes or the fact that genes got shuffled along the way - it can happen pretty easily while having GTF and TxDb based on it, because TxDb sorts by seqnames and strand (I don' really know why):

#pseudocode, just to show how it usually looks
strand(unlist(exons))
factor-Rle of length 5698 with 17 runs
  Lengths: 470 468 476 442 410 387 372 353 340 372 306 299 227 272 223 240  41
  Values :   +   -   +   -   +   -   +   -   +   -   +   -   +   -   +   -   +
Levels(3): + - *
strand(gtf_gr[gtf_gr$type == "exon"])
factor-Rle of length 5698 with 2975 runs
  Lengths:  4  2  2  3  2  1  1  1  1  2  1  2  1  1  1  1  1  1  1  1 ...  1  1  3  1  1  2  2  1  2  5  1  4  2  2  1  1  1  6  3 41
  Values :  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +  - ...  -  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +  -  +
Levels(3): + - *

And I think that in some cases subsetting by after using coveragePerTiling may be helpful, i.e. without as.data.table = TRUE, you just subset after tiling:

#pseudocode again
covs <- coveragePerTiling(grl, reads)
covs["mygene"]

and with data table it would be very similar if names were present:

covs <- coveragePerTiling(grl, reads, as.data.table = TRUE)
covs[genes == "mygene",]

It's a minor issue, of course, I just thought that having the names would be beneficial, given that a factor column is not significantly heavier than integer column.

Roleren commented 4 years ago

Will keep this open, but right now I think it's not worth implementing. Since it might lead to bugs downstream

Roleren commented 4 years ago

I've seen some more compelling reasons to have this as an option, will do some tests on how it affects size & speed

m-swirski commented 4 years ago

I observed no significant difference in performance when having gene names as factor (as in the function in the first post)

Roleren commented 3 years ago

We now allow names as long as drop.zero.dt is TRUE, and this is documented in the function. Still considering allowing it when drop.zero.dt is TRUE.