Closed m-swirski closed 4 years ago
I needed splice-aware shifting function compatible with ORFik very badly, so I just implemented one (I didn't account for softclipping):
applyShifts <- function(footprints, shifts) { selected_shifts <- shifts$offsets_start selected_lengths <- shifts$fraction footprints <- footprints[qwidth(footprints) %in% selected_lengths] lengthsAll <- qwidth(footprints) shiftsAll <- -selected_shifts[match(lengthsAll, selected_lengths)] + 1 shiftsAll[which(footprints@strand == "-")] <- -shiftsAll[which(footprints@strand == "-")] footprints <- qnarrow(footprints, shiftsAll, shiftsAll) footprints <- GRanges(footprints) footprints$size <- lengthsAll return(footprints) }
It shifts reads correctly, but brings another issue; ORFik is very powerful and as stated in the documentation: "The focus of ORFik for development is to be a Swiss army knife for transcriptomics" Which it has a potential to become. Using ideas from ORFik and development in accordance to it is, however, somewhat impeded by strict use of regular functions rather than class-specific methods of defined generic functions. What I mean is that ideally there would be "shiftFootprints" generic function defined in the shiftFootprints.R script. In the case of bioconductor core packages its syntax is universally:
setGeneric("newfunction", function(x) standardGeneric("newfunction"))
But it leaves little room for applying the function for classes on which no method of such name is defined (i.e. if you want exactly the same function deployed for more than one class, you have to set a method for each class) It can be dealt with by defining general function and then referring to it in multiple short methods or defining default function, in this case it would be:
setGeneric("shiftFootprints", function(footprints, shifts) {your function})
The benefit of doing it this way is that any further developer can easily modify/reimplement your function for novel, package-specific, classes; in my case, I would simply have to:
setMethod("shiftFootprints", signature("GAlignments", "data.table"), function(x,y) applyShifts(x,y))
Thus avoiding redundant naming (it can be really hard to get grasp of a new package when all function names are new, while these could be simply methods named as usual but defined on a new class) There is more on that here: http://adv-r.had.co.nz/S4.html
Sorry for giving advices/suggestions - I realize rebuilding the package so it follows structured generic functions definitions and setting methods for each class is a lot of hassle at this point, but it makes it much easier to develop based on the package down the line, what I think will be very important in this case. ORFik extends capabilities of bioconductor so much that it honestly should become one of its core package.
Best wishes, Michał
Okay, I have some more issues in regard to this thread:
1.I noticed is that shiftFootprints doesnt work on GRanges objects, only on GAlignments (so far ungapped), yet the only way to dereplicate reads (collapse) is convertToOneBasedRanges - is it done so on purpose? (you can do the convertToOneBasedRanges(addScoreColumn=TRUE) afterwards)
shiftFootprints_usemcols <- function (footprints, shifts, use.mcols=TRUE) { if (!is(shifts, "data.frame")) stop("shifts must be data.frame/data.table") if (nrow(shifts) == 0) stop("No shifts found in data.frame") if (is.null(shifts$fraction) | is.null(shifts$offsets_start)) stop("Either fraction or offsets_start column in shifts is not set!") selected_lengths <- shifts$fraction selected_shifts <- shifts$offsets_start lengthsAll <- readWidths(footprints, along.reference = TRUE) validLengths <- lengthsAll %in% selected_lengths lengthsAll <- lengthsAll[validLengths] footprints <- footprints[validLengths] cigarsAll <- cigar(footprints) shiftsAll <- -selected_shifts[chmatch(as.character(lengthsAll), as.character(selected_lengths))] is_pos <- strandBool(footprints) starts <- rep(NA, length(cigarsAll)) ends <- starts starts[is_pos] <- shiftsAll[is_pos] + 1 ends[!is_pos] <- lengthsAll[!is_pos] - shiftsAll[!is_pos] shiftedCigar <- cigarNarrow(cigarsAll, starts, ends) pos <- start(footprints) + unlist(attributes(shiftedCigar), use.names = FALSE) shifted <- GAlignments(seqnames(footprints), pos = pos, cigar = shiftedCigar, strand = strand(footprints)) shifted <- convertToOneBasedRanges(shifted) if (use.mcols) mcols(shifted) <- mcols(footprints) #added line shifted$size <- lengthsAll message("Sorting shifted footprints...") shifted <- sortSeqlevels(shifted) shifted <- sort(shifted) return(shifted) }
library(ORFik) library(data.table)
applyShifts <- function(footprints, shifts) { selected_shifts <- shifts$offsets_start selected_lengths <- shifts$fraction footprints <- footprints[readWidths(footprints) %in% selected_lengths] lengthsAll <- readWidths(footprints) shiftsAll <- -selected_shifts[match(lengthsAll, selected_lengths)] + 1 shiftsAll[which(footprints@strand == "-")] <- -shiftsAll[which(footprints@strand == "-")] shifted <- qnarrow(footprints, shiftsAll, shiftsAll) shifted <- GRanges(shifted) shifted$size <- lengthsAll message("Sorting shifted footprints...") shifted <- sortSeqlevels(shifted) shifted <- sort(shifted) return(shifted) }
bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik") footprints <- readBam(bam_file) gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik") txdb <- loadTxdb(gtf_file) tx <- exonsBy(txdb, by = "tx", use.names = TRUE) cds <- cdsBy(txdb, by = "tx", use.names = TRUE) footprints@elementMetadata$id <- 1:length(footprints) #sorting somehow works differently for our functions, because results are different for gapped reads, yet I want to be able to match initial reads correctly. shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE) shiftedFootprints_classic <- shiftFootprints(footprints, shifts) shiftedFootprints_withmcols <- shiftFootprints_usemcols(footprints,shifts)
shiftedFootprints_withmcols@elementMetadata$id <- NULL identical(shiftedFootprints_withmcols,shiftedFootprints_classic) #TRUE shiftedFootprints_withmcols <- shiftFootprints_usemcols(footprints,shifts) #lt's still neeeded shiftedFootprints_alternativemethod <- applyShifts(footprints,shifts) identical(shiftedFootprints_alternativemethod$id,shiftedFootprints_withmcols$id) sf1 <- shiftedFootprints_withmcols #make variables shorter sf2 <- shiftedFootprints_alternativemethod sf1_matched <- sf1[match(sf2$id,sf1$id)] identical(sf1_matched$id, sf2$id) identical(sf1_matched, sf2) noequal <- sf1_matched$id[which(sf1_matched != sf2)] ftprints_nonequal <- footprints[noequal] ftprints_equal <- footprints[-noequal] which(qwidth(ftprints_nonequal) == width(ftprints_nonequal)) # all non-matchers are gapped ftprints_equal[which(qwidth(ftprints_equal) != width(ftprints_equal))] # all gapped matchers have their 5' matching block longer than abs(shift) sf1_nmatchers <- shiftFootprints_usemcols(ftprints_nonequal, shifts) sf2_nmatchers <- applyShifts(ftprints_nonequal, shifts) identical(sf2_nmatchers$id,ftprints_nonequal@elementMetadata$id) # TRUE, the order is kept now identical(sf1_nmatchers$id,ftprints_nonequal@elementMetadata$id) #TRUE
sf1_nmatchers sf2_nmatchers ftprints_nonequal
Does it work for you? Additionally check system.time of both methods. Not that crucial in this case, but still, roughly 2 fold faster for large datasets.
I was wondering about the clipping issues - the S or H flags would normally happen only if reads were not aligned in end-to-end mode. In ribo-seq it would happen rarely, I guess, and if at all then maybe it would be easier to remove corresponding cooridnates from reads (trim the S or H) and not bother with it afterwards?
Working on this now, will see if it is not correct.
About ORFik as part of bioconductor core, I am talking to Martin Morgan about this, but not sure if I should subset part of the package out then, but that is up to him what he thinks is best.
You are correct, ORFik is supposed to handle spliced / gapped reads and this was purely a test I made that should not have been updated into branch, since it used to use qnarrow which is the correct function.
I will update it now, and add a sort argument, so you can choose to sort output or not, for easier validation of bugs. I added your test data to tests that must be passed in ORFik
Should have a new update by 30 minutes.
Check out latest branch on github, should now work as expected. Let me know if there is anything else relating to this issue.
I’ll be back from vacation on Thursday, will check it out then.
Current tests shows this is now working correctly.
Hey, as in the title: I found that the function doesn't shift reads correctly, it is kind of restricted.
footprints.txt
library(ORFik) library(data.table)
dataframeAsGAlignments <- function(x) { GAlignments(seqnames = as.factor(x$seqnames), cigar = x$cigar,pos = x$start, strand = factor(x$strand, levels = c("+", "-", "*"))) } # why GAlignments(data.frame(gal)) doesnt return initial object is not understandable, but it's a fixer.
footprints <- read.table("footprints.txt") footprints <- dataframeAsGAlignments(footprints) shift12 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -12)) shift5 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -5)) identical(shift12[2], shift5[2]) # returns TRUE
shift21 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -21)) shift24 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -24)) identical(shift21[1], shift24[1]) #returns TRUE
The problem seems symmetrical for + and - strand, I didn't go as far in the source code to resolve the issue, but perhaps using qnarrow() would be the easiest thing to do here?