Roleren / ORFik

MIT License
33 stars 9 forks source link

shiftFootprints on GAlignments reads doesn't work correctly for gapped reads #97

Closed m-swirski closed 4 years ago

m-swirski commented 4 years ago

Hey, as in the title: I found that the function doesn't shift reads correctly, it is kind of restricted.

footprints.txt

library(ORFik) library(data.table)

dataframeAsGAlignments <- function(x) { GAlignments(seqnames = as.factor(x$seqnames), cigar = x$cigar,pos = x$start, strand = factor(x$strand, levels = c("+", "-", "*"))) } # why GAlignments(data.frame(gal)) doesnt return initial object is not understandable, but it's a fixer.

footprints <- read.table("footprints.txt") footprints <- dataframeAsGAlignments(footprints) shift12 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -12)) shift5 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -5)) identical(shift12[2], shift5[2]) # returns TRUE

shift21 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -21)) shift24 <- shiftFootprints(footprints, data.table(fraction = 29, offsets_start = -24)) identical(shift21[1], shift24[1]) #returns TRUE

The problem seems symmetrical for + and - strand, I didn't go as far in the source code to resolve the issue, but perhaps using qnarrow() would be the easiest thing to do here?

m-swirski commented 4 years ago

I needed splice-aware shifting function compatible with ORFik very badly, so I just implemented one (I didn't account for softclipping):

applyShifts <- function(footprints, shifts) { selected_shifts <- shifts$offsets_start selected_lengths <- shifts$fraction footprints <- footprints[qwidth(footprints) %in% selected_lengths] lengthsAll <- qwidth(footprints) shiftsAll <- -selected_shifts[match(lengthsAll, selected_lengths)] + 1 shiftsAll[which(footprints@strand == "-")] <- -shiftsAll[which(footprints@strand == "-")] footprints <- qnarrow(footprints, shiftsAll, shiftsAll) footprints <- GRanges(footprints) footprints$size <- lengthsAll return(footprints) }

It shifts reads correctly, but brings another issue; ORFik is very powerful and as stated in the documentation: "The focus of ORFik for development is to be a Swiss army knife for transcriptomics" Which it has a potential to become. Using ideas from ORFik and development in accordance to it is, however, somewhat impeded by strict use of regular functions rather than class-specific methods of defined generic functions. What I mean is that ideally there would be "shiftFootprints" generic function defined in the shiftFootprints.R script. In the case of bioconductor core packages its syntax is universally:

setGeneric("newfunction", function(x) standardGeneric("newfunction"))

But it leaves little room for applying the function for classes on which no method of such name is defined (i.e. if you want exactly the same function deployed for more than one class, you have to set a method for each class) It can be dealt with by defining general function and then referring to it in multiple short methods or defining default function, in this case it would be:

setGeneric("shiftFootprints", function(footprints, shifts) {your function})

The benefit of doing it this way is that any further developer can easily modify/reimplement your function for novel, package-specific, classes; in my case, I would simply have to:

setMethod("shiftFootprints", signature("GAlignments", "data.table"), function(x,y) applyShifts(x,y))

Thus avoiding redundant naming (it can be really hard to get grasp of a new package when all function names are new, while these could be simply methods named as usual but defined on a new class) There is more on that here: http://adv-r.had.co.nz/S4.html

Sorry for giving advices/suggestions - I realize rebuilding the package so it follows structured generic functions definitions and setting methods for each class is a lot of hassle at this point, but it makes it much easier to develop based on the package down the line, what I think will be very important in this case. ORFik extends capabilities of bioconductor so much that it honestly should become one of its core package.

Best wishes, Michał

m-swirski commented 4 years ago

Okay, I have some more issues in regard to this thread:

1.I noticed is that shiftFootprints doesnt work on GRanges objects, only on GAlignments (so far ungapped), yet the only way to dereplicate reads (collapse) is convertToOneBasedRanges - is it done so on purpose? (you can do the convertToOneBasedRanges(addScoreColumn=TRUE) afterwards)

  1. shiftFootprints(footprints,shifts) strips mcols(footprints), that may be inconvenient. It's enough to add one line to shiftFootprints, to avoid it. I'll do that in an example below, because reads sorting works differently when new GAlignments object is created in the shiftFootprints function (that's another thing - creating new object may strip more features than just the elementMetadata;

shiftFootprints_usemcols <- function (footprints, shifts, use.mcols=TRUE) { if (!is(shifts, "data.frame")) stop("shifts must be data.frame/data.table") if (nrow(shifts) == 0) stop("No shifts found in data.frame") if (is.null(shifts$fraction) | is.null(shifts$offsets_start)) stop("Either fraction or offsets_start column in shifts is not set!") selected_lengths <- shifts$fraction selected_shifts <- shifts$offsets_start lengthsAll <- readWidths(footprints, along.reference = TRUE) validLengths <- lengthsAll %in% selected_lengths lengthsAll <- lengthsAll[validLengths] footprints <- footprints[validLengths] cigarsAll <- cigar(footprints) shiftsAll <- -selected_shifts[chmatch(as.character(lengthsAll), as.character(selected_lengths))] is_pos <- strandBool(footprints) starts <- rep(NA, length(cigarsAll)) ends <- starts starts[is_pos] <- shiftsAll[is_pos] + 1 ends[!is_pos] <- lengthsAll[!is_pos] - shiftsAll[!is_pos] shiftedCigar <- cigarNarrow(cigarsAll, starts, ends) pos <- start(footprints) + unlist(attributes(shiftedCigar), use.names = FALSE) shifted <- GAlignments(seqnames(footprints), pos = pos, cigar = shiftedCigar, strand = strand(footprints)) shifted <- convertToOneBasedRanges(shifted) if (use.mcols) mcols(shifted) <- mcols(footprints) #added line shifted$size <- lengthsAll message("Sorting shifted footprints...") shifted <- sortSeqlevels(shifted) shifted <- sort(shifted) return(shifted) }

  1. ORFik sample data contains gapped reads, so I could have actually provided an example and comparison with my function on that. Here it is,:

library(ORFik) library(data.table)

applyShifts <- function(footprints, shifts) { selected_shifts <- shifts$offsets_start selected_lengths <- shifts$fraction footprints <- footprints[readWidths(footprints) %in% selected_lengths] lengthsAll <- readWidths(footprints) shiftsAll <- -selected_shifts[match(lengthsAll, selected_lengths)] + 1 shiftsAll[which(footprints@strand == "-")] <- -shiftsAll[which(footprints@strand == "-")] shifted <- qnarrow(footprints, shiftsAll, shiftsAll) shifted <- GRanges(shifted) shifted$size <- lengthsAll message("Sorting shifted footprints...") shifted <- sortSeqlevels(shifted) shifted <- sort(shifted) return(shifted) }

shiftFootprints_usemcols <- function from above, lets leave the boilerplate

bam_file <- system.file("extdata", "ribo-seq.bam", package = "ORFik") footprints <- readBam(bam_file) gtf_file <- system.file("extdata", "annotations.gtf", package = "ORFik") txdb <- loadTxdb(gtf_file) tx <- exonsBy(txdb, by = "tx", use.names = TRUE) cds <- cdsBy(txdb, by = "tx", use.names = TRUE) footprints@elementMetadata$id <- 1:length(footprints) #sorting somehow works differently for our functions, because results are different for gapped reads, yet I want to be able to match initial reads correctly. shifts <- detectRibosomeShifts(footprints, txdb, stop = TRUE) shiftedFootprints_classic <- shiftFootprints(footprints, shifts) shiftedFootprints_withmcols <- shiftFootprints_usemcols(footprints,shifts)

check if shiftFootprints_usemcols works identically:

shiftedFootprints_withmcols@elementMetadata$id <- NULL identical(shiftedFootprints_withmcols,shiftedFootprints_classic) #TRUE shiftedFootprints_withmcols <- shiftFootprints_usemcols(footprints,shifts) #lt's still neeeded shiftedFootprints_alternativemethod <- applyShifts(footprints,shifts) identical(shiftedFootprints_alternativemethod$id,shiftedFootprints_withmcols$id) sf1 <- shiftedFootprints_withmcols #make variables shorter sf2 <- shiftedFootprints_alternativemethod sf1_matched <- sf1[match(sf2$id,sf1$id)] identical(sf1_matched$id, sf2$id) identical(sf1_matched, sf2) noequal <- sf1_matched$id[which(sf1_matched != sf2)] ftprints_nonequal <- footprints[noequal] ftprints_equal <- footprints[-noequal] which(qwidth(ftprints_nonequal) == width(ftprints_nonequal)) # all non-matchers are gapped ftprints_equal[which(qwidth(ftprints_equal) != width(ftprints_equal))] # all gapped matchers have their 5' matching block longer than abs(shift) sf1_nmatchers <- shiftFootprints_usemcols(ftprints_nonequal, shifts) sf2_nmatchers <- applyShifts(ftprints_nonequal, shifts) identical(sf2_nmatchers$id,ftprints_nonequal@elementMetadata$id) # TRUE, the order is kept now identical(sf1_nmatchers$id,ftprints_nonequal@elementMetadata$id) #TRUE

now it can be manually examined which method works correctly; I could do something for automized check, but applyShifts is pretty much it.

sf1_nmatchers sf2_nmatchers ftprints_nonequal

Does it work for you? Additionally check system.time of both methods. Not that crucial in this case, but still, roughly 2 fold faster for large datasets.

I was wondering about the clipping issues - the S or H flags would normally happen only if reads were not aligned in end-to-end mode. In ribo-seq it would happen rarely, I guess, and if at all then maybe it would be easier to remove corresponding cooridnates from reads (trim the S or H) and not bother with it afterwards?

Roleren commented 4 years ago

Working on this now, will see if it is not correct.

About ORFik as part of bioconductor core, I am talking to Martin Morgan about this, but not sure if I should subset part of the package out then, but that is up to him what he thinks is best.

Roleren commented 4 years ago

You are correct, ORFik is supposed to handle spliced / gapped reads and this was purely a test I made that should not have been updated into branch, since it used to use qnarrow which is the correct function.

I will update it now, and add a sort argument, so you can choose to sort output or not, for easier validation of bugs. I added your test data to tests that must be passed in ORFik

Should have a new update by 30 minutes.

Roleren commented 4 years ago

Check out latest branch on github, should now work as expected. Let me know if there is anything else relating to this issue.

m-swirski commented 4 years ago

I’ll be back from vacation on Thursday, will check it out then.

Roleren commented 4 years ago

Current tests shows this is now working correctly.