VCCRI / Sierra

Discover differential transcript usage from polyA-captured single cell RNA-seq data
GNU General Public License v3.0
49 stars 17 forks source link

Error in CountPeaks() #35

Closed fciamponi closed 3 years ago

fciamponi commented 3 years ago

Hello Sierra team,

I was hoping to get some help regarding an issue (see below) that I'm having when running Sierra CountPeaks with a specific sample (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR13043563 and SRR13043565). The pipeline works fine with both the vignette dataset and with other samples from the same experiment/patient (ex. SRR13043564).

I've tried some of the common suggestions found in other issues (ex. checking my gtf annotations, barcode file, bam file index and etc...) but so far nothing worked.

I'm using cellranger's annotation as a reference (refdata-gex-GRCh38-2020-A) and mapped my samples with STAR aligner (since the original experiment is MARS-seq) and followed STARsolo guidelines to make it as close to cellranger as possible.

Thanks in advance for any help you can provide.

Best regards, Felipe

Error in {: task 32 failed - "1 elements in value to replace 0 elements" Traceback:

  1. CountPeaks(peak.sites.file = peak.merge.output.file, gtf.file = reference.file, . bamfile = bam.file, whitelist.file = whitelist.bc.file, output.dir = outfile, . countUMI = TRUE, ncores = 16)
  2. foreach::foreach(each.chr = chr.names, .combine = "rbind", .packages = c("magrittr")) %dopar% . { . mat.per.chr <- c() . message("Processing chr: ", each.chr) . for (strand in c(1, -1)) { . message(" and strand ", strand) . isMinusStrand <- if (strand == 1) . FALSE . else TRUE . peak.sites.chr <- dplyr::filter(peak.sites, Chr == . each.chr & Strand == strand) %>% dplyr::select(Gene, . Chr, Fit.start, Fit.end, Strand) . peak.sites.chr$Fit.start <- as.integer(peak.sites.chr$Fit.start) . peak.sites.chr$Fit.end <- as.integer(peak.sites.chr$Fit.end) . peak.sites.chr <- dplyr::filter(peak.sites.chr, Fit.start < . Fit.end) . if (nrow(peak.sites.chr) == 0) { . next . } . isMinusStrand <- if (strand == 1) . FALSE . else TRUE . which <- GenomicRanges::GRanges(seqnames = each.chr, . ranges = IRanges::IRanges(1, max(peak.sites.chr$Fit.end))) . param <- Rsamtools::ScanBamParam(tag = c(CBtag, UMItag), . which = which, flag = Rsamtools::scanBamFlag(isMinusStrand = isMinusStrand)) . aln <- GenomicAlignments::readGAlignments(bamfile, . param = param) . nobarcodes <- which(unlist(is.na(GenomicRanges::mcols(aln)[CBtag]))) . noUMI <- which(unlist(is.na(GenomicRanges::mcols(aln)[UMItag]))) . to.remove <- dplyr::union(nobarcodes, noUMI) . if (length(to.remove) > 0) { . aln <- aln[-to.remove] . } . whitelist.pos <- which(unlist(GenomicRanges::mcols(aln)[CBtag]) %in% . whitelist.bc) . aln <- aln[whitelist.pos] . if (countUMI) { . GenomicRanges::mcols(aln)$CBUB <- paste0(unlist(GenomicRanges::mcols(aln)[CBtag]), . "", unlist(GenomicRanges::mcols(aln)[UMItag])) . uniqUMIs <- which(!duplicated(GenomicRanges::mcols(aln)$CB_UB)) . aln <- aln[uniqUMIs] . } . aln <- GenomicRanges::split(aln, unlist(GenomicRanges::mcols(aln)[CBtag])) . polyA.GR <- GenomicRanges::GRanges(seqnames = peak.sites.chr$Chr, . IRanges::IRanges(start = peak.sites.chr$Fit.start, . end = as.integer(peak.sites.chr$Fit.end))) . n.polyA <- length(polyA.GR) . barcodes.gene <- names(aln) . res <- sapply(barcodes.gene, function(x) GenomicRanges::countOverlaps(polyA.GR, . aln[[x]])) . res.mat <- matrix(0L, nrow = n.polyA, ncol = n.bcs) . res.mat[, match(barcodes.gene, whitelist.bc)] <- res . mat.per.strand <- Matrix::Matrix(res.mat, sparse = TRUE) . polyA.ids <- paste0(peak.sites.chr$Gene, ":", peak.sites.chr$Chr, . ":", peak.sites.chr$Fit.start, "-", peak.sites.chr$Fit.end, . ":", peak.sites.chr$Strand) . rownames(mat.per.strand) <- polyA.ids . if (is.null(mat.per.chr)) { . mat.per.chr <- mat.per.strand . } . else { . mat.per.chr <- rbind(mat.per.chr, mat.per.strand) . } . } . return(mat.per.chr) . }
  3. e$fun(obj, substitute(ex), parent.frame(), e$data)
rj-patrick commented 3 years ago

Hi @Fciamponi,

Thanks for the information. The easiest thing would be if we could reproduce the error on our end. If you were to take a subset of the bam does it still produce the error? If so, would you please send it (along with the peak whitelist files) and we can work out what's going on.

Cheers, Ralph

fciamponi commented 3 years ago

Thanks for the quick reply @Reprobate.

Yes, I reproduced the error with a small subset of the original bam file (with just 10% of total reads). I'm including the sample here, along with the barcodes and merged peaks information.

However, I also noticed that the same error occured when I create randomized subsets for previsously working .bam files (such as the case with SRR13043564). I then proceeded to create a subset that contains 20% of the reads from each cell (instead of drawing 10~50% of reads randomly from the original bam file) but the problem persisted.

I've uploaded the files to a google drive folder (https://drive.google.com/drive/folders/1IhiZJ_86R2MeEgdT_M6Jaxnz_qssAHbK?usp=sharing). In this case, SRR13043564 runs without issues but SRR13043563 gives the error I mentioned in the opening comment.

In both cases, my command line was (just switching replacing 63-to-64 when necessary):

dir.create('./peakCountsTest/',recursive = TRUE)

whitelist.bc.file <- ./'CellBarcodes.COHEN2021.unique.txt' reference.file <- './genes.gtf' bam.file <- '../SRR13043564.bam' sample <- 'SRR13043564'

outfile <- paste0('./peakCountsTest/',sample,'.counts')

peak.merge.output.file <- "./mergedPeaks.txt"

CountPeaks(peak.sites.file = peak.merge.output.file, gtf.file = reference.file, bamfile = bam.file, whitelist.file = whitelist.bc.file, output.dir = outfile, countUMI = TRUE, ncores = 16)

Best, Felipe

PS. Sorry for taking so long to reply, it was a very busy week.

rj-patrick commented 3 years ago

Hi Felipe,

Thanks for the data, it was very helpful for debugging. The issue was traced to an evidently rare situation where there was a single peak called in a chromosome, but with no barcodes in the whitelist. I've added a check for this situation which should fix it - I'm now able to run CountPeaks on your example data without error. Let me know if you have any more issues.

Cheers, Ralph

fciamponi commented 3 years ago

Thanks, it worked perfectly now!

Sorry for the inconvenience.

Best, Felipe

rj-patrick commented 3 years ago

No worries, thanks for helping us find that bug!

Cheers, Ralph