JEFworks-Lab / HoneyBADGER

HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
http://jef.works/HoneyBADGER/
GNU General Public License v3.0
95 stars 31 forks source link

error about "getCellAlleleCount" command line #17

Closed zhuxqdoctor closed 4 years ago

zhuxqdoctor commented 5 years ago

Hi, I am trying to use HonryBADGER to call CNV from 10X Genomics RNAseq data. Initially, I used the snps matrix from the HoneyBadger package and loaded my own bam, bai and barcode file to run

results <- getCellAlleleCount(snps, bamFiles, indexFiles, cellBarcodes) However, I got error like this: Error in .io_bam(.c_Pileup, file, bamReverseComplement(scanBamParam), : seqlevels(param) not in BAM header: seqlevels: ‘NC_007605’. I used the Rsamtools to check the header of BAM file and there was no "NC_007605" in the header. I have googled about NC_007605. It means "Human gammaherpesvirus 4, complete genome". So what does it work for here. How can I figure this out. Really appreciated if anyone can help. Thanks.

JEFworks commented 5 years ago

Hi,

The error is caused by your snps variable containing a genomic ranges coordinate with seqlevel ‘NC_007605’ that is not present in your bam.

Can you please try the getSnpMats10X() function instead if you have not done so already? I have previously encountered this issue and thought I had addressed it by filtering to relevant seqlevels in getSnpMats10X().

Also, please double check that your bam is aligned to the same genome version as your snps. Note the built-in ExAC snps are from hg19.

Best, Jean

helianthuszhu commented 5 years ago

Hi, Jean, Thanks for your reply. Based on your suggestion, I tried to use getSnpMats10X() command line, unfortunately also failed and got the same errors. I am sure my bam file were aligned from hg19 genome version which means it can be matched with the snps from ExAC just in the honeyBADGER package. So, do you have any other suggestion for me? Thanks.

JEFworks commented 5 years ago

Hum, in that case, please try the following:

# load ExAC chromosome 1 snps
load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
head(snps)
seqlevels(snps)

# filter for only those with seqlevels in canonical chromosomes
chrs <- c(as.character(1:22), "X", "Y")
seqlevels(snps) <- chrs
head(snps)
seqlevels(snps)

Now you should be able to do

results <- getSnpMats10X(snps, bamFiles, indexFiles, barcodes)

If it works, I can update the getSnpMats10X() to do this by default.

helianthuszhu commented 5 years ago

Hi, Jean, Thanks again. This time I got a different error like this in the pileup step: [1] "Getting pileup..." Error in validObject(.Object) : invalid class “ScanBamParam” object: 'tagFilter' must contain only non-NULL, non-NA, non-empty character or integer values. Does it sound like the bam file is lack of something? Or should I filter or modify my bam file? Thanks.

JEFworks commented 5 years ago

Did you use CellRanger to get your bam? Do you know which version of CellRanger was used? It looks like your reads may be missing the 'CB' tag, which is used to assign each read to a cell barcode. Or perhaps CellRanger updated their tag name.

For example, here is a read from a 10X bam. Note the CB tag.

> samtools view aml027_post_transplant_possorted_genome_bam.bam | less
NB500915:165:HYLMFBGXX:3:23402:11145:13415      16      1       10544   3       68M30S  *       0       0       AAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCCCCATGTACTCTGCGTTGATACCACTGCTT      <EE/EEE6EEE6EEAE6EEA/AEE<EEE/EAEEEEE<EEAAEEAEEAAEAEEE/EEEEEEEEEEEEEEEEEEE6EAEEEEEAE/EEEEEEEEEAAAA/      NH:i:2  HI:i:1  AS:i:64 nM:i:1  NM:i:1  CR:Z:TAAAGACTTCTAGG     CQ:Z:AAAAA6EEEEEEEE     CB:Z:TAAAGACTTCTAGG-2   UR:Z:TCCTGTCGTT UQ:Z:AAAAAEEE/A UB:Z:TCCTGTCGTT BC:Z:CTATACGC   QT:Z:EEEAAAAA
helianthuszhu commented 5 years ago

Hi, Jean, Thanks for your reply. Yes, the bam file was derived from CellRanger (version, 2.0.1). Also I checked the bam file using samtools and the CB tag was in it. But it seemed a little different from your example. Here are the details. So is this the reason caused? C00135:251:CB8E5ANXX:8:1308:1504:73720 16 1 10551 255 70M31S * 0 0 TGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCAACTCCGCCCCCCATGTACTCTGCGTTGATACCACTGCTT /BF/B7//F<B77/77/7/FB/<77//F7/F7/</<77/F<FFFFB///7FFF7B/<FFFFFBB<BF7/<FBFFF<FFFFBFFFFB/FFFFFFFF</B</B NH:i:1 HI:i:1 AS:i:66 nM:i:1 RE:A:I CR:Z:CTTTGCGCAATGTTGC CY:Z:B/<BB/<FBBFFFF</ CB:Z:CTTTGCGCAATGTTGC-1 UR:Z:TTGAATCTCA UY:Z:FBB<FFFBF/ UB:Z:TTGAATCTCA BC:Z:AACGTCAA QT:Z:BBBBBFFF TR:Z:TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAAAAATTTTTTTTAAAAAATTTTTTTTAAAAATAAATTATTTT TQ:Z:BFBFFFFFFFFFFFFFFFFFFFFFFFFB7B//B7B///BBB/<<7//77<</<BFB/<///////////////// RG:Z:SYL1_hg19:MissingLibrary:1:CB8E5ANXX:8

在 2018年12月3日,上午7:37,Jean Fan notifications@github.com 写道:

Did you use CellRanger to get your bam? Do you know which version of CellRanger was used? It looks like your reads may be missing the 'CB' tag, which is used to assign each read to a cell barcode. Or perhaps CellRanger updated their tag name.

For example, here is a read from a 10X bam. Note the CB tag.

samtools view aml027_post_transplant_possorted_genome_bam.bam | less NB500915:165:HYLMFBGXX:3:23402:11145:13415 16 1 10544 3 68M30S * 0 0 AAATCTGTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGAACGCCCCATGTACTCTGCGTTGATACCACTGCTT <EE/EEE6EEE6EEAE6EEA/AEE<EEE/EAEEEEE<EEAAEEAEEAAEAEEE/EEEEEEEEEEEEEEEEEEE6EAEEEEEAE/EEEEEEEEEAAAA/ NH:i:2 HI:i:1 AS:i:64 nM:i:1 NM:i:1 CR:Z:TAAAGACTTCTAGG CQ:Z:AAAAA6EEEEEEEE CB:Z:TAAAGACTTCTAGG-2 UR:Z:TCCTGTCGTT UQ:Z:AAAAAEEE/A UB:Z:TCCTGTCGTT BC:Z:CTATACGC QT:Z:EEEAAAAA — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JEFworks/HoneyBADGER/issues/17#issuecomment-443553186, or mute the thread https://github.com/notifications/unsubscribe-auth/AqcSaEcusojqyLzILZSh1a5aE75Ftzwhks5u1GREgaJpZM4Yazsi.

JEFworks commented 5 years ago

As long as the CB tag is present in all reads, it should be fine. To rule out the possibility of this being an issue with R/RSamtools, can you please try the python version for getting the allele counts information: https://github.com/barkasn/scAlleleCount

helianthuszhu commented 5 years ago

Hi, Jean, Thanks for your reply. I will try it and see if it works. Really appreciated for this.

Jean Fan notifications@github.com于2019年1月31日 周四上午3:27写道:

As long as the CB tag is present in all reads, it should be fine. To rule out the possibility of this being an issue with R/RSamtools, can you please try the python version for getting the allele counts information: https://github.com/barkasn/scAlleleCount

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/JEFworks/HoneyBADGER/issues/17#issuecomment-459076395, or mute the thread https://github.com/notifications/unsubscribe-auth/AqcSaJMYs6uFoW9X3vsnXYPMmRWwHTiNks5vIfIzgaJpZM4Yazsi .

joeymays commented 5 years ago

Hum, in that case, please try the following:

# load ExAC chromosome 1 snps
load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
head(snps)
seqlevels(snps)

# filter for only those with seqlevels in canonical chromosomes
chrs <- c(as.character(1:22), "X", "Y")
seqlevels(snps) <- chrs
head(snps)
seqlevels(snps)

Now you should be able to do

results <- getSnpMats10X(snps, bamFiles, indexFiles, barcodes)

If it works, I can update the getSnpMats10X() to do this by default.

Following up on this part of the discussion, I was having a similar problem where the snps object contained genomic ranges corresponding to GL unlocalized sequences. Filtering on canonical seqlevels as suggested above solved the issue for me.

pangxueyu233 commented 5 years ago

Hello, Jean it's appreciated working to analysis CNV in single cell transcriptome level. and recently, I used getSnpMats10X() to import our 10x data, and I found some problems as following:

that's the code of getSnpMats10X(): getSnpMats10X <- function (snps, bamFiles, indexFiles, barcodes, n.cores = 1, verbose = FALSE) { cov <- getCoverage(snps, bamFile, indexFile, verbose) if (verbose) { print("Snps with coverage:") print(table(cov > 0)) } vi <- cov > 0 if (sum(vi) == 0) { print("ERROR: NO SNPS WITH COVERAGE. Check if snps and bams are using the same alignment reference.") return(NULL) } ......

And have you noticed that parameters are not paired in following function getSnpMats10X <- function (snps, bamFiles, indexFiles, barcodes, n.cores = 1, verbose = FALSE) {....}

and following function like getCoverage() used bamFile and indexFile without 's' cov <- getCoverage(snps, bamFile, indexFile, verbose)

So, I just fixed it up by remove 's' in code of getSnpMats10X. Hope that's right for following analysis.

ccruizm commented 4 years ago

Hello @JEFworks! Thanks for this amazing tools. I have been trying to use it in my own data (scRNA 10x V3 - CellRanger 3.0.2) but have had some issues. In the beginning I use the function getCellAlleleCount and had the same issue posted by @zhuxqdoctor (also using the snps matrix from the HoneyBadger). Then after reading this thread I have followed your recommendations to read in 10x data. However, once I run the command getSnpMats10X, I got this error:

Error in getCoverage(snps, bamFile, indexFile, verbose): object 'bamFile' not found
Traceback:

1. getSnpMats10X(snps, bamFiles, indexFiles, barcodes, n.cores = 12)
2. getCoverage(snps, bamFile, indexFile, verbose)
3. pileup(file = bamFile, index = indexFile, scanBamParam = ScanBamParam(which = gr), 
 .     pileupParam = pp)

I really do not understand why it is happening because I am point out to the correct path to the file. Then, I checked the bam file generated by CellRanger and noticed that not all reads have the CB tag. Do you think this can be the source of the problem? Should I filter the bam file manually?

Thanks in advance for your help!

JEFworks commented 4 years ago

Hi @pangxueyu233 , Thanks so much for the catch. The bug has been fixed in commit 6c77f05

@ccruizm Please follow @pangxueyu233 's fix for the bug or copy the updated function with the correct parameter names:

getSnpMats10X <- function(snps, bamFile, indexFile, barcodes, n.cores=1, verbose=FALSE) {

  ## loop
  cov <- getCoverage(snps, bamFile, indexFile, verbose)

  ## any coverage?
  if(verbose) {
    print("Snps with coverage:")
    print(table(cov>0))
  }
  vi <- cov>0
  if(sum(vi)==0) {
    print('ERROR: NO SNPS WITH COVERAGE. Check if snps and bams are using the same alignment reference.')
    return(NULL)
  }
  cov <- cov[vi]
  snps.cov <- snps[vi,]

  if(verbose) {
    print("Getting allele counts...")
  }
  alleleCount <- getCellAlleleCount(snps.cov, bamFile, indexFile, barcodes, verbose=TRUE, n.cores=n.cores)
  refCount <- alleleCount[[1]]
  altCount <- alleleCount[[2]]

  ## check correspondence
  if(verbose) {
    print("altCount + refCount == cov:")
    print(table(altCount + refCount == cov))
    print("altCount + refCount < cov: sequencing errors")
    print(table(altCount + refCount < cov))
    ##vi <- which(altCount + refCount != cov, arr.ind=TRUE)
    ## some sequencing errors evident
    ##altCount[vi]
    ##refCount[vi]
    ##cov[vi]
  }

  results <- list(refCount=refCount, altCount=altCount, cov=cov)
  return(results)
}
davidcoffey commented 4 years ago

As long as the CB tag is present in all reads, it should be fine. To rule out the possibility of this being an issue with R/RSamtools, can you please try the python version for getting the allele counts information: https://github.com/barkasn/scAlleleCount

I had the same issue as reported by @helianthuszhu. I am using Cellranger 3.0 and confirmed that the CB tags are present in my BAM files. I identified the problem to be within the getCellAlleleCount function. The tf variable must be a character vector within a named list. Instead, it is a factor within a named list. So to solve the problem, change tf <- list(cell) to tf <- list(as.character(cell)) within the getCellAlleleCount function.