how to filter out mouse cells to get a matrix only contained human cells from a mixture library by cellranger?

10XGenomics / cellranger

10x Genomics Single Cell Analysis

https://www.10xgenomics.com/support/software/cell-ranger


how to filter out mouse cells to get a matrix only contained human cells from a mixture library by cellranger? #24

Closed mzhoufulai closed 4 years ago

mzhoufulai commented 5 years ago

Hi, I am new to single-cell RNA-seq. I have a library containing a mixture of human and mouse cells; the mouse cells make up only about 2.5%. I ran cellranger count to get the matrix, but the output is a mix of human and mouse. How can I filter out the mouse cells and get a matrix of human cells only? Thanks!

evolvedmicrobe commented 4 years ago

Hi,

Cell Ranger will attempt to identify which species each cell is derived from, but the approach does not work well when the ratio is far from 50:50.

My recommendation would be to use the filtered matrices directly to make your own determination. Instructions on loading the filtered barcode matrix in either R or Python are available here:

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices

Once loaded, genes from either genome are prefixed with the genome name (e.g. mm10 or hg19) and you can sum the count of genes from each genome for each barcode to determine the species based on the relative count of mouse genes to human genes.
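The per-barcode summing described above can be sketched roughly as follows. This is a minimal illustration, not Cell Ranger code: it assumes the matrix has already been loaded into zero-based (feature index, barcode index, count) triplets, which is the shape of the matrix.mtx coordinate format, and that feature names start with the genome name. The function names `per_genome_umis` and `human_fraction`, the toy feature names, and the default prefixes are all hypothetical; substitute the prefixes used by your own reference.

```python
from collections import defaultdict


def per_genome_umis(feature_names, entries, genomes=("GRCh38", "mm10")):
    """Sum UMI counts per barcode index for each genome prefix.

    feature_names: feature names or IDs in matrix row order,
                   e.g. "GRCh38_ACTB" (assumed prefix convention).
    entries: iterable of (feature_index, barcode_index, count), zero-based.
    """
    # Assign each feature row to a genome by name prefix (None if no match).
    row_genome = [
        next((g for g in genomes if name.startswith(g)), None)
        for name in feature_names
    ]
    totals = defaultdict(lambda: dict.fromkeys(genomes, 0))
    for row, col, count in entries:
        genome = row_genome[row]
        if genome is not None:
            totals[col][genome] += count
    return dict(totals)


def human_fraction(counts, human="GRCh38", mouse="mm10"):
    """Fraction of a barcode's UMIs that come from human genes."""
    total = counts[human] + counts[mouse]
    return counts[human] / total if total else 0.0


# Toy data (made up): barcode 0 looks human, barcode 1 looks mouse.
features = ["GRCh38_ACTB", "GRCh38_GAPDH", "mm10___Actb"]
entries = [(0, 0, 900), (1, 0, 300), (2, 0, 6), (0, 1, 4), (2, 1, 800)]

totals = per_genome_umis(features, entries)
for barcode_index, counts in sorted(totals.items()):
    print(barcode_index, counts, round(human_fraction(counts), 3))
```

Which fraction counts as "human" is left to the analyst; plotting human against mouse UMIs per barcode is one way to see whether the two populations separate cleanly before picking a cutoff.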

support@10xgenomics.com can likely provide additional follow up.

Warm wishes, Nigel

YuliaInn commented 4 years ago

I have the same issue. You said we can determine the species based on the relative count of mouse genes to human genes. Is there any threshold or something for that? thank you

ahdee commented 2 years ago

@evolvedmicrobe this is an older closed issue but we are having a similar issue. Can you please elaborate on your solution? For example, you mentioned,

Cell Ranger will attempt to identify which species each cell is derived from, but the approach does not work well when the ratio is far from 50:50.

In this case, do we use the prebuilt human+mouse index? I think on the download page it is called "GRCh38_and_mm10-2020-A_build", file refdata-gex-GRCh38-and-mm10-2020-A.tar.gz

thanks.

evolvedmicrobe commented 2 years ago

@ahdee if you can post this image from the websummary (only for your sample), I might be able to advise on a path forward for you.

ahdee commented 2 years ago

@evolvedmicrobe thanks. OK, here is one of my samples. In the summary, only 0.5% of reads mapped to the mm10 genome while 97% mapped to GRCh38; however, I do see a few mm10 genes with really high logFC. What do you advise?

evolvedmicrobe commented 2 years ago

@ahdee it appears you don't have any mouse cells in that sample. Human UMI counts per barcode are typically >1K, while you don't observe any barcodes with mouse counts >60, and I suspect those are mapping artifacts rather than real mouse DNA. Are you sure you have mouse cells in this sample?

ahdee commented 2 years ago

@evolvedmicrobe thanks. However, I'm still a bit confused. Please see the attached image. I took another sample and aligned it to plain GRCh38, and also to GRCh38_mm10. My questions are:

1. Why does the estimated number of cells go from 7K to 36K?
2. On the scatter plot, what is happening, given that most of the dots are multiplets? In the summary, mm10 Reads Mapped to Genome is only about 0.5%. Should I just drop this and go with the simple GRCh38-only alignment?

Here is the image.

evolvedmicrobe commented 2 years ago

Hi @ahdee, yes, my advice would be to go with just the GRCh38 genome, as you do not appear to have any mouse cells in this data.

Cell calling is done on a per-genome basis, and the first step calls everything within an order of magnitude of the top of the rank plot (e.g. if the UMI count at the top 1% of your barcodes is 10K, everything >=1K will be called as a cell). Because you have no meaningful mouse cells, the top of the mouse rank plot is very low, so barcodes with even < 5 mouse UMIs are counted as cell-associated barcodes. These are nonsense calls that artificially increase your number of cells. These barcodes are then called multiplets because they often have much higher human UMI counts than mouse counts (and I'd basically ignore the multiplet calls for a dataset like this).
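To make the effect concrete, here is a toy sketch of the order-of-magnitude rule as described in this comment. It is a simplification for illustration only, not Cell Ranger's actual cell-calling code; the function name `first_pass_cell_calls`, the `top_fraction` parameter, and all the counts are made up.

```python
def first_pass_cell_calls(umi_counts, top_fraction=0.01):
    """Call barcodes within 10x of the count at the top of the rank plot.

    Simplified version of the rule described above: take the UMI count of
    the barcode sitting at the top `top_fraction` of the ranking and call
    every barcode with at least one tenth of that count.
    """
    ranked = sorted(umi_counts, reverse=True)
    anchor_index = max(1, round(len(ranked) * top_fraction)) - 1
    cutoff = ranked[anchor_index] / 10
    return cutoff, [count >= cutoff for count in umi_counts]


# Made-up per-barcode UMI totals for 100 barcodes, one list per genome.
human_counts = [12000] + [9000] * 49 + [2] * 50
mouse_counts = [30] + [4] * 49 + [1] * 50

human_cutoff, human_calls = first_pass_cell_calls(human_counts)
mouse_cutoff, mouse_calls = first_pass_cell_calls(mouse_counts)

print("human cutoff:", human_cutoff, "barcodes called:", sum(human_calls))
print("mouse cutoff:", mouse_cutoff, "barcodes called:", sum(mouse_calls))
```

With these toy numbers the human cutoff lands in the thousands, while the mouse cutoff lands at a handful of UMIs, so barcodes carrying only background-level mouse counts still pass the mouse threshold.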

ahdee commented 2 years ago

@evolvedmicrobe thanks for such a great explanation. I guess this is why the y-axis for mouse was so low, 0-30 instead of 0-15K. This makes sense to me. Thanks.