broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

Filtering of genes on Step 2 of `infercnv::run()` #622

Open jemorlanes opened 10 months ago

jemorlanes commented 10 months ago

Hey all!

Super cool package :). I have a question about how genes with low expression are removed when running inferCNV. This is controlled by the `cutoff` argument, which is described as keeping genes that are expressed above the specified threshold in the reference cells. I am using `cutoff = 0.1`.

I am currently working with 2 datasets: my reference dataset and my query dataset. The reference dataset is considerably sparser than my query. When I run my reference dataset by itself (`ref_group_names = NULL` in `CreateInfercnvObject()`), only around 1000 genes make it past the cutoff:

`STEP 02: Removing lowly expressed genes

INFO [2023-11-29 16:27:16] ::above_min_mean_expr_cutoff:Start
INFO [2023-11-29 16:27:16] Removing 15837 genes from matrix as below mean expr threshold: 0.1
INFO [2023-11-29 16:27:16] validating infercnv_obj
INFO [2023-11-29 16:27:16] There are 1100 genes and 202 cells remaining in the expr matrix.
INFO [2023-11-29 16:27:16] no genes removed due to min cells/gene filter
INFO [2023-11-29 16:27:16] `

However, when I use these cells as a reference to infer CNVs in my query data, roughly 11000 genes satisfy the cutoff condition (11069 in the log below). Since `cutoff` should remove genes based on the reference cells, I would expect only around 1000 genes to make it through, but this is not the case.

`STEP 02: Removing lowly expressed genes

INFO [2023-11-29 17:38:09] ::above_min_mean_expr_cutoff:Start
INFO [2023-11-29 17:38:10] Removing 1521 genes from matrix as below mean expr threshold: 0.1
INFO [2023-11-29 17:38:10] validating infercnv_obj
INFO [2023-11-29 17:38:10] There are 11069 genes and 2326 cells remaining in the expr matrix.
INFO [2023-11-29 17:38:13] no genes removed due to min cells/gene filter
INFO [2023-11-29 17:38:42] `

Out of curiosity, I plotted the average count of each gene within (a) my reference dataset, (b) my query dataset, and (c) the combined reference + query dataset, and checked how many genes had an average expression greater than 0.1:

  1. Reference dataset: 1154 genes above 0.1
  2. Query dataset: 8713 genes above 0.1
  3. Reference + query datasets: 8713 genes above 0.1
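A comparison of this kind can be sketched in base R as shown here. This is only a minimal illustration on a made-up toy matrix: it uses neither the data from this issue nor inferCNV's internal filtering code, and `counts`, `is_reference`, and `count_genes_above` are hypothetical names chosen for the example.

```r
# Toy genes x cells count matrix (hypothetical data, not from this issue).
counts <- matrix(
  c(0, 0, 0, 4, 5, 6,   # geneA: expressed only in query cells
    0, 0, 0, 0, 1, 0,   # geneB: weakly expressed in query cells
    1, 1, 1, 0, 0, 0,   # geneC: expressed only in reference cells
    0, 0, 0, 0, 0, 0),  # geneD: not expressed anywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("cell", 1:6))
)

# Hypothetical annotation: the first three cells are reference, the rest query.
is_reference <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

cutoff <- 0.1

# Number of genes whose mean count across the given cells exceeds the cutoff.
count_genes_above <- function(mat, cutoff) {
  sum(rowMeans(mat) > cutoff)
}

n_above <- c(
  reference = count_genes_above(counts[, is_reference, drop = FALSE], cutoff),
  query     = count_genes_above(counts[, !is_reference, drop = FALSE], cutoff),
  combined  = count_genes_above(counts, cutoff)
)
print(n_above)
```

In this toy case the three counts differ, because the set of genes passing a mean-expression threshold depends on which cells the mean is taken over.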

I am a bit confused by these discrepancies. I might be missing something in how inferCNV curates the gene list, so I was wondering if you could point me in a direction that might explain these mismatches.

Thank you!! :)