How to deal with multiple patients?

Hi @zhouzhendiao ,

Whether you run infercnv on each patient separately or on all patients combined should only affect two steps, as long as you use the same set of reference cells: the initial filtering of genes expressed below the filtering threshold, and the counts normalization that follows. I would always use the combined set for the reference cells. One workaround for the gene filtering would be to:

1. Run infercnv's step 2 on the combined data.
2. Split the matrix into per-patient matrices (the references should be duplicated in each).
3. Run infercnv per patient using those matrices as input, with all filtering options disabled so that no further filtering is done.

For the normalization, however, the option to manually set a normalization factor is not exposed, so you would need to clone the GitHub repo and manually edit the step 3 call to normalize_counts_by_seq_depth so that each per-patient run uses the normalization_factor calculated on the combined data.
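To make the split and the shared factor concrete, here is a minimal sketch in plain Python rather than infercnv's R code. Everything in it is illustrative: the function names and the toy data are hypothetical, and defining the factor as the median of per-cell total counts is an assumption about what the normalization step computes, so check it against the source before relying on it.

```python
from statistics import median


def split_per_patient(cell_to_patient, reference_cells):
    """Return {patient: [cell ids]} with the shared references first in each list."""
    refs = list(reference_cells)
    ref_set = set(refs)
    per_patient = {}
    for cell, patient in cell_to_patient.items():
        if cell in ref_set:
            continue  # references are added to every patient below
        per_patient.setdefault(patient, []).append(cell)
    return {patient: refs + cells for patient, cells in per_patient.items()}


def depth_factor(counts):
    """Median of per-cell total counts over ALL cells (assumed definition)."""
    return median(sum(genes.values()) for genes in counts.values())


def normalize(counts, cells, factor):
    """Scale each selected cell so its counts sum to `factor`."""
    out = {}
    for cell in cells:
        total = sum(counts[cell].values())
        scale = factor / total if total else 0.0
        out[cell] = {gene: value * scale for gene, value in counts[cell].items()}
    return out


# Toy combined matrix after gene filtering: {cell: {gene: count}}.
counts = {
    "ref1": {"A": 2, "B": 2},
    "ref2": {"A": 1, "B": 5},
    "p1_c1": {"A": 4, "B": 4},
    "p2_c1": {"A": 10, "B": 0},
}
groups = split_per_patient({"p1_c1": "P1", "p2_c1": "P2"}, ["ref1", "ref2"])

# The factor is computed once on the combined data and reused for every patient.
factor = depth_factor(counts)
normalized = {p: normalize(counts, cells, factor) for p, cells in groups.items()}
```

Because the factor comes from the combined data, the reference cells get identical normalized values in every per-patient run, which is the point of the workaround.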

From my runs, however, 60k cells x 8000 genes should not require 400GB+ of RAM if you use the Leiden subclustering option with a fitting leiden_resolution. Too high a resolution can produce too many very small subclusters, and thus an absurdly high number of CNV regions that need to be evaluated by the costly Bayesian network step. If you are not already using a sparse matrix as input, there is also a script available that first makes a sparse matrix out of your input matrix on disk (and reads it much faster too), which reduces the starting memory size before filtering and until smoothing.
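For a rough sense of scale, the input matrix itself is small compared to 400GB. The back-of-the-envelope sketch below compares a dense matrix of 8-byte values with a compressed sparse column layout (8-byte values, 4-byte row indices, 4-byte column pointers). The 10% density is a hypothetical figure chosen for illustration, not a measurement, and the helper names are made up for this example.

```python
def dense_bytes(n_genes, n_cells, bytes_per_value=8):
    """Memory for a dense genes x cells matrix of floating-point values."""
    return n_genes * n_cells * bytes_per_value


def csc_bytes(n_genes, n_cells, density, bytes_per_value=8, bytes_per_index=4):
    """Memory for a compressed sparse column matrix (cells as columns)."""
    nnz = round(n_genes * n_cells * density)  # number of stored non-zero entries
    values_and_rows = nnz * (bytes_per_value + bytes_per_index)
    column_pointers = (n_cells + 1) * bytes_per_index
    return values_and_rows + column_pointers


dense = dense_bytes(8000, 60000)
sparse = csc_bytes(8000, 60000, density=0.10)  # assumed density, adjust to your data

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
```

Under these assumptions the dense matrix is under 4 GB, so memory use in the hundreds of GB is more likely to come from later steps, such as evaluating a very large number of CNV regions, than from the input matrix.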

Regards, Christophe.

broadinstitute / infercnv

How to deal with multiple patients? #610