broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq

how to handle data from multiple patients #429

Open rpeys opened 2 years ago

rpeys commented 2 years ago

Hello, I'm working with data from multiple patients and I'm wondering if infercnv has a way to handle paired tumor/normal from across multiple patients + healthy donors. Some patients contribute both normal and malignant cells. Right now (following the wiki's example), my input does not include which patient the cells come from, which seems like a shame.

Either way, I could use some help understanding how to generate my output of interest. I am inputting many normal/malignant cells from across different patients. I am specifically trying to understand the subclonal structure for one particular patient. Should I run infercnv::run() on all the data together, and then only visualize my patient of interest, or should I subset my input to my patient of interest + healthy donors and then run the infercnv algorithm?

In my digging around, I found the "cluster_by_groups" arg in run(), but I'm not sure how to supply the groups. I also found the "plot_per_group" function, but don't know how to supply the groups. Any pointers or examples would be appreciated! Thank you!

rpeys commented 2 years ago

Update: I found useful info about how to annotate patient IDs here: https://github.com/broadinstitute/inferCNV/wiki/File-Definitions#sample-annotation-file

I still have these two outstanding questions: 1) Is there a way for the algorithm to take advantage of paired tumor/normal samples from patients? 2) If I am interested in one individual patient's subclonal structure, is the proper approach to run infercnv on just the normal samples + that one patient (I assume this will be faster than running on the full data)? Will I get the same results if I run on the full data, and then subset my visualization to my one sample of interest?

Thanks, Rebecca

GeorgescuC commented 2 years ago

Hi @rpeys ,

If you have paired tumor/normal samples from the same patient, you can simply run infercnv on those samples alone, using the normal as reference and the tumor as observations. If you provide data for other patients at the same time there is no way to specifically match one reference to one tumor.
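As a rough illustration of restricting the input to one patient, here is a minimal Python sketch that filters a sample annotation table (tab-delimited, no header: cell name, then group label) down to one patient's cells plus healthy-donor references. The cell names, the patient ID `P1`, and the labels `normal_P1` and `normal_donor` are hypothetical; only the `malignant_{patient}` pattern comes from the wiki. The counts matrix would need to be subset to the same cells, which is not shown here.

```python
import csv
import io


def subset_annotations(rows, patient, reference_labels):
    """Keep cells whose group is a reference label or ends with _<patient>."""
    suffix = "_" + patient
    return [
        (cell, group)
        for cell, group in rows
        if group in reference_labels or group.endswith(suffix)
    ]


def to_tsv(rows):
    """Serialize (cell, group) pairs as tab-delimited text without a header."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical annotations covering two patients and one healthy donor.
annotations = [
    ("cell_001", "malignant_P1"),
    ("cell_002", "malignant_P2"),
    ("cell_003", "normal_P1"),
    ("cell_004", "normal_donor"),
]

# Keep patient P1 (tumor and matched normal) plus the healthy-donor cells.
patient_rows = subset_annotations(annotations, "P1", {"normal_donor"})
patient_tsv = to_tsv(patient_rows)
print(patient_tsv, end="")
```

The normal labels that survive the filter (here `normal_P1` and `normal_donor`) would then be the ones given as reference groups when creating the infercnv object.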

It will indeed be faster to run infercnv on that one patient's data alone. If your different samples are of a similar type, the results should be largely similar, although they likely won't be identical, because the initial filtering of lowly expressed or unexpressed genes is affected by how many cells you have overall and by the mixing proportions of the different cell types. Since you are interested in subclonal structure, I strongly recommend using the analysis_mode="subclusters" option if you do decide to run the HMM; otherwise the predictions likely won't be accurate.

If you decide to run infercnv with multiple patients' data together, besides the cluster_by_groups option you mentioned, the plot_per_group() plotting method can plot each group on a different figure, and the dynamic_resize option for plotting methods can make the observation heatmap taller to better see results when you have large datasets.

Regards, Christophe.

Cristinex commented 1 year ago

Hi @GeorgescuC,

Thank you for developing and maintaining inferCNV.

As the wiki mentions (https://github.com/broadinstitute/inferCNV/wiki/File-Definitions#sample-annotation-file):

The sample (ie. patient) information is encoded in the attribute name as "malignant_{patient}", which allows the tumor cells to be clustered and plotted according to sample (patient) in the heatmap.

If I have different cell types annotated together with the patient information, so that an annotation becomes "Epithelial_{patient1}", is there a way to get heatmaps clustered both without and with the patient information? If so, could you please provide the detailed commands for CreateInfercnvObject() and run()?

Best Regards, Cristine

GeorgescuC commented 1 year ago

Hi @Cristinex ,

If you use cluster_by_groups=FALSE, the patient information will not be used for clustering, but neither will the cell type information. If that is fine for your use case, then that's an option.

If you want to use the cell type information but not the patient information, the easiest would be to have an alternate annotation file where you only have the cell type as annotation, and use that as input when creating a different object. You can use the same run() command for both.

In either case, I recommend using different output folders for the different runs so you do not overwrite results from the first run with those of the second.
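To make the two-annotation-file idea concrete, here is a small Python sketch that derives a cell-type-only annotation table from labels of the form `{celltype}_{patient}` and writes each variant into its own folder. The cell names, patient IDs, folder names, and the file name `annotations.txt` are all hypothetical, and the sketch only prepares the input files; it does not call infercnv.

```python
import csv
import tempfile
from pathlib import Path


def strip_patient(group, patients):
    """Drop a trailing _<patient> from a group label, if present."""
    for patient in patients:
        suffix = "_" + patient
        if group.endswith(suffix):
            return group[: -len(suffix)]
    return group


# Hypothetical annotations: cell name, then "{celltype}_{patient}".
annotations = [
    ("cell_001", "Epithelial_patient1"),
    ("cell_002", "Epithelial_patient2"),
    ("cell_003", "Fibroblast_patient1"),
]
patients = ["patient1", "patient2"]

# Alternate annotation with the cell type only.
cell_type_only = [(cell, strip_patient(group, patients)) for cell, group in annotations]

# One folder per variant, so the second run cannot overwrite the first.
runs = {
    "out_celltype_patient": annotations,
    "out_celltype_only": cell_type_only,
}

workdir = Path(tempfile.mkdtemp())
for name, rows in runs.items():
    out_dir = workdir / name
    out_dir.mkdir()
    with open(out_dir / "annotations.txt", "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
```

Each annotation file would then be used to create its own infercnv object, with the matching folder as that run's output directory.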

Regards, Christophe.