diazlab / CONICS

CONICS: COpy-Number analysis In single-Cell RNA-Sequencing

Confusion about the calcNormFactors #3

Closed ysbioinfo closed 6 years ago

ysbioinfo commented 6 years ago

Sorry to bother you again. This is not an issue with the software; I am just confused by part of the CONICSmat guide. There is a step called calcNormFactors, in which you calculate the mean expression level of each cell and use it as the normalization factor. The guide says that "the more genes expressed in one cell, the less reads are "available" per gene". I understand this sentence and think it is reasonable. What confuses me is whether this method (colMeans(suva_expr)) really achieves that goal. Suppose there are two cells, C1 and C2. In C1, 100 genes are expressed, each at an expression level of 10; in C2, 10 genes are expressed, each at an expression level of 100. The average expression level (colMeans) of the two cells is then the same: they get the same normalization factor, although one has 100 genes expressed and the other only 10.

By the way, which subsequent step needs this norm factor? I am a beginner in single-cell RNA-seq. When I worked with bulk RNA-seq data before, for example for differential expression analysis, I remember CPM/RPKM being enough, since it accounts for the different read numbers between samples. I don't remember a step for calculating such a factor (maybe the software hides this step inside its functions and I didn't find it). Please forgive such a basic question. Thanks!
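The C1/C2 scenario can be reproduced in a few lines of R. This is a minimal sketch on a toy matrix: `toy_expr` and its dimensions are made up for illustration, and only `colMeans` comes from the thread.

```r
# Hypothetical toy matrix: 100 genes (rows) x 2 cells (columns)
toy_expr <- matrix(0, nrow = 100, ncol = 2,
                   dimnames = list(paste0("gene", 1:100), c("C1", "C2")))
toy_expr[, "C1"] <- 10        # C1: all 100 genes expressed at level 10
toy_expr[1:10, "C2"] <- 100   # C2: only 10 genes expressed at level 100

norm_factors <- colMeans(toy_expr)       # what calcNormFactors returns
genes_detected <- colSums(toy_expr > 0)  # number of expressed genes per cell

print(norm_factors)    # both cells get the same factor
print(genes_detected)  # although the number of detected genes differs
```

Both cells have the same total expression spread over the same number of rows, so the column means coincide even though the number of detected genes differs tenfold.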

soerenmueller commented 6 years ago

Hi snoopy, that's a great question. And you are right, in the end the function became:

calcNormFactors = function(expmat) {
  n = colMeans(expmat)
  return(n)
}

I had played with several other ways to center the expression in each cell, but this one turned out to be the simplest and most reliable. The basic idea is that in cells where significantly more genes are measured than in others (for example, we measure more expressed genes in cancer cells than in normal cells), the expression per gene, and therefore the average expression from a given chromosome, differs between the two, because NGS is a "competitive process". Anders and Huber have addressed this in their normalization strategy for DESeq.
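For readers unfamiliar with the DESeq strategy mentioned here, it is usually described as "median of ratios". The following is a simplified sketch of that idea on a made-up count matrix; `medianOfRatios` is a hypothetical helper written for illustration, not the DESeq implementation and not part of CONICSmat.

```r
# Hypothetical helper: median-of-ratios size factors for a genes x samples matrix
medianOfRatios <- function(counts) {
  # Use only genes with a nonzero count in every sample
  keep <- apply(counts > 0, 1, all)
  logCounts <- log(counts[keep, , drop = FALSE])
  # Per-gene log geometric mean across samples (the pseudo-reference)
  logGeoMean <- rowMeans(logCounts)
  # Per-sample median of the log ratios to the reference, back-transformed
  exp(apply(logCounts - logGeoMean, 2, median))
}

# Toy data: sample B is sample A sequenced exactly twice as deeply
counts <- cbind(A = c(10, 20, 30, 40), B = c(20, 40, 60, 80))
sizeFactors <- medianOfRatios(counts)
print(sizeFactors)
```

Because the factor is a median over genes, a handful of very highly expressed genes has little influence on it, which is the point of the approach.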

Long story short: By subtracting the average expression we are centering the gene expression values in each cell to account for differences due to the number of genes detected in a cell. This could also be achieved using regression or some similar strategies.
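A minimal sketch of that centering step on a made-up matrix. `centerCells` is a hypothetical helper, not a CONICSmat function; it only shows what subtracting the per-cell mean does.

```r
calcNormFactors <- function(expmat) {
  colMeans(expmat)
}

# Hypothetical helper: subtract each cell's mean from all of its genes
centerCells <- function(expmat) {
  sweep(expmat, 2, calcNormFactors(expmat), "-")
}

set.seed(42)
# Toy data: 50 genes x 4 cells; cell i has its baseline shifted by i
expmat <- matrix(rnorm(200), nrow = 50) + rep(1:4, each = 50)
centered <- centerCells(expmat)

print(round(colMeans(expmat), 2))    # per-cell means differ before centering
print(round(colMeans(centered), 2))  # and are all zero afterwards
```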

Hope that helps, S

ysbioinfo commented 6 years ago

Apologies for the late reply! GitHub did not send me an email... Thanks a lot!

ysbioinfo commented 6 years ago

I have another question: have you tried CONICS on Drop-seq data? Since far fewer genes are detected per cell than with Smart-seq, I wonder whether it is reasonable to infer CNVs from Drop-seq data... Also, if I have multiple samples and want to plot them in one heatmap using plotChromosomeHeatmap, with each patient labeled in a different color to the left of the heatmap, could you tell me how to do this?

diazlab commented 6 years ago

Yes, we've tried it on 10X data: https://doi.org/10.1186/s13059-017-1362-4 It works pretty well. It works best for larger CNVs.

I'll have to think about your labels question...

soerenmueller commented 6 years ago

We've also used it in a second paper that's currently available as a preprint:

https://www.biorxiv.org/content/early/2018/04/12/272328

[Image: preprint_clones]

This is 10X data from a single case in which we identify multiple clones using CONICSmat. As already mentioned, it works better the larger the CNVs are.

As for plotting the patient ID next to the heatmap: There is currently no function for that, but I assume your matrix is ordered by patients? If so, you can simply draw a horizontal line for cells of each patient.

For example (assuming the first 1234 cells of your matrix are from patient1):

abline(h=1234,lty=16,lwd=0.8)
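With more than one patient, the boundary positions can be derived from a per-cell label vector instead of being typed by hand. This is a rough sketch, assuming a hypothetical `patient` vector in the same order as the rows of the plotted matrix, and assuming (as the example above does) that the plot's y axis is in units of cell index.

```r
# Hypothetical per-cell patient labels, in the same order as the matrix rows
patient <- rep(c("patient1", "patient2", "patient3"), times = c(4, 3, 5))

# Index of the last cell of each patient, except the final patient
runs <- rle(patient)
boundaries <- head(cumsum(runs$lengths), -1)
print(boundaries)

# After drawing the heatmap, one horizontal line per boundary:
# abline(h = boundaries, lty = 2, lwd = 0.8)
```

`abline` accepts a vector for `h`, so a single call draws all separators.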

Hope this helps!

ysbioinfo commented 6 years ago

Thank you very much. Actually, my matrix is not ordered by patient. I want to label the patients after clustering and see whether cells from one patient cluster together.

soerenmueller commented 6 years ago

Makes sense! I will try to integrate that functionality into the package. Which function are you using to plot the results where you want the bar for each patient integrated?

ysbioinfo commented 6 years ago

Thanks a lot! I want to plot using plotChromosomeHeatmap and generate a picture like this: [image]
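Until such an option exists, a per-patient color bar next to a clustered heatmap can be sketched with base R's `heatmap()` and its `RowSideColors` argument. This is only an illustration on made-up data: `scores`, `patient`, and the colors are hypothetical, and nothing here is part of `plotChromosomeHeatmap`.

```r
set.seed(1)
# Hypothetical toy data: 30 cells (rows) x 6 chromosomal regions (columns)
scores <- matrix(rnorm(180), nrow = 30,
                 dimnames = list(paste0("cell", 1:30), paste0("region", 1:6)))
# Hypothetical patient label per cell, deliberately not ordered by patient
patient <- sample(rep(c("patient1", "patient2", "patient3"), each = 10))

# One color per patient, then one color per cell
patientPalette <- c(patient1 = "firebrick", patient2 = "steelblue",
                    patient3 = "darkgreen")
cellColors <- unname(patientPalette[patient])

if (interactive()) {
  # Rows are reordered by hierarchical clustering; the side bar follows them
  heatmap(scores, Colv = NA, scale = "none",
          RowSideColors = cellColors, labRow = NA)
}
```

Because the side bar is reordered together with the rows, contiguous blocks of one color indicate that cells from the same patient cluster together.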

ysbioinfo commented 6 years ago

By the way, is the data from https://www.biorxiv.org/content/early/2018/04/12/272328 available now? The expression matrix alone would be enough. Thanks!