Hi @ksaunders73,
This is not a straightforward question to answer.
All of the cluster centroids in the genefu package were derived from RNA microarray data from their respective publications. Because the units of a microarray (fluorescence intensity or intensity ratio) differ from those of RNA sequencing (counts, FPKM, or TPM), it is not clear how your counts/FPKM/TPM values should be processed to be comparable with the array-based cluster centroids.
I recommend reading the PAM50 subtype paper, specifically the Methods section:
van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., & Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530–536. https://doi.org/10.1038/415530a
My understanding is that they used log2-transformed expression ratios for their clustering analysis. Therefore, the centroids of their clusters will also be expressed in these units. From the supplementary Methods section of the aforementioned paper, the expression ratios were calculated as:
the logarithmic transcriptional expression level measured relative to a baseline condition
I was unable to find the definition of the baseline condition in the paper; maybe you can find it? Without knowing what the baseline for the expression ratios was, it is hard to say how to construct an analogous metric from counts/TPM.
My instinct would be to divide the TPM by the average or median for each gene across your patient cohort, but whether this is scientifically valid or not is a call you will need to make. It is possible they used a normal sample for their baseline.
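As a rough illustration of the idea above (dividing each gene's TPM by its cohort median and taking log2), here is a minimal Python sketch. The function name, the pseudocount, and the toy values are assumptions made for illustration only; nothing here comes from genefu or from the cited paper, and whether the cohort median is a valid baseline remains your call.

```python
import math
from statistics import median


def log2_ratio_to_cohort_median(tpm_by_gene, pseudocount=1.0):
    """Return log2((TPM + pseudocount) / (gene median + pseudocount)) per sample.

    tpm_by_gene maps a gene symbol to a list of TPM values, one per sample.
    The pseudocount is an assumption added here to avoid log2(0); it is not
    part of the original suggestion.
    """
    ratios = {}
    for gene, values in tpm_by_gene.items():
        baseline = median(values)  # per-gene baseline across the cohort
        ratios[gene] = [
            math.log2((v + pseudocount) / (baseline + pseudocount))
            for v in values
        ]
    return ratios


# Toy data: two genes measured in three samples (values are made up).
tpm = {"ESR1": [3.0, 7.0, 15.0], "ERBB2": [0.0, 1.0, 3.0]}
ratios = log2_ratio_to_cohort_median(tpm)
print(ratios)
```

A sample at the cohort median gets a ratio of 0, so positive and negative values read as expression above or below the chosen baseline.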
Once you decide how to get a log expression ratio from your Seurat data, you should apply the genefu::rescale function to the expression matrix, since this is what was done for the pam50.robust cluster centroids. It is also worth noting that the molecular.subtyping function always uses the robust variant of the cluster centroid data.
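To give a feel for what a quantile-based rescaling step does to one gene's values across samples, here is a small Python sketch. It is written in the spirit of genefu::rescale as I understand it and is not a port of that function; the helper names and the default q are hypothetical, so please check ?rescale in R for the actual arguments and behaviour.

```python
def quantile(sorted_vals, p):
    """Quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def robust_rescale(values, q=0.05):
    """Map values so the q/2 quantile goes to 0 and the 1 - q/2 quantile to 1.

    With q = 0 this reduces to plain min-max scaling. Values outside the two
    quantiles fall below 0 or above 1. The default q is an assumption.
    """
    s = sorted(values)
    low = quantile(s, q / 2)
    high = quantile(s, 1 - q / 2)
    if high == low:
        raise ValueError("cannot rescale: the two quantiles are equal")
    return [(v - low) / (high - low) for v in values]


# One gene's log ratios across three samples (made-up values).
scaled = robust_rescale([2.0, 4.0, 6.0], q=0)
print(scaled)
```

Using quantiles rather than the raw minimum and maximum makes the scaling less sensitive to a few extreme samples, which is presumably the motivation for a robust variant.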
Information about the different centroids can be found in the genefu package help, e.g. using ?pam50. This will include a reference to the publication from which the centroid data was retrieved.
Given that this package was designed for classifying data from Affymetrix microarrays, I am not sure it is optimal to adapt it for use on RNA sequencing data. You may want to consider an RNA-seq-based clustering algorithm because of the technical considerations above.
Hopefully that helps.
Best, Christopher Eeles Software Developer BHK Lab | PM-Research | UHN
Thank you very much @ChristopherEeles!
Hi @ksaunders73,
I am going to close this issue. If you have further questions feel free to re-open this thread or file a new issue.
Best, Christopher Eeles Software Developer BHK Lab | PM-Research | UHN
Excuse me, how can I use single-cell data for PAM50 analysis? What should the input expression matrix look like, and which normalization method should be used?
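Hello!
Thank you for the excellent package! I would like to use genefu's molecular.subtyping() function (using the pam.50.robust model) on my Seurat object, and was wondering whether the Seurat object should be
- only pre-normalized using NormalizeData()
- additionally scaled using ScaleData() after normalization
Thank you for reading!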
It has come to my attention that the paper I cited above is not the original PAM50 publication. However, the discussion still applies.