bhklab / genefu

R package providing various functions relevant for gene expression analysis with emphasis on breast cancer.

Seurat and genefu #22

Closed ksaunders73 closed 2 years ago

ksaunders73 commented 2 years ago

Hello!

Thank you for the excellent package! I would like to use genefu's molecular.subtyping() function (using the pam.50.robust model) on my Seurat object, and was wondering whether the Seurat object should be

  1. only normalized beforehand with NormalizeData()
  2. additionally scaled after normalization using ScaleData()

Thank you for reading!

ChristopherEeles commented 2 years ago

Hi @ksaunders73,

This is not a straightforward question to answer.

All of the cluster centroids in the genefu package were derived from the RNA microarray data of their respective publications. Because the units of a microarray (fluorescence intensity or intensity ratio) differ from those of RNA sequencing (counts, FPKM, or TPM), it is not clear-cut how your counts/FPKM/TPM values should be processed to be comparable with the array-based cluster centroids.

I recommend reading the PAM50 subtype paper, specifically the Methods section:

van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., & Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530–536. https://doi.org/10.1038/415530a

My understanding is that they used log2-transformed expression ratios to conduct their clustering analysis. Therefore the centroids of their clusters will also be expressed in these units. According to the supplementary Methods section of the aforementioned paper, the expression ratios were calculated as:

the logarithmic transcriptional expression level measured relative to a baseline condition

I was unable to find the definition of the baseline condition in the paper; maybe you can find it? Without knowing what the baseline for the expression ratios was, it is hard to say how to construct an analogous metric from counts/TPM.

My instinct would be to divide the TPM by the average or median for each gene across your patient cohort, but whether this is scientifically valid or not is a call you will need to make. It is possible they used a normal sample for their baseline.
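As a minimal sketch of that idea (not a validated method), the per-gene median baseline could be computed in base R as follows. The toy `tpm` matrix, the gene subset, and the pseudocount of 1 are all assumptions made for illustration; none of them come from the paper or from genefu.

```r
# Hypothetical TPM matrix: genes in rows, samples (or cells) in columns.
tpm <- matrix(
  c(10, 20, 40,
     5,  5, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ESR1", "ERBB2"), c("s1", "s2", "s3"))
)

# Assumed pseudocount; avoids log2(0) and division by zero.
pseudocount <- 1

# Per-gene baseline: the median of each gene across the cohort.
baseline <- apply(tpm, 1, median)

# log2 ratio of every value to its gene's baseline (sweep divides row-wise).
log_ratio <- log2(sweep(tpm + pseudocount, 1, baseline + pseudocount, "/"))

# genefu expects samples in rows and genes in columns, so transpose.
log_ratio_t <- t(log_ratio)
```

Swapping `median` for `mean` gives the average-based variant. If a normal sample turns out to be the right baseline, `baseline` would instead be taken from that sample's column.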

Once you decide on how to get a log expression ratio from your Seurat data, you should apply the genefu::rescale function to the expression matrix since this is what has been done for the pam50.robust cluster centroids. It is also worth noting that the molecular.subtyping function always uses the robust variant of the cluster centroid data.
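A rough sketch of the rescaling step, assuming you already have a samples-by-genes matrix of log expression ratios. The random values, the three-gene subset, and the choice of `q = 0.05` are illustrative assumptions, not settings confirmed anywhere in this thread; check `?rescale` before relying on them.

```r
# Hypothetical log2 expression ratios: samples in rows, genes in columns.
set.seed(1)
log_ratio <- matrix(
  rnorm(5 * 3),
  nrow = 5,
  dimnames = list(paste0("sample", 1:5), c("ESR1", "ERBB2", "MKI67"))
)

# Rescale each gene (column) separately with genefu::rescale.
# q = 0.05 is an assumed quantile setting, chosen only for illustration.
rescaled <- apply(
  log_ratio, 2,
  function(x) genefu::rescale(x, na.rm = TRUE, q = 0.05)
)
```

The result keeps the samples-by-genes orientation, which is the layout genefu documents for the `data` argument of its prediction functions.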

Information about different centroids can be found in the genefu package help, e.g. using ?pam50. This will include a reference to the publication from which the centroid data was retrieved.

Given that this package was designed for classifying data from Affymetrix microarrays, I am not sure it is optimal to adapt it for use on RNA sequencing data. You may want to consider an RNA-seq-based clustering algorithm because of the technical considerations above.

Hopefully that helps.

Best, Christopher Eeles Software Developer BHK Lab | PM-Research | UHN

ksaunders73 commented 2 years ago

Thank you very much @ChristopherEeles!

ChristopherEeles commented 2 years ago

Hi @ksaunders73,

I am going to close this issue. If you have further questions feel free to re-open this thread or file a new issue.

Best, Christopher Eeles Software Developer BHK Lab | PM-Research | UHN

zhangjl-work commented 2 years ago

Excuse me, how can single-cell data be used for PAM50 analysis? What should the input expression matrix look like, and which normalization method should be used?

zhangjl-work commented 2 years ago

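Hello!

Thank you for the excellent package! I would like to use genefu's molecular.subtyping() function (using the pam.50.robust model) on my Seurat object, and was wondering whether the Seurat object should be

  1. only normalized beforehand with NormalizeData()
  2. additionally scaled after normalization using ScaleData()

Thank you for reading!

Excuse me, how can single-cell data be used for PAM50 analysis? What should the input expression matrix look like, and which normalization method should be used?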

ChristopherEeles commented 2 years ago

It has come to my attention that the paper I cited above is not the original PAM50 publication. However, the discussion still applies.