bhklab / genefu

R package providing various functions relevant for gene expression analysis with emphasis on breast cancer.

Pam50 and Pam50.robust models are identical #1

Closed (rvimieiro closed this issue 2 years ago)

rvimieiro commented 9 years ago

Hi,

There might be a mistake in the pam50 and pam50.robust models. The help page states:

pam50: Use of the official centroids without scaling of the gene expressions.

pam50.scale: Use of the official centroids with traditional scaling of the gene expressions (see scale).

pam50.robust: Use of the official centroids with robust scaling of the gene expressions (see rescale).

However, the models differ from each other only in the standardization attribute (std). The following code

# Compare pam50 and pam50.robust element by element
sapply(names(genefu::pam50),
       function(x) identical(genefu::pam50[x], genefu::pam50.robust[x]))

results in

      method.cor method.centroids              std        rescale.q             mins 
            TRUE             TRUE            FALSE             TRUE             TRUE 
       centroids    centroids.map 
            TRUE             TRUE

The question is: these centroids match the ones found at https://genome.unc.edu/pubsup/breastGEO/pam50_centroids.txt, but are they the scaled or the unscaled version?

Having the models exactly the same (assuming they are supposed to be the same), except for the standardization parameter, is misleading: people might use the non-standardized model with data they have already standardized (or vice versa) and get completely wrong results.

I hope this helps!

Regards

Renato

Update:

The same happens with pam50.scale, indicating that all three models are identical except for the std variable.

bhaibeka commented 9 years ago

The models are identical, but the way new datasets are rescaled for prediction is different. Ideally, the centroids should have been re-estimated from the dataset of Parker et al. (2009), but this was not done.

rvimieiro commented 9 years ago

Thanks for the reply, I see what you mean. At first, I had the impression that you had re-estimated the centroids from data rescaled with each of the methods. In fact, the different models are just shortcuts for rescaling the data before predicting labels.
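
For anyone landing here later, a minimal base-R sketch of that point. It is a toy illustration, not genefu's implementation: the centroid values and data are made up (not the PAM50 centroids), `standardize` and `predict_subtype` are hypothetical helpers, and the "robust" branch is only an assumed quantile-based rescaling in the spirit of rescale. Only the field names `centroids` and `std` come from the output above. The three toy models share identical centroids and differ only in `std`, yet the predicted labels change because the input data are rescaled differently before the nearest-centroid step.

```r
# Toy centroids (genes x subtypes); hypothetical values, NOT the PAM50 centroids.
centroids <- cbind(A = c(2, 1, -1, -2), B = c(-2, -1, 1, 2))
rownames(centroids) <- paste0("g", 1:4)

# Three toy "models": same centroids, different std field only.
models <- list(
  none   = list(centroids = centroids, std = "none"),
  scale  = list(centroids = centroids, std = "scale"),
  robust = list(centroids = centroids, std = "robust")
)

# Toy data (samples x genes): subtype signal plus a gene-specific baseline.
baseline <- c(g1 = 10, g2 = 20, g3 = 30, g4 = 40)
signal <- rbind(
  s1 = centroids[, "A"],
  s2 = 0.5 * centroids[, "A"],
  s3 = centroids[, "B"],
  s4 = 0.5 * centroids[, "B"]
)
x <- sweep(signal, 2, baseline, "+")

# Hypothetical helper: standardize each gene (column) across samples.
standardize <- function(x, std) {
  switch(std,
    none   = x,
    scale  = scale(x, center = TRUE, scale = TRUE),
    # Assumed robust variant: map the 2.5%-97.5% quantile range to [-0.5, 0.5].
    robust = apply(x, 2, function(g) {
      q <- unname(quantile(g, probs = c(0.025, 0.975), na.rm = TRUE))
      (g - q[1]) / (q[2] - q[1]) - 0.5
    }),
    stop("unknown std: ", std)
  )
}

# Hypothetical helper: assign each sample to the centroid with the
# highest Spearman correlation, after standardizing as the model asks.
predict_subtype <- function(x, model) {
  genes <- rownames(model$centroids)
  xs <- standardize(x[, genes, drop = FALSE], model$std)
  rho <- cor(t(xs), model$centroids, method = "spearman")
  colnames(rho)[apply(rho, 1, which.max)]
}

# One column of predicted labels per model, one row per sample.
preds <- sapply(models, function(m) predict_subtype(x, m))
print(preds)
```

In this toy setup the unscaled model is dominated by the gene baselines and labels every sample "B", while both scaled variants recover "A" for s1 and s2. That is the failure mode described in the original report: choosing a model whose `std` setting does not match how the data were already processed can silently change the calls.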