bhklab / genefu

R package providing various functions relevant for gene expression analysis with emphasis on breast cancer.

preprocessing of test data for genefu #28

Open rocanja opened 2 years ago

rocanja commented 2 years ago

Dear genefu team,

Thank you for providing this great resource! I have a few questions regarding the preprocessing of test data for use with genefu, in particular for PAM50 classification. I understand that there's no 'one-size-fits-all' approach, but I would appreciate your input and recommendations.

1) Based on the examples in the vignette, I assume the test data matrix is expected to contain all genes/probes on the array, and the 'molecular.subtyping' function then extracts the 50 PAM50 genes based on the provided annotation? How is the data from multiple probes collapsed to gene level? If the input data matrix contained only those 50 genes, would that affect the classification results?

2) Since PAM50 is based on microarray data, I assume the test data is expected to be log2 intensities. Is any normalisation of the test data expected or recommended before input into genefu, e.g., quantile normalisation across samples and/or gene-wise scaling? Or does the 'molecular.subtyping' function do any required normalisation 'under the hood'?

3) In Cascianelli et al 2020, the authors state that "before calculating distances from subtype centroids, gene expression values for each sample must be transformed into Log2ratios against a reference sample, to be defined for each dataset. Typically, to avoid representation bias, such reference is constructed within the dataset by calculating for each gene the median across a subset of samples with a fixed proportion (60/40) of Estrogen Receptor-positive (ER+) and -negative (ER−) cases, as done for the original PAM50 training." Is this something genefu does in the background or is the user expected to provide the input data matrix as log2ratios to a 'reference' as described above?

4) I have seen the other entries regarding RNA-seq data input and appreciate that PAM50 was not designed for RNA-seq data. However, in your Fumagalli et al 2014 publication, you used log2(FPKM+1) values. Would you consider this a good starting point for classifying RNA-seq data with genefu, or has your view on this changed since then?

5) What's the difference between molecular.subtyping(), intrinsic.cluster.predict() and subtype.cluster.predict()? Are the latter two just older/deprecated versions of the first function?
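To make the procedure quoted in question 3 concrete, here is a minimal base-R sketch of a log2 ratio against an ER-balanced reference. It is only an illustration of the quoted description, not of anything genefu is known to do. The toy matrix, the `er_status` vector, and the choice to take the first matching samples (rather than, say, a random draw) are all assumptions made for the example.

```r
# Toy log2 expression matrix: genes in rows, samples in columns (assumed layout).
log2_expr <- rbind(
  geneA = c(8, 9, 10, 11, 4, 5, 6),
  geneB = c(1, 1, 1, 1, 1, 1, 1)
)
colnames(log2_expr) <- paste0("sample", 1:7)

# Hypothetical ER status per sample.
er_status <- c("pos", "pos", "pos", "pos", "neg", "neg", "neg")

# Largest reference subset with (approximately) 60% ER+ and 40% ER-.
n_pos <- sum(er_status == "pos")
n_neg <- sum(er_status == "neg")
n_ref_neg <- min(n_neg, floor(n_pos * 40 / 60))
n_ref_pos <- round(n_ref_neg * 60 / 40)

# Illustrative selection: the first n samples of each group.
ref_idx <- c(head(which(er_status == "pos"), n_ref_pos),
             head(which(er_status == "neg"), n_ref_neg))

# Per-gene median across the reference subset.
reference <- apply(log2_expr[, ref_idx, drop = FALSE], 1, median)

# On the log2 scale, a ratio against the reference is a subtraction.
log2_ratio <- sweep(log2_expr, 1, reference, "-")
```

Here the reference uses three ER+ and two ER- samples, so each gene is centred on the median of those five values.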

Kind regards, rocanja

ChristopherEeles commented 2 years ago

Hi @rocanja,

There is a lot to unpack here, so I will probably need to address your questions one at a time.

RE: (1), if you view the function documentation by running ?molecular.subtyping in the R console, or consult the PDF manual on Bioconductor, you will see that in cases of ambiguity (i.e., several probes mapping to the same gene) the most variant probe is kept.

For annotation, this package only ever considers the genes in the signature during classification. The data matrix is subset to the genes in the signature internally. As a result, your annotation file just needs to map the rownames of your input data (i.e., the feature names), given in a column called "probe", to Entrez gene IDs in a column called "Entrez.ID".
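As a rough illustration of the two points above (keeping the most variant probe per gene, then subsetting to the signature genes), here is a base-R sketch. It does not call genefu. The toy matrix, the `signature_ids` vector, and the probes-in-rows layout are assumptions for the example. The column names follow the description above; the exact name genefu expects should be confirmed in ?molecular.subtyping (the manual appears to document it as "EntrezGene.ID").

```r
# Toy expression matrix: probes in rows, samples in columns (assumed layout).
expr <- rbind(
  probeA = c(1, 2, 3, 4),    # gene 100, lower variance
  probeB = c(1, 5, 9, 13),   # gene 100, higher variance, so this one is kept
  probeC = c(2, 2, 3, 3),    # gene 200, its only probe
  probeD = c(7, 8, 7, 8)     # gene 300, not in the signature
)
colnames(expr) <- paste0("sample", 1:4)

# Annotation mapping each probe to an Entrez gene ID.
annot <- data.frame(
  probe = rownames(expr),
  Entrez.ID = c("100", "100", "200", "300"),
  stringsAsFactors = FALSE
)

# Hypothetical signature: the Entrez IDs the classifier would use.
signature_ids <- c("100", "200")

# Variance of each probe across samples.
probe_var <- apply(expr, 1, var)

# For each gene, the name of its most variant probe.
best_probe <- vapply(
  split(annot$probe, annot$Entrez.ID),
  function(p) p[which.max(probe_var[p])],
  character(1)
)

# Gene-level matrix restricted to the signature genes.
collapsed <- expr[unname(best_probe[signature_ids]), , drop = FALSE]
rownames(collapsed) <- signature_ids
```

In this toy case gene 100 is represented by probeB, and probeD is dropped because its gene is not in the signature.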

To normalize your data, use the rescale method, which uses a form of quantile normalization. You should then use the .robust version of the signatures, which have already been normalized with the same method. I recommend printing out the signature of interest, since it contains information about how it was processed, as well as the map that will be used to get from "Entrez.ID" to the signature-specific identifiers.
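To give an idea of what quantile-based rescaling looks like, here is a base-R sketch. `rescale_sketch` is a hypothetical helper written for this example, not genefu's own rescale, and the default `q = 0.05` is an assumption. It maps the q/2 and 1 - q/2 quantiles of each gene to 0 and 1, so that a few extreme samples do not dominate the scale.

```r
# Hypothetical helper: rescale a numeric vector so that its q/2 and
# 1 - q/2 quantiles map to 0 and 1 (values outside that range fall
# below 0 or above 1).
rescale_sketch <- function(x, q = 0.05) {
  lo <- quantile(x, probs = q / 2, na.rm = TRUE, names = FALSE)
  hi <- quantile(x, probs = 1 - q / 2, na.rm = TRUE, names = FALSE)
  (x - lo) / (hi - lo)
}

# Toy matrix: genes in rows, samples in columns (assumed layout).
expr <- rbind(
  geneA = c(2, 4, 6, 8, 10),
  geneB = c(0, 1, 1, 2, 50)   # one extreme sample
)

# Rescale each gene across samples; apply() returns samples in rows,
# so transpose back to genes in rows.
scaled <- t(apply(expr, 1, rescale_sketch))
```

With `q = 0` the helper reduces to plain min-max scaling. In practice the package's own rescale function should be used, so that the data is processed the same way as the .robust signatures.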

I will try to address your other questions shortly.

Best,
Christopher Eeles
Software Developer, Haibe-Kains Lab
PM-Research | UHN

rocanja commented 2 years ago

Thank you so much for your feedback so far and for your time. Looking forward to learning more.