BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
284 stars 109 forks source link

Glioma classifier #599

Closed johnnyshkwan closed 9 months ago

johnnyshkwan commented 10 months ago

Hi,

I am interested in using the glioma classifier for my own methylation data (idat files). How can I do that in TCGAbiolinks?

Also, does TCGAbiolinks support glioma classification using methylation data from the new EPICv2 platform?

Thanks JK

tiagochst commented 10 months ago

Hi,

Here is one example using data from GDC. Since legacy is not available the code is outdated. The updated code is below. You will need to a matrix of beta-values as input. It should work with new EPICv2 platform, but I am not 100% sure.

https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/classifiers.html


query <- GDCquery(
    project = "TCGA-GBM",
    data.category = "DNA Methylation",
    barcode = c("TCGA-06-0122","TCGA-14-1456"),
    platform = "Illumina Human Methylation 27",
    data.type = "Methylation Beta Value"
)
GDCdownload(query)
dnam <- GDCprepare(query)

classification <- gliomaClassifier(dnam)
names(classification)
classification$final.classification
classification$model.classifications
classification$model.probabilities

TCGAquery_subtype("GBM") %>%
    dplyr::filter(patient %in% c("TCGA-06-0122","TCGA-14-1456")) %>%
    dplyr::select("patient","Supervised.DNA.Methylation.Cluster")
johnnyshkwan commented 9 months ago

Thanks. But I got the following error

library(TCGAbiolinks) sesameDataCache() betas = openSesame("idat_dir", prep="QCDPB") pred = gliomaClassifier(betas) Plese cite https://doi.org/10.1016/j.cell.2015.12.028 Training model described at https://bioconductor.org/packages/TCGAbiolinksGUI.data/ Error in predict.randomForest(modelFit, newdata): number of variables in newdata does not match that in the training data

tiagochst commented 9 months ago

Very likely it is missing a probe in the data.

Could you check which probe is missing ? Probably we would need to remove it from the model.

library(TCGAbiolinksGUI.data)
data(glioma.idh.model)
probes_in_training <- glioma.idh.model$trainingData %>% 
  colnames() %>% 
  grep("cg",.,value = TRUE)

setdiff(probes_in_training,rownames(betas))
johnnyshkwan commented 9 months ago

diff.txt Thanks for the quick response! Attached please find the differences

Below I also list the names of the first five probes in the betas matrix. I think that may be helpful. Thanks

cg00000029_TC21 cg00000109_TC21 cg00000155_BC21 cg00000158_BC21 cg00000165_TC21

tiagochst commented 9 months ago

All the probes are not found. I am not sure if the suffix was added openSesame, but you will need to remove.

Could you please, try renaming the probes names and then run the prediction.

You should be able to remove the suffix with the following code:

rownames(betas) <- gsub("_.*","",rownames(betas))
johnnyshkwan commented 9 months ago

I ran your code and the suffix was removed. But I got the error below.

`

pred = gliomaClassifier(betas) Plese cite https://doi.org/10.1016/j.cell.2015.12.028 Training model described at https://bioconductor.org/packages/TCGAbiolinksGUI.data/ [1] "NA columns" [1] "NA values" Error in predict.randomForest(modelFit, newdata): variables in the training data missing in newdata `

johnnyshkwan commented 9 months ago

Hi,

I got the same error even with EPICv1 array.
I believe the cause of the problem should be related to the betas matrix generation. Currently, the structure of my betas is as follows:

                                       Sample1                         Sample2

cg00000029 NA 0.2885715 cg00000103 NA NA cg00000109 0.9353217 0.8365564 cg00000155 0.9513250 0.9353968 cg00000158 0.9710062 0.9444841 cg00000165 0.2278563 0.5235999

Do you have any clues? Thanks!

tiagochst commented 9 months ago

@dummyHK I made some changes and also tested with GSE229715 which contains EPICv1 and EPICv2. I had to create the missing probes in the matrix and I set values to 0.5. Since it is a few probes, this should not affect too much the prediction.

Probably the model will need to be updated to use only probes shared between all platforms. But I don't have time right now.

Could you try with the latest version, please ?

devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks",ref = "devel")
tiagochst commented 9 months ago

For future reference:

library(dplyr)
library(purrr)
library(sesame)
library(caret)
library(randomForest)
devtools::load_all(".")
sesameDataCache()
idat_dir <- "~/Downloads/GSE229715_RAW/"
betas_list <- openSesame(idat_dir, prep = "QCDPB")
class(betas_list)
# Data has epicV1 and epicV2 samples

# test with epicV2 samples
betas_list_df <- purrr::imap(betas_list[1:3],.f = function(betas,name){
    df <- data.frame("probe" = gsub("_.*","",names(betas)),betas)
    colnames(df)[2] <- name
    df <- df[!duplicated(df$probe),]
    df
}) 
betas <- betas_list_df %>% purrr::reduce(dplyr::left_join)
rownames(betas) <- betas$probe
betas$probe <- NULL
pred <- gliomaClassifier(betas)

# test with epicV2 samples
betas_list_df <- purrr::imap(betas_list[31:32],.f = function(betas,name){
    df <- data.frame("probe" = gsub("_.*","",names(betas)),betas)
    colnames(df)[2] <- name
    df <- df[!duplicated(df$probe),]
    df
}) 
betas <- betas_list_df %>% purrr::reduce(dplyr::left_join)
rownames(betas) <- betas$probe
betas$probe <- NULL
pred <- gliomaClassifier(betas)
johnnyshkwan commented 9 months ago

Thanks! That works.