Closed johnnyshkwan closed 1 year ago
Hi,
Here is one example using data from GDC. Since legacy is not available the code is outdated. The updated code is below. You will need to a matrix of beta-values as input. It should work with new EPICv2 platform, but I am not 100% sure.
https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/classifiers.html
query <- GDCquery(
project = "TCGA-GBM",
data.category = "DNA Methylation",
barcode = c("TCGA-06-0122","TCGA-14-1456"),
platform = "Illumina Human Methylation 27",
data.type = "Methylation Beta Value"
)
GDCdownload(query)
dnam <- GDCprepare(query)
classification <- gliomaClassifier(dnam)
names(classification)
classification$final.classification
classification$model.classifications
classification$model.probabilities
TCGAquery_subtype("GBM") %>%
dplyr::filter(patient %in% c("TCGA-06-0122","TCGA-14-1456")) %>%
dplyr::select("patient","Supervised.DNA.Methylation.Cluster")
Thanks. But I got the following error
library(TCGAbiolinks) sesameDataCache() betas = openSesame("idat_dir", prep="QCDPB") pred = gliomaClassifier(betas) Plese cite https://doi.org/10.1016/j.cell.2015.12.028 Training model described at https://bioconductor.org/packages/TCGAbiolinksGUI.data/ Error in predict.randomForest(modelFit, newdata): number of variables in newdata does not match that in the training data
Very likely it is missing a probe in the data.
Could you check which probe is missing ? Probably we would need to remove it from the model.
library(TCGAbiolinksGUI.data)
data(glioma.idh.model)
probes_in_training <- glioma.idh.model$trainingData %>%
colnames() %>%
grep("cg",.,value = TRUE)
setdiff(probes_in_training,rownames(betas))
diff.txt Thanks for the quick response! Attached please find the differences
Below I also list the names of the first five probes in the betas matrix. I think that may be helpful. Thanks
cg00000029_TC21 cg00000109_TC21 cg00000155_BC21 cg00000158_BC21 cg00000165_TC21
All the probes are not found. I am not sure if the suffix was added openSesame
, but you will need to remove.
Could you please, try renaming the probes names and then run the prediction.
You should be able to remove the suffix with the following code:
rownames(betas) <- gsub("_.*","",rownames(betas))
I ran your code and the suffix was removed. But I got the error below.
`
pred = gliomaClassifier(betas) Plese cite https://doi.org/10.1016/j.cell.2015.12.028 Training model described at https://bioconductor.org/packages/TCGAbiolinksGUI.data/ [1] "NA columns" [1] "NA values" Error in predict.randomForest(modelFit, newdata): variables in the training data missing in newdata `
Hi,
I got the same error even with EPICv1 array.
I believe the cause of the problem should be related to the betas matrix generation.
Currently, the structure of my betas is as follows:
Sample1 Sample2
cg00000029 NA 0.2885715 cg00000103 NA NA cg00000109 0.9353217 0.8365564 cg00000155 0.9513250 0.9353968 cg00000158 0.9710062 0.9444841 cg00000165 0.2278563 0.5235999
Do you have any clues? Thanks!
@dummyHK I made some changes and also tested with GSE229715 which contains EPICv1 and EPICv2. I had to create the missing probes in the matrix and I set values to 0.5. Since it is a few probes, this should not affect too much the prediction.
Probably the model will need to be updated to use only probes shared between all platforms. But I don't have time right now.
Could you try with the latest version, please ?
devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks",ref = "devel")
For future reference:
library(dplyr)
library(purrr)
library(sesame)
library(caret)
library(randomForest)
devtools::load_all(".")
sesameDataCache()
idat_dir <- "~/Downloads/GSE229715_RAW/"
betas_list <- openSesame(idat_dir, prep = "QCDPB")
class(betas_list)
# Data has epicV1 and epicV2 samples
# test with epicV2 samples
betas_list_df <- purrr::imap(betas_list[1:3],.f = function(betas,name){
df <- data.frame("probe" = gsub("_.*","",names(betas)),betas)
colnames(df)[2] <- name
df <- df[!duplicated(df$probe),]
df
})
betas <- betas_list_df %>% purrr::reduce(dplyr::left_join)
rownames(betas) <- betas$probe
betas$probe <- NULL
pred <- gliomaClassifier(betas)
# test with epicV2 samples
betas_list_df <- purrr::imap(betas_list[31:32],.f = function(betas,name){
df <- data.frame("probe" = gsub("_.*","",names(betas)),betas)
colnames(df)[2] <- name
df <- df[!duplicated(df$probe),]
df
})
betas <- betas_list_df %>% purrr::reduce(dplyr::left_join)
rownames(betas) <- betas$probe
betas$probe <- NULL
pred <- gliomaClassifier(betas)
Thanks! That works.
Hi,
I am interested in using the glioma classifier for my own methylation data (idat files). How can I do that in TCGAbiolinks?
Also, does TCGAbiolinks support glioma classification using methylation data from the new EPICv2 platform?
Thanks JK