BaderLab / netDx

R package with netDx software and data for examples
Other
12 stars 9 forks source link

Trouble with Reproducing Results in Publication #91

Closed pryabinin closed 2 years ago

pryabinin commented 2 years ago

I am trying to reproduce the results from the publication: "netDx: interpretable patient classification using integrated patient similarity networks" (https://www.embopress.org/doi/full/10.15252/msb.20188497). Namely, the part of the paper that deals with classifying samples as Luminal A or Other using a pathway level features (it is under the section "Pathway-level feature selection identifies cellular processes predictive of clinical condition").

I am encountering an issue when I run buildPredictor() where the output's "predictions" slot of the split data is a data frame with zero rows. While buildPredictor completes successfully, running getResults() returns error that is caused by this issue:

Error in ROCR::prediction(dat[, idx1] - dat[, idx2], dat$STATUS == predClasses[1]) :
  Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

The gene expression data is downloaded from here: https://gdc.cancer.gov/about-data/publications/brca_2012

Below is my code, here are direct links to the data: https://api.gdc.cancer.gov/data/3a4451c4-fd0d-4c1e-be49-4fb88bd1beaf https://api.gdc.cancer.gov/data/af7e82cd-acae-4474-b644-9e8e55acfaf1 http://download.baderlab.org/EM_Genesets/February_01_2018/Human/symbol/Human_AllPathways_February_01_2018_symbol.gmt

Create MultiAssayExperiment from data


library(netDx)
library(MultiAssayExperiment)

sample.info <- read.delim("BRCA.datafreeze.20120227.txt",stringsAsFactors=F)
gene.expr <- read.delim("BRCA.exp.348.med.txt",stringsAsFactors = F,row.names=1,check.names = F)
# alter sample names to match the names in the annotation file
names(gene.expr) <- substr(names(gene.expr),1,16)
row.names(sample.info) <- sample.info$Sample

# filter data to only include the samples with gene expression, one sample per patient and order them the same way as the gene expression
sample.info <- sample.info[sample.info$Sample %in% names(gene.expr) &
                             !duplicated(sample.info$bcr_patient_barcode),]
gene.expr <- gene.expr[,match(sample.info$Sample,names(gene.expr))]
gene.expr.list <- list(gene.expr=as.matrix(gene.expr))

mae <- MultiAssayExperiment(experiments=gene.expr.list,
    colData=sample.info)

Prepare data for netDx. Create sample group and sample ID variables

pam50 <- colData(mae)$PAM50
pam50[which(!(pam50 %in% "LumA"))] <- "notLumA"
pam50[which(pam50 %in% "LumA")] <- "LumA"
colData(mae)$pam_mod <- pam50
table(pam50)

# remove hyphens from patient ID
colData(mae)$ID <- colData(mae)$bcr_patient_barcode
colData(mae)$ID <- stringr::str_replace_all(colData(mae)$ID,"-","_")
colData(mae)$STATUS <- colData(mae)$pam_mod

Prepare pathways to group genes

pathway.dat <- readLines("Human_AllPathways_February_01_2018_symbol.gmt")
names(pathway.dat) <- sapply(pathway.dat,function(x) {
  s <- strsplit(x,"\t")
  return(s[[1]][1])
})

pathway.dat.list <- lapply(pathway.dat,function(x) {
  s <- strsplit(x,"\t")
  s <- s[[1]][3:length(s[[1]])]
  #only keep pathway genes that are in the gene expression data
  s <- s[s %in% row.names(assays(mae)[["gene.expr"]])]
  return(s)
})
#remove empty pathways
pathway.dat.list <- pathway.dat.list[sapply(pathway.dat.list,length)>0]

# Rename pathways to unique identifier:
names(pathway.dat.list) <- paste0("pathway_",1:length(pathway.dat.list))
grouplist <- list("gene.expr"=pathway.dat.list)

Define patient similarity measure. Only need one for gene expression

sims <- list("gene.expr"="pearsonCorr")

Build Predictor

nco <- round(parallel::detectCores()*0.50) # use 50% available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))

outDir <- paste0(getwd(),"/luminalA_paperrep")
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)

t0 <- Sys.time()
model <- buildPredictor(
    dataList=mae,          ## your data
    groupList=grouplist,    ## grouping strategy
    sims=sims,
    outDir=outDir,          ## output directory
    trainProp=0.8,          ## pct of samples to use to train model in
                                ## each split
    numSplits=1L,            # just do one split in order to test
    featSelCutoff=9L,       ## threshold for calling something
                                ## feature-selected
    featScoreMax=10L,    ## max score for feature selection
 numCores=nco,            ## set higher for parallelizing
 debugMode=FALSE,
 keepAllData=F 
  )
t1 <- Sys.time()
print(t1-t0)
Predictor started at:
2022-05-13 12:33:23
-------------------------------
# patients = 348
# classes = 2 { notLumA,LumA }
Sample breakdown by class
154194
1 train/test splits
Feature selection cutoff = 9 of 10
Datapoints:
        gene.expr: 17814 units
Similarity metrics provided:
$gene.expr
[1] "pearsonCorr"

-------------------------------

-------------------------------
Train/test split # 1
-------------------------------
         IS_TRAIN
STATUS    TRAIN TEST
  LumA      123   31
  notLumA   155   39
# values per feature (training)
        Group gene.expr: 17814 values
** Creating features
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
** Compiling features

** Running feature selection
        Class: notLumA

nonpred notLumA    <NA>
    123     155       0
        Scoring features
        Writing queries:

                155 IDs; 10 queries (139 sampled, 16 test)
                Q1: 16 test;  139 query
                Q2: 16 test;  139 query
                Q3: 16 test;  139 query
                Q4: 16 test;  139 query
                Q5: 16 test;  139 query
                Q6: 16 test;  139 query
                Q7: 16 test;  139 query
                Q8: 16 test;  139 query
                Q9: 16 test;  139 query
                Q10: 11 test;  144 query
Java 8 detected
QueryRunner time taken: 28.5 s
        Compiling feature scores

        Class: LumA

   LumA nonpred    <NA>
    123     155       0
        Scoring features
        Writing queries:

                123 IDs; 10 queries (110 sampled, 12 test)
                Q1: 12 test;  111 query
                Q2: 12 test;  111 query
                Q3: 12 test;  111 query
                Q4: 12 test;  111 query
                Q5: 12 test;  111 query
                Q6: 12 test;  111 query
                Q7: 12 test;  111 query
                Q8: 12 test;  111 query
                Q9: 12 test;  111 query
                Q10: 15 test;  108 query
Java 8 detected
QueryRunner time taken: 24.3 s
        Compiling feature scores

** Predicting labels for test
notLumA
        2500 feature(s) selected
        notLumA: Create & compile features
        Filter set provided
                gene.expr: 2500 of 3151 nets pass
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
        ** notLumA: Compute similarity
LumA
        2500 feature(s) selected
        LumA: Create & compile features
        Filter set provided
                gene.expr: 2500 of 3151 nets pass
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
        ** LumA: Compute similarity

** Predict labels
Split 1: ACCURACY (N=0 test) = NaN%

----------------------------------------
Predictor completed at:
2022-05-13 12:40:48

Get results

results <- getResults(
    model,
    unique(colData(brca)$STATUS),
    featureSelCutoff=1L,
    featureSelPct=0.70
  )
Detected 1 splits and 2 classes
* Plotting performance
Error in ROCR::prediction(dat[, idx1] - dat[, idx2], dat$STATUS == predClasses[1]) :
  Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

Examine the buildPredictor() output

model[[2]][["predictions"]]
 [1] ID                             Sample
 [3] bcr_patient_barcode            Tissue
 [5] PAM50                          Triple.Negative
 [7] Batch                          Histology
 [9] UNC.microarray                 BCGSC.miRNAseqs
[11] Broad.CN                       Methylation
[13] WU.seq.20111103                RPPA
[15] Integration.Freeze             Integration.RPPA.Freeze.batch1
[17] pam_mod                        STATUS
[19] TT_STATUS                      notLumA_SCORE
[21] LumA_SCORE                     PRED_CLASS
<0 rows> (or 0-length row.names)

Also, I am able to run the vignette successfully.

pryabinin commented 2 years ago

The problem was caused by the fact that the ID variable in colData was not the same as the row names of the sample annotation when the MultiAssayExperiment object was created. In order to fix this problem, two changes need to be made, first change the following code from: colData(mae)$ID <- colData(mae)$bcr_patient_barcode to colData(mae)$ID <- colData(mae)$Sample and then remove the following line entirely: colData(mae)$ID <- stringr::str_replace_all(colData(mae)$ID,"-","_")