Trouble with Reproducing Results in Publication

I am trying to reproduce the results from the publication: "netDx: interpretable patient classification using integrated patient similarity networks" (https://www.embopress.org/doi/full/10.15252/msb.20188497). Namely, the part of the paper that deals with classifying samples as Luminal A or Other using a pathway level features (it is under the section "Pathway-level feature selection identifies cellular processes predictive of clinical condition").

I am encountering an issue when I run buildPredictor() where the output's "predictions" slot of the split data is a data frame with zero rows. While buildPredictor completes successfully, running getResults() returns error that is caused by this issue:

Error in ROCR::prediction(dat[, idx1] - dat[, idx2], dat$STATUS == predClasses[1]) :
  Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

The gene expression data is downloaded from here: https://gdc.cancer.gov/about-data/publications/brca_2012

Below is my code, here are direct links to the data: https://api.gdc.cancer.gov/data/3a4451c4-fd0d-4c1e-be49-4fb88bd1beaf https://api.gdc.cancer.gov/data/af7e82cd-acae-4474-b644-9e8e55acfaf1 http://download.baderlab.org/EM_Genesets/February_01_2018/Human/symbol/Human_AllPathways_February_01_2018_symbol.gmt

Create MultiAssayExperiment from data


library(netDx)
library(MultiAssayExperiment)

sample.info <- read.delim("BRCA.datafreeze.20120227.txt",stringsAsFactors=F)
gene.expr <- read.delim("BRCA.exp.348.med.txt",stringsAsFactors = F,row.names=1,check.names = F)
# alter sample names to match the names in the annotation file
names(gene.expr) <- substr(names(gene.expr),1,16)
row.names(sample.info) <- sample.info$Sample

# filter data to only include the samples with gene expression, one sample per patient and order them the same way as the gene expression
sample.info <- sample.info[sample.info$Sample %in% names(gene.expr) &
                             !duplicated(sample.info$bcr_patient_barcode),]
gene.expr <- gene.expr[,match(sample.info$Sample,names(gene.expr))]
gene.expr.list <- list(gene.expr=as.matrix(gene.expr))

mae <- MultiAssayExperiment(experiments=gene.expr.list,
    colData=sample.info)

Prepare data for netDx. Create sample group and sample ID variables

pam50 <- colData(mae)$PAM50
pam50[which(!(pam50 %in% "LumA"))] <- "notLumA"
pam50[which(pam50 %in% "LumA")] <- "LumA"
colData(mae)$pam_mod <- pam50
table(pam50)

# remove hyphens from patient ID
colData(mae)$ID <- colData(mae)$bcr_patient_barcode
colData(mae)$ID <- stringr::str_replace_all(colData(mae)$ID,"-","_")
colData(mae)$STATUS <- colData(mae)$pam_mod

Prepare pathways to group genes

pathway.dat <- readLines("Human_AllPathways_February_01_2018_symbol.gmt")
names(pathway.dat) <- sapply(pathway.dat,function(x) {
  s <- strsplit(x,"\t")
  return(s[[1]][1])
})

pathway.dat.list <- lapply(pathway.dat,function(x) {
  s <- strsplit(x,"\t")
  s <- s[[1]][3:length(s[[1]])]
  #only keep pathway genes that are in the gene expression data
  s <- s[s %in% row.names(assays(mae)[["gene.expr"]])]
  return(s)
})
#remove empty pathways
pathway.dat.list <- pathway.dat.list[sapply(pathway.dat.list,length)>0]

# Rename pathways to unique identifier:
names(pathway.dat.list) <- paste0("pathway_",1:length(pathway.dat.list))
grouplist <- list("gene.expr"=pathway.dat.list)

Define patient similarity measure. Only need one for gene expression

sims <- list("gene.expr"="pearsonCorr")

Build Predictor

nco <- round(parallel::detectCores()*0.50) # use 50% available cores
message(sprintf("Using %i of %i cores", nco, parallel::detectCores()))

outDir <- paste0(getwd(),"/luminalA_paperrep")
if (file.exists(outDir)) unlink(outDir,recursive=TRUE)

t0 <- Sys.time()
model <- buildPredictor(
    dataList=mae,          ## your data
    groupList=grouplist,    ## grouping strategy
    sims=sims,
    outDir=outDir,          ## output directory
    trainProp=0.8,          ## pct of samples to use to train model in
                                ## each split
    numSplits=1L,            # just do one split in order to test
    featSelCutoff=9L,       ## threshold for calling something
                                ## feature-selected
    featScoreMax=10L,    ## max score for feature selection
 numCores=nco,            ## set higher for parallelizing
 debugMode=FALSE,
 keepAllData=F 
  )
t1 <- Sys.time()
print(t1-t0)

Predictor started at:
2022-05-13 12:33:23
-------------------------------
# patients = 348
# classes = 2 { notLumA,LumA }
Sample breakdown by class
154194
1 train/test splits
Feature selection cutoff = 9 of 10
Datapoints:
        gene.expr: 17814 units
Similarity metrics provided:
$gene.expr
[1] "pearsonCorr"

-------------------------------

-------------------------------
Train/test split # 1
-------------------------------
         IS_TRAIN
STATUS    TRAIN TEST
  LumA      123   31
  notLumA   155   39
# values per feature (training)
        Group gene.expr: 17814 values
** Creating features
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
** Compiling features

** Running feature selection
        Class: notLumA

nonpred notLumA    <NA>
    123     155       0
        Scoring features
        Writing queries:

                155 IDs; 10 queries (139 sampled, 16 test)
                Q1: 16 test;  139 query
                Q2: 16 test;  139 query
                Q3: 16 test;  139 query
                Q4: 16 test;  139 query
                Q5: 16 test;  139 query
                Q6: 16 test;  139 query
                Q7: 16 test;  139 query
                Q8: 16 test;  139 query
                Q9: 16 test;  139 query
                Q10: 11 test;  144 query
Java 8 detected
QueryRunner time taken: 28.5 s
        Compiling feature scores

        Class: LumA

   LumA nonpred    <NA>
    123     155       0
        Scoring features
        Writing queries:

                123 IDs; 10 queries (110 sampled, 12 test)
                Q1: 12 test;  111 query
                Q2: 12 test;  111 query
                Q3: 12 test;  111 query
                Q4: 12 test;  111 query
                Q5: 12 test;  111 query
                Q6: 12 test;  111 query
                Q7: 12 test;  111 query
                Q8: 12 test;  111 query
                Q9: 12 test;  111 query
                Q10: 15 test;  108 query
Java 8 detected
QueryRunner time taken: 24.3 s
        Compiling feature scores

** Predicting labels for test
notLumA
        2500 feature(s) selected
        notLumA: Create & compile features
        Filter set provided
                gene.expr: 2500 of 3151 nets pass
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
        ** notLumA: Compute similarity
LumA
        2500 feature(s) selected
        LumA: Create & compile features
        Filter set provided
                gene.expr: 2500 of 3151 nets pass
Making nets from sims
        gene.expr
Layer gene.expr: PEARSON CORR
Pearson similarity chosen - enforcing min. 5 patients per net.
Net construction complete!
        ** LumA: Compute similarity

** Predict labels
Split 1: ACCURACY (N=0 test) = NaN%

----------------------------------------
Predictor completed at:
2022-05-13 12:40:48

Get results

results <- getResults(
    model,
    unique(colData(brca)$STATUS),
    featureSelCutoff=1L,
    featureSelPct=0.70
  )

Detected 1 splits and 2 classes
* Plotting performance
Error in ROCR::prediction(dat[, idx1] - dat[, idx2], dat$STATUS == predClasses[1]) :
  Number of classes is not equal to 2.
ROCR currently supports only evaluation of binary classification tasks.

Examine the buildPredictor() output

model[[2]][["predictions"]]

 [1] ID                             Sample
 [3] bcr_patient_barcode            Tissue
 [5] PAM50                          Triple.Negative
 [7] Batch                          Histology
 [9] UNC.microarray                 BCGSC.miRNAseqs
[11] Broad.CN                       Methylation
[13] WU.seq.20111103                RPPA
[15] Integration.Freeze             Integration.RPPA.Freeze.batch1
[17] pam_mod                        STATUS
[19] TT_STATUS                      notLumA_SCORE
[21] LumA_SCORE                     PRED_CLASS
<0 rows> (or 0-length row.names)

Also, I am able to run the vignette successfully.

BaderLab / netDx

Trouble with Reproducing Results in Publication #91