Gibbsdavidl / Immune-Subtype-Clustering

This repo contains the code necessary to reproduce the clusters found in "The Immune Landscape of Cancer".
19 stars 12 forks source link

Different result between iAtlas Tools and notebook #3

Open Tim0thy1 opened 5 years ago

Tim0thy1 commented 5 years ago

Hi,

I detected the immune subtypes using iAtlas Tools and immune subtypes notebook for same data(ebpp_test1_1to20.tsv). But the finalResults are different . How can it be?

result of notebook
         1         2         3         4         5         6 Call

XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4

result of tools

image

Gibbsdavidl commented 5 years ago

Hi there, Thank you for letting me know.

You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try?

During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part.

Or it's a bug. I'll check it out. -dave

On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 notifications@github.com wrote:

Hi,

I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be?

result of notebook

1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3 XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Tim0thy1 commented 5 years ago

Hi, No, this is example data. ( https://isb-cgc.shinyapps.io/shiny-iatlas/ ) image or https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/blob/master/ExtraData/ebpp_test1_1to20.tsv

Tim0thy1 commented 5 years ago

I upload my R script there. I copy that code from there

############################R code#########################

library(ggplot2) library(clue) library(mclust) library(parallel) library(stringr)

tcgaSubset <- read.table('/home/data/ebppSubset.tsv.bz2', header = T, sep = '\t', stringsAsFactors = F)

newdata

newDat <- read.table("/home/data/Immune-Subtype-Clustering-master/ExtraData/ebpp_test1_1to20.tsv", sep='\t', header=T, stringsAsFactors = F)

reportedScores <- read.table('~/data/five_signature_mclust_ensemble_results.tsv.gz', sep='\t', header=T, stringsAsFactors = F) rownames(reportedScores) <- reportedScores$AliquotBarcode

load('/home/data/comparative_immuneSigs_geneLists4.rda') source('/home/data/ImmuneSigs68_function.R') source('/home/data/signature_mclust_ensemble.R') load('/home/data/wolf_set_slim1.rda')

zscore.cols2<-function(x){ return((apply(x, 2, function(x) (x - median(na.omit(x)))/sd(na.omit(x))))) }

Data prep

tcga data is already upper quantile normalized

tcgaSubset <- log2(tcgaSubset + 1)

here we'll drop any duplicate genes, and make the rownames

be gene symbols

rownames(newDat) <- newDat[,1] didx <- !duplicated(as.character(rownames(newDat))) dat <- newDat[didx,-1]

TCGA data was upper quantile normalized (at the 75%) and multiplied by 1000

data.quantileExpressed <- apply(dat, 2, function(x){quantile(x[x>0], 0.75)}) datnorm <- as.data.frame(t( t(dat) / data.quantileExpressed ) ) * 1000

datlog2 <- log2(datnorm+1)

joining data sets

sharedGenes <- intersect(rownames(tcgaSubset), rownames(datlog2))

first median scale each data set

newDatSub <- datlog2[sharedGenes,]

getting things in the right order

tcgaSubsetSub <- tcgaSubset[sharedGenes,]

then join the new data and tcga data at the genes

joinDat <- cbind(newDatSub, tcgaSubsetSub)

pre-batch corrected table, for plotting

newdatSamples <- colnames(newDatSub) dat2idx <- 1:(ncol(newDatSub)-1) tcgaidx <- setdiff( (1: (ncol(joinDat)-1)), dat2idx)

sampleIdx <- c(dat2idx, sample(tcgaidx, size=200, replace = F)) preCombat <- joinDat[,sampleIdx] preCombatMelt <- reshape2::melt(preCombat) preCombatMelt$SampleSource <- ifelse(test = preCombatMelt$variable %in% newdatSamples, yes = "New Data", no="TCGA Data")

batch correction

library(sva) combatflag <- 0

if (combatflag) {

then batch correction between scores...

batch <- c(rep(1,ncol(newDatSub)), rep(2,ncol(tcgaSubsetSub)))
modcombat = model.matrix(~1, data=as.data.frame(t(joinDat)))
combat_edata = ComBat(dat=joinDat, batch=batch, mod=modcombat, 
                      par.prior=TRUE, prior.plots=FALSE, ref.batch = 2)

} else { combat_edata = joinDat
}

then look at the distribution of the data after batch correction

postCombat <- combat_edata[,sampleIdx] postCombatMelt <- reshape2::melt(postCombat) postCombatMelt$SampleSource <- ifelse(test = postCombatMelt$variable %in% newdatSamples, yes = "New Data", no="TCGA Data")

getting the medians of each gene

joinMeds <- apply(combat_edata, 1, median, na.rm=T)

here we median center each gene

join_med_scaled <- sweep(combat_edata,1,joinMeds,'-')

ls() rm(dat, dat2idx, data.quantileExpressed, datlog2, datnorm, didx, newDat, newdatSamples, postCombat, postCombatMelt, preCombat, preCombatMelt, sampleIdx, sharedGenes, tcgaidx, tcgaSubsetSub) gc()

library(pryr)

scores <- ImmuneSigs_function(join_med_scaled, sigs1_2_eg2,sigs12_weighted_means, sigs12_module_weights,sigs1_2_names2,sigs1_2_type2)

idx <- c("LIexpression_score", "CSF1_response", "TGFB_score_21050467", "Module3_IFN_score", "CHANG_CORE_SERUM_RESPONSE_UP") scores <- t(scores[idx,]) zscores <- zscore.cols2(scores)

make cluster calls using the models.

calls <- consensusEnsemble(mods2, zscores, 4, 128)

get the top scoring cluster for each sample

maxcalls <- apply(calls$.Data, 1, function(a) which(a == max(a))[1]) names(maxcalls) <- rownames(scores)

then we'll look at the new vs. old cluster calls for TCGA samples

sharedIDs <- intersect(reportedScores$AliquotBarcode, rownames(scores)) t1 <-table(Reported=as.numeric(reportedScores[sharedIDs, 'ClusterModel1']), NewCalls=as.numeric(maxcalls[sharedIDs]))

then we can align the new calls to calls from the manuscript.

reported <- 1:6 optcalls <- 1:6

for (i in reported) {

##for subtype i, where did most of the samples end up?
j <- which(as.numeric(t1[i,]) == max(as.numeric(t1[i,])))
##rename maxcall j <- i
optcalls[i] <- j

}

these are the re-mapped calls

alignedCalls <- sapply(maxcalls, function(a) which(a == optcalls)[1])

make sure it works on the re-called TCGA data

t2 <-table(Reported=as.numeric(reportedScores[sharedIDs, 'ClusterModel1']), NewCalls=as.numeric(alignedCalls[sharedIDs]))

assemble the results

jdx <- match(table=rownames(scores), x=colnames(newDatSub)) # index to new data scores pcalls <- calls$.Data[jdx,] # get that table rownames(pcalls) <- colnames(newDatSub) # name it from the new data pcalls <- pcalls[,optcalls]

pcalls <- cbind(pcalls, data.frame(Call=alignedCalls[jdx])) # bring in the aligned calls pcalls <- cbind(pcalls, zscores[jdx,]) # and the scores

finalResults <- list(AlignedCalls=alignedCalls[jdx], Table=t2, ProbCalls=pcalls)

In the finalResults,

$AlignedCalls are the immune subtypes - aligned with the TCGA manuscript

$Table is the TCGA reported immune subtypes (on TCGA samples) compared to what we just computed

$ProbCalls shows the probability of being in each subtype and signature scores, Call is the aligned call

load('/home/data/colnames_1to20_ebpp.rda') rownames(reportedScores) <- str_replace_all(reportedScores$AliquotBarcode, pattern = '\.', replacement = '-')

manuscriptCalls <- reportedScores[cx, 'ClusterModel1']

newCalls <- finalResults$AlignedCalls

save(finalResults,file="tmp.Robj")

Tim0thy1 commented 5 years ago

Hi there, Thank you for letting me know. You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try? During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part. Or it's a bug. I'll check it out. -dave On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 @.**> wrote: Hi, I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be? #####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3* XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Sorry, my English is not very good. I must have mistaken 'How can it be' meaning.

Gibbsdavidl commented 5 years ago

Your english is good. I read your question as “how is that possible?” ... which is fine. They should give mostly the same answer.

But this project has been very difficult for me.

:-)

On Wed, Apr 17, 2019 at 6:33 PM Tim0thy1 notifications@github.com wrote:

Hi there, Thank you for letting me know. You ask "How can it be?" Oh, anything is possible. Is this your own data? Is it something I can download and try? During the classification, the ensemble of classifiers makes a subtype call for each sample, then there's a step where all the calls must be merged. I think there's some randomness there, if a sample is on the border between clusters, it can bounce back and forth. I used the clue R package for that part. Or it's a bug. I'll check it out. -dave … <#m-4400710527188761948> On Tue, Apr 16, 2019 at 7:43 PM Tim0thy1 @.**> wrote: Hi, I detected the immune subtypes using iAtlas Tools and immune subtypes notebook. But the finalResults are different . How can it be? #####result of notebook###### 1 2 3 4 5 6 Call XY1 0.0000000 0.0000000 0.0312500 0.7265625 0.1562500 0.0859375 4 XY2 0.0000000 0.0000000 0.9375000 0.0000000 0.0234375 0.0390625 3 XY3 0.0078125 0.0000000 0.5781250 0.2578125 0.0312500 0.1250000 3* XY4 0.0078125 0.0156250 0.0000000 0.7500000 0.0390625 0.1875000 4 XY5 0.0000000 0.0078125 0.0625000 0.7109375 0.0937500 0.1250000 4 XY6 0.0078125 0.0234375 0.0000000 0.7500000 0.0000000 0.2187500 4 XY7 0.3046875 0.0625000 0.0000000 0.2109375 0.0000000 0.4218750 6 XY8 0.0234375 0.0000000 0.7968750 0.0156250 0.0390625 0.1250000 3 XY9 0.0156250 0.0234375 0.0000000 0.7812500 0.0234375 0.1562500 4 XY10 0.0156250 0.8984375 0.0000000 0.0156250 0.0000000 0.0703125 2 XY11 0.0078125 0.0156250 0.0000000 0.7500000 0.0078125 0.2187500 4 XY12 0.0000000 0.0000000 0.5000000 0.3828125 0.0781250 0.0390625 3 XY13 0.0000000 0.0000000 0.0000000 0.9531250 0.0078125 0.0390625 4 XY14 0.0078125 0.0156250 0.0000000 0.7656250 0.0156250 0.1953125 4 XY15 0.0000000 0.0000000 0.0000000 0.9453125 0.0156250 0.0390625 4 XY16 0.0078125 0.0000000 0.9453125 0.0000000 0.0078125 0.0390625 3 XY17 0.0390625 0.1875000 0.0000000 0.4687500 0.0000000 0.3046875 4 XY18 0.0156250 0.1171875 0.6250000 0.0546875 0.0000000 0.1875000 3 XY19 0.0078125 0.0156250 0.0000000 0.7812500 0.0390625 0.1562500 4 XY20 0.0468750 0.0234375 0.0000000 0.7031250 0.0000000 0.2265625 4 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#3 https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3>, or mute the thread https://github.com/notifications/unsubscribe-auth/AAiSyQOyKUNRDmtEiTLY4ic5DRYHXdcUks5vhopLgaJpZM4c0E_M .

Sorry, my English is not very good. I must have mistaken 'How can it be' meaning.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-484321767, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSOFE6TZ7XHQRJYCSCTPQ7FWHANCNFSM4HGQJ7GA .

Tim0thy1 commented 5 years ago

@Gibbsdavidl Thanks for your replay. The website tools and code on R notebook.
Which one do you suggest and why?

Gibbsdavidl commented 5 years ago

I just tried it and got only one difference in cluster label between the notebook and webapp (XY20).

One place that can cause differences is in the selection of 'ensemble size'. The clustering is done using a 'model based method'. And in this case, it was an ensemble of models. That can change the results.

They use the same code... shouldn't be many differences except those due to the random nature of combining across the ensemble.

If you want to run it many times, running it locally would probably be faster.

Thanks!

Tim0thy1 commented 5 years ago

@Gibbsdavidl Can you send your code to me. I want to campare your codes to my. By the way, I can't connect the webapp server all day...

Tim0thy1 commented 5 years ago

@Gibbsdavidl Can you get my message?

Gibbsdavidl commented 5 years ago

Hi there, Sorry about the delay. I just used the notebook in the repo with no changes... "Call_immune_subtypes_on_new_data.ipynb https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/blob/master/Notebooks/Call_immune_subtypes_on_new_data.ipynb "

-dave

On Sun, Apr 28, 2019 at 5:59 PM Tim0thy1 notifications@github.com wrote:

@Gibbsdavidl https://github.com/Gibbsdavidl Can you get my message?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-487430435, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSK3YEX6TAOHCEIGBA3PSZB53ANCNFSM4HGQJ7GA .

Gibbsdavidl commented 5 years ago

Webapp seems to be working OK today. Thanks for letting us know. -dave

On Tue, Apr 23, 2019 at 12:18 AM Tim0thy1 notifications@github.com wrote:

@Gibbsdavidl https://github.com/Gibbsdavidl Can you send your code to me. I want to campare your codes to my. By the way, I can't connect the webapp server all day...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Gibbsdavidl/Immune-Subtype-Clustering/issues/3#issuecomment-485674280, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJFSP5SZ6IKWCFUXJVPIDPR2Z3BANCNFSM4HGQJ7GA .

Tim0thy1 commented 5 years ago

That is very strange. I will try the code on notebook again.
We can communicate with E-mail next time.