fogellab / multiWGCNA

an R package for deep mining gene co-expression networks in multi-trait expression data
13 stars 2 forks source link

Preservation p-values #7

Closed GettyScience closed 5 months ago

GettyScience commented 8 months ago

Hello,

I have been working with your program for a number of months and things are going really well. I am in the homestretch for my research and am trying a few things for better results. I have found good results from the preservation analysis, some modules only existing in some groups, based on the plots. However, when I write the results to a sheet I can use later, the p-value used to determine significance is not listed. I can see the preservations score, as well as others, but not the p-value.

Is there a way to export that significance value for the preservation tests? And what is the meaning of the other values given in this export? Are they used to calculate the significance?

Any help is appreciated.

Thank you,

GettyScience commented 8 months ago

Hello, I am just following up on this request. Thank you,

dariotommasini commented 8 months ago

Hi @GettyScience

Sorry for the late reply.

Short answer: You need to run diseasePreservationPTest function. However, I removed it at some point when submitting to Bioconductor because it had some issues. Give me a couple days to add it back to the development release, then please re-install multiWGCNA from github using devtools. I'll add an example in the documentation to make it clear how to run the function.

Long answer: The preservation comparisons function simply runs the WGCNA preservation analysis on the WGCNAObject list supplied. The WGCNA preservation analysis does not report p-values, it reports a number of statistics that are often summarized using the z-summary preservation score. What we and others have done in the past to see if a preservation score is lower or higher than expected by chance is implement a permutation test where you derive the null distribution of preservation scores for the dataset by shuffling the labels of interest. For example, in the multiWGCNA paper, we observe a preservation score of about 9 for our module of interest (dM15 from disease data) in the wildtype data. The question is this: what is the probability of getting the observed preservation score if we randomly split the dataset rather than using the true phenotype labels (disease vs wildtype). It turns out that the expected preservation score for a module the size of dM15 is around 20. In fact, out of the 2000 permutations we ran, only 39 of them were less than or equal to the score we observed, which corresponds to a p-value of 39/2000 = 0.0195. Since each permutation requires a network construction on split1 and a preservation on split2, this is a very slow step, which is why it is not part of the standard multiWGCNA pipeline. I recommend running this analysis on a computing cluster.

Hope that helps

GettyScience commented 8 months ago

Okay, thank you! I was confused when the plot constructed by the preservation analysis contained a p-value, but not the results sent to text. The plot I produced was unhelpful simply due to overlap (so not all modules had a discernible name attached).

dariotommasini commented 7 months ago

Hi @GettyScience,

I'm not sure what p-values you're referring to. Which command generated those plots? Could you upload the plot here?

I apologize, re-implementing the permutation test is going to take me a bit longer than expected because some of the functions I used are deprecated. Rather than slow you down, in the meantime here is the code I used, which could serve as a guide for your analysis:

referenceDatExpr=read.csv(args[1])
design=read.csv(args[2])
nPermutations=500; refColumn=3; testColumn=2

    nCores=15
    registerDoParallel(cores=nCores)
    enableWGCNAThreads(nThreads=nCores)

    datExpr=referenceDatExpr[,match(design$Sample, colnames(referenceDatExpr))]

    wtDatExpr=list()
    dsDatExpr=list()
    WGCNAobjects=list()
    preservationData=list()
    filteredPreservationData=list()

    for(permutation in 1:nPermutations){
        #assign phenotype labels randomly
        WT.indices=list()
        conditions=unique(design[, refColumn])
        for(condition in conditions){
            conditionalDesign=design[design[, refColumn]==condition,]
            nSamples=nrow(conditionalDesign)
            nWT=nrow(conditionalDesign[conditionalDesign[, testColumn]=="WT",])
            sampleIndices=which(design[, refColumn]==condition)
            WT.indices=append(WT.indices, sampleIndices[sample(1:nSamples, nWT, replace=F)])
        }
        WT.indices=unlist(WT.indices)
        randomHealthy=datExpr[, WT.indices] 
        randomHealthy=data.frame(X=referenceDatExpr$X, randomHealthy)
        randomDisease=datExpr[, !(1:ncol(datExpr) %in% WT.indices)]
        randomDisease=data.frame(X=referenceDatExpr$X, randomDisease)

        #perform WGCNA
        dir.create("WGCNAs")
        setwd("WGCNAs")
        wtDatExpr[[permutation]] <- runWGCNA_minimal(randomHealthy, paste0("WtP", permutation), softPowerCalculation=F)
        dsDatExpr[[permutation]] <- runWGCNA_minimal(randomDisease, paste0("DsP", permutation), softPowerCalculation=F)
        setwd("..")
        gc()

        #perform preservation
        dir.create("preservation")
        setwd("preservation")
        preservationData[[permutation]]=list()
        preservationData[[permutation]][[1]] <- getPreservation(wtDatExpr[[permutation]], dsDatExpr[[permutation]], write=T)
        preservationData[[permutation]][[2]] <- getPreservation(dsDatExpr[[permutation]], wtDatExpr[[permutation]], write=T)
        setwd("..")
        save.image("presPermutation.RData")
        WGCNAobjects[[permutation]]=findOutlierModules(findModuleEigengenes(dsDatExpr[[permutation]]))
        filteredPreservationData[[permutation]]=preservationData[[permutation]][[2]][!rownames(preservationData[[permutation]][[2]]) %in% WGCNAobjects[[permutation]]@outlierModules,]
        dir.create("filteredPreservation")
        write.csv(filteredPreservationData[[permutation]], paste0("filteredPreservation/filt", permutation, ".csv"), row.names=F)
        gc()
        save.image("presPermutation.RData")
    }

z.summary.dist=list()
module.size=list()
for(permutation in 1:nPermutations){
    myPerm=filteredPreservationData[[permutation]]
    z.summary.dist[[permutation]]=myPerm$Zsummary[which.min(abs(myPerm$moduleSize-297))]
    module.size[[permutation]]=myPerm$moduleSize[which.min(abs(myPerm$moduleSize-297))]
}
summary=cbind(z.summary.dist, module.size)
write.csv(summary, "round6_summary.csv", row.names=F)

# Permutation p-value
z.summary.dist=unlist(z.summary.dist)
below=length(z.summary.dist[z.summary.dist<9.16261490617938])
probability= below/nPermutations
probability

save.image("presPermutation.RData")
GettyScience commented 7 months ago

I used the preservation comparison plot function. The output is plots similar to below:

condition1

dariotommasini commented 7 months ago

To clarify, the p-values on the x axis are derived from the standard WGCNA pipeline e.g. to run a correlation test between your input conditions and the module eigengene. They do not refer to any preservation aspect of the module. They only tell you whether the module is associated with the conditions or not. The preservation score is on the y-axis and that tells you if the module is preserved (Z > 10), weakly preserved (10 > Z > 2) or not preserved (Z < 2). The purpose of this plot is to quickly tell if any trait-associated modules are differetially preserved. For instance, your module 7 on the left might be.

Do make sure you increase the number of permutations, as the vignettes use few permutations for computational reasons. At least 50-100 permutations should be performed for accurate Z-summary preservation statistics.

GettyScience commented 7 months ago

Each time I use 1000 permutations for this test. I am still unable to use the data at times because in my more complex tests we have many modules that are differentially preserved and overlapping. So the number label gets lost.

To clarify, though, the p-value relates to whether or not the module matches a treatment/genotype? Why when I run the preservation comparison plot for my two genotypes does it only show "treatment" modules and when I run the treatment vs control version of this preservation code it only shows "pgm" (one of the genotypes)?

GettyScience commented 7 months ago
image
dariotommasini commented 7 months ago

Seems like you're getting way too many modules. You probably shouldn't get any more than 50, maybe 60 modules. What parameters are you supplying to constructNetworks? A normal call would be:

astrocyte_networks = constructNetworks(astrocyte_se, sampleTable, conditions1, conditions2, 
                                  networkType = "signed", TOMType = "unsigned", 
                                  power = 12, minModuleSize = 100, maxBlockSize = 25000,
                                  reassignThreshold = 0, minKMEtoStay = 0, mergeCutHeight = 0,
                                  numericLabels = TRUE, pamRespectsDendro = FALSE, 
                                  deepSplit = 4, verbose = 3)

Are you sure your sample size is large enough for analyzing col and pgm separately?

dariotommasini commented 5 months ago

Hi @GettyScience ,

I've finally implemented the PreservationPermutationTest function. If you're still interested in running the permutation test used in the paper, please see the updated astrocyte vignette which now has a detailed example of how to use this new function. Please let me know if there are any issues, as it's still very experimental.