cstoeckert / iterativeWGCNA

Extension of the WGCNA program to improve the eigengene similarity of modules and increase the overall number of genes in modules.
GNU General Public License v2.0

questions on parameters #25

Closed gudeqing closed 6 years ago

gudeqing commented 6 years ago
  1. Is the soft-thresholding 'power' self-determined by iterativeWGCNA? From my understanding of iterativeWGCNA, the parameter seems to be left at its default rather than self-determined. I really hope this parameter could be self-determined for simplicity. At the least, 'pickSoftThreshold' should be wrapped here. In addition, the parameter 'mergeCutHeight' should also be determined with the aid of cluster tree information.
  2. For a large dataset, say 50000 genes, with 15 threads, how much time will iterativeWGCNA take?
fossilfriend commented 6 years ago

Soft-thresholding 'power'

The power parameter defaults to 6, WGCNA's default for an unsigned network. However, you can change its value; see the section on setting WGCNA parameters in the README.

iterativeWGCNA is not a Python wrapper for the WGCNA R package in its entirety; if you wish to run the pickSoftThreshold function, you will need to do that directly in R.

iterativeWGCNA was designed to handle datasets that fail to meet the assumption of scale-free topology, even with a very high soft-thresholding power. In these cases, the pickSoftThreshold function is not particularly useful, and most folks either use the WGCNA defaults (power=6 for unsigned networks; power=12 for signed) or follow the suggestions in the WGCNA FAQ, section 6. When developing the iterativeWGCNA algorithm, we did experiment a bit with self-determining the parameter as the network topology changes, but saw little improvement in clustering over the defaults/suggestions. The only parameter that significantly affects iterativeWGCNA performance is the kME similarity cutoff, minKMEtoStay.
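To make "fails to meet the assumption of scale-free topology" concrete, here is a minimal sketch of the kind of fit index that pickSoftThreshold reports: bin the per-gene connectivities, then regress log10 of the bin frequency on log10 of the connectivity and look at R². This is only loosely modeled on WGCNA's calculation and is not a reimplementation; the function name `scale_free_fit`, the equal-width binning, and the synthetic Pareto-distributed connectivities are all assumptions made for illustration.

```python
import math
import random


def scale_free_fit(connectivity, n_bins=10):
    """Return (r_squared, slope) of log10(p(k)) regressed on log10(k).

    Assumes positive connectivities that are not all equal and that
    fall into at least two distinct bins.
    """
    lo, hi = min(connectivity), max(connectivity)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for k in connectivity:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((k - lo) / width), n_bins - 1)
        counts[idx] += 1

    xs, ys = [], []
    for i, count in enumerate(counts):
        if count == 0:
            continue  # log10(0) is undefined, so skip empty bins
        midpoint = lo + (i + 0.5) * width
        xs.append(math.log10(midpoint))
        ys.append(math.log10(count / len(connectivity)))

    # Ordinary least squares on the log-log points.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return r_squared, slope


# Synthetic, heavy-tailed connectivities (illustration only, not real data).
random.seed(0)
heavy_tailed = [random.paretovariate(2.0) for _ in range(2000)]
r2, slope = scale_free_fit(heavy_tailed)
print(f"R^2 = {r2:.2f}, slope = {slope:.2f}")
```

A network is usually treated as approximately scale-free when the slope is negative and R² is high; when no power gets R² there, the recommendation from pickSoftThreshold carries little information, which is the situation described above.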

Large datasets (50,000 genes)

The run-time for iterativeWGCNA depends on several factors, only one of which is the number of genes. Because the algorithm runs until convergence conditions are met, run times will vary greatly, as some datasets may reach convergence faster than others. The efficiency of WGCNA is limited by the computation of the topological overlap matrix because R does not take advantage of multi-threading when doing matrix multiplication. So, with your large dataset size, it will depend on the block size that you specify in the wgcnaParameters. Blocks of 15,000 genes may take as much as an hour each to compute the topological overlap, but I think that WGCNA analyzes blocks in parallel when threads are enabled.
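As a rough back-of-the-envelope sketch of the numbers above, the following estimates how many blocks a dataset splits into, the worst-case serial time for one pass over the topological overlap, and the memory for one dense block. The helper `tom_block_estimate` is hypothetical; its `max_block_size` argument mirrors WGCNA's `maxBlockSize`. WGCNA assigns genes to blocks by pre-clustering, so the real block count and sizes can differ, and the one-hour-per-block figure is just the upper bound quoted above.

```python
import math


def tom_block_estimate(n_genes, max_block_size, hours_per_block):
    """Return (n_blocks, serial_hours, gib_per_block) for one TOM pass."""
    n_blocks = math.ceil(n_genes / max_block_size)
    # A dense block of n genes needs an n x n matrix of 8-byte doubles.
    block_genes = min(n_genes, max_block_size)
    gib_per_block = block_genes ** 2 * 8 / 2 ** 30
    return n_blocks, n_blocks * hours_per_block, gib_per_block


n_blocks, serial_hours, gib = tom_block_estimate(50_000, 15_000, 1.0)
print(n_blocks, serial_hours, round(gib, 2))  # prints: 4 4.0 1.68
```

Because iterativeWGCNA reruns WGCNA until convergence, the total run time is this per-pass cost multiplied by a dataset-dependent number of passes, although later passes operate on fewer genes.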

The WGCNA FAQ makes a suggestion for speeding up these calculations (please note this is several years old; there may be better solutions available now; I'd check the R manual):

When constructing a network from a data set of a typical genomic size (i.e., between 10 000 and 30 000 genes or other variables), the most time consuming step is the calculation of Topological Overlap Matrix which involves multiplying matrices with tens of thousands of rows and columns. With a standard R distribution, this may take multiple hours even on a modern workstation since matrix multiplication in standard R does not take advantage of multi-threading (parallel execution). It is possible to speed up this process by a factor of 10-100 by installing a speed-optimized Basic Linear Algebra Subprograms (BLAS) library and compiling R against it. The process of compiling R against an enhanced BLAS library is described in the R installation and administration manual. Compiling R on Linux and Unix flavors is usually relatively simple and straightforward. On Mac OSX and (more so) on Windows it requires installing additional tools and packages. Although it is helpful to have administrator privileges to compile and install R, it is usually not necessary. See the R installation and administration manual for full details.

gudeqing commented 6 years ago
  1. Thanks for your kind reply. I have to mention that 'pickSoftThreshold' in WGCNA will give a suggested 'power', and it is usually effective.
  2. I was surprised and happy to see the sentence "iterativeWGCNA was designed to handle datasets that fail to meet the assumption of scale free topology, even with a very high soft-thresholding power." So, does this mean that differentially expressed genes (the number is usually small) are also suitable for iterativeWGCNA analysis? As you know, the authors of WGCNA do not suggest doing so.
  3. If the basic assumption of scale-free topology is ignored, does WGCNA still hold?
fossilfriend commented 6 years ago

The pickSoftThreshold function will always make a recommendation, even in cases when scale-independence cannot be approximated (i.e., the plot of threshold vs. R² never levels off and/or R² values > 0.8 cannot be reached). In these cases, even though power-law weighting does not impose scale-free topology on the network, it should still improve the clustering, as it will down-weight weaker correlations, reducing network connectivity.
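A small sketch of that down-weighting, using the standard WGCNA adjacency definitions: |cor|^power for unsigned networks and ((1 + cor) / 2)^power for signed ones. The function names and the sample correlations are illustrative only.

```python
def unsigned_adjacency(correlation, power=6):
    """Unsigned WGCNA adjacency: |cor| ** power."""
    return abs(correlation) ** power


def signed_adjacency(correlation, power=12):
    """Signed WGCNA adjacency: ((1 + cor) / 2) ** power."""
    return ((1 + correlation) / 2) ** power


for r in (0.3, 0.6, 0.9):
    print(r, round(unsigned_adjacency(r), 4), round(signed_adjacency(r), 4))
```

At power 6, a correlation of 0.9 keeps an unsigned adjacency of about 0.53 while a correlation of 0.3 drops to about 0.0007, so the strong edge is 729 times heavier than the weak one rather than 3 times. That separation happens whether or not the resulting network is scale-free.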

Differentially expressed genes

I have not evaluated the performance of iterativeWGCNA on a dataset filtered for differentially expressed genes, so I cannot say for certain whether it would outperform WGCNA.

The authors of WGCNA do not recommend running WGCNA on a gene set filtered for differential expression, expecting that the result will be a limited number (likely one) of highly correlated modules because the sub-network will not be scale-free. So it is possible that iterativeWGCNA will outperform WGCNA, as it is better at detecting nested modules. However, I would also argue that how well either algorithm works depends on the experimental design and the data itself. For example, in a time series, genes consistently regulated (differentially expressed) across one transition may vary in their expression across the rest of the sampled time points, yielding a sub-network for which scale-independence can be approximated. When comparing one tissue type to another, however, I would side with the authors of WGCNA and argue against filtering by differential expression.

One motivation for iterativeWGCNA was to bypass the recommended step of a priori filtering an expression dataset to improve the clustering (e.g., typical filters by variance, mean expression, etc.). Instead, iterativeWGCNA does its own filtering via goodness-of-fit tests, leaving you at the end with residuals to the classification.
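The goodness-of-fit idea can be sketched as a kME filter: correlate each gene's expression profile with its module eigengene and move genes below the cutoff to the residuals. This is a simplified illustration, not the iterativeWGCNA code. The eigengene is passed in as a plain list (WGCNA derives it as the first principal component of the module), and the gene names, profiles, and the 0.8 cutoff are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def prune_module(expression, eigengene, min_kme_to_stay):
    """Split a module's genes into (kept, residuals), each mapping gene -> kME."""
    kept, residuals = {}, {}
    for gene, profile in expression.items():
        kme = pearson(profile, eigengene)
        target = kept if kme >= min_kme_to_stay else residuals
        target[gene] = kme
    return kept, residuals


# Hypothetical module: five samples, three genes.
eigengene = [1, 2, 3, 4, 5]
module = {
    "geneA": [2, 4, 6, 8, 10],  # tracks the eigengene exactly
    "geneB": [5, 1, 4, 2, 3],   # poorly correlated
    "geneC": [1, 2, 4, 3, 5],   # mostly tracks the eigengene
}
kept, residuals = prune_module(module, eigengene, min_kme_to_stay=0.8)
print(sorted(kept), sorted(residuals))  # prints: ['geneA', 'geneC'] ['geneB']
```

Raising the cutoff gives tighter modules and more residual genes; lowering it does the opposite, which is why this parameter has the largest effect on the result.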

An approach we have taken in our own analyses is to perform iterativeWGCNA on the unfiltered dataset and then evaluate whether sets of differentially expressed genes are non-randomly distributed among the modules.
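One common way to test for such non-random distribution is a one-sided hypergeometric (over-representation) test per module. The thread does not say which test was actually used, so treat this as an assumed approach; the function name and all gene counts below are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(n_total, n_de, module_size, de_in_module):
    """P(X >= de_in_module) when module_size genes are drawn without
    replacement from n_total genes, n_de of which are differentially expressed."""
    upper = min(n_de, module_size)
    favorable = sum(
        comb(n_de, k) * comb(n_total - n_de, module_size - k)
        for k in range(de_in_module, upper + 1)
    )
    return favorable / comb(n_total, module_size)


# Hypothetical counts: 1000 classified genes, 100 of them differentially expressed.
modules = {"M1": (50, 20), "M2": (200, 18)}  # name -> (module size, DE genes in module)
p_values = {
    name: hypergeom_upper_tail(1000, 100, size, de_count)
    for name, (size, de_count) in modules.items()
}
for name, p in p_values.items():
    print(name, f"{p:.3g}")
```

With many modules, the per-module p-values should be corrected for multiple testing, for example by multiplying each by the number of modules (Bonferroni).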

Ignoring the scale-free assumption

Lack of scale-free topology does not invalidate the data or preclude applying WGCNA; you just need to be aware that it will not perform well: high network connectivity leads to poorly resolved clusters that group genes with non-coherent expression. Improving on this was the primary motivation for developing iterativeWGCNA.