cstoeckert / iterativeWGCNA

Extension of the WGCNA program to improve the eigengene similarity of modules and increase the overall number of genes in modules.
GNU General Public License v2.0

iterativeWGCNA tends to find more modules on data sets with a low number of samples #27

Open avkitex opened 6 years ago

avkitex commented 6 years ago

[attached image: number of modules detected by iterativeWGCNA across GEO data sets with different sample sizes]

avkitex commented 6 years ago

GEO data sets, different platforms; --wgcnaParameters maxBlockSize=10000,corType=bicor,power=6,minModuleSize=100

fossilfriend commented 6 years ago

This is interesting but hard to comment on without knowing more about your data. That being said, this result is not that surprising. Think of WGCNA as you would ordination methods (e.g., PCA). The more samples, the more robust the result and the less sensitive it should be to outliers. With smaller datasets single samples can be responsible for a large proportion of the overall variance in the dataset -- leading to (over)splitting. In one of our datasets, this allowed us to actually identify a contaminant signature that could then be filtered and the analysis rerun.

avkitex commented 6 years ago

As you can see here, analysis of the data sets with fewer than 25 samples often results in more than 20 modules. This is very likely oversplitting.

However, some other relatively large data sets yielded only 1-3 modules, which seems strange to me. Do these data indicate that the optimal sample size is >25?

avkitex commented 6 years ago

This question is connected with the second one (#28). For example, on the right of the picture you can see data set GSE53625 split into cancer and normal samples (179 paired samples each, analyzed as two separate data sets): normal has 19 modules and cancer has 8. All samples from this data set were normalized together. How could this affect the iterativeWGCNA analysis? Is it OK to see twice as many modules in normal tissue as in cancerous tissue? Normal membership: merged-0.05-membership.txt Tumor membership: merged-0.05-membership.txt

fossilfriend commented 6 years ago

You are covering a bit of ground with your post. In this response I'm going to address just the specifics of iterativeWGCNA; I will follow up separately on your specific questions about GSE53625.

Optimal Sample Sizes

No, your results do not inform on an optimal sample size for running iterativeWGCNA; they are a reflection of some characteristic (possibly heterogeneity) of the datasets.

We have successfully run iterativeWGCNA on a dataset with 11 samples and obtained biologically meaningful results. In this particular dataset we detected ~80 modules. We have had other users successfully run the algorithm with >200 samples, resulting in 50+ modules; another with 20 samples, resulting in 30 modules; and another of the same size yielding >100 modules. The number of detected modules is a reflection of the data, not the number of samples.

In general, however, I would not recommend running the algorithm with fewer than 10 samples, as it becomes difficult to put statistical confidence on the correlations when the sample size is that small.
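To make this concrete, here is a minimal R sketch (using WGCNA's corPvalueStudent helper with an illustrative correlation of 0.7, not a value from any real dataset) showing how the evidence for the same correlation changes with sample size:

library(WGCNA)
# Student t-based p-value for an illustrative correlation of 0.7
corPvalueStudent(0.7, nSamples = 5)    # large p-value: hard to distinguish from chance with 5 samples
corPvalueStudent(0.7, nSamples = 30)   # very small p-value with 30 samples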

Oversplitting

Every module detected by iterativeWGCNA is comprised of a group of coherently expressed genes.

There are no garbage bin clusters (i.e., no clusters of genes whose only similarity is that they are different from everything else), outside of the "UNCLASSIFIED" gene set.

Whether the smaller detected modules are biologically meaningful needs to be addressed on a case-by-case basis. As I mentioned earlier, in one of our test cases (which we plan to address in the paper), these modules detected a very real biological contaminant in a subset of our samples.

However, if you feel that for your particular dataset these modules simply split the data along non-significant replicate-level variation, you can try adjusting some of the parameters: selecting a higher cutoff for the WGCNA parameter minModuleSize so that these smaller modules are not detected, increasing the threshold for merging close modules (WGCNA parameter mergeCutHeight), or increasing the stringency (minKMEtoStay). The newest version of the code allows you to rerun the module merge stage with alternative parameter choices.
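For illustration only (the specific values below are assumptions, not recommendations), stricter choices can be passed in the same --wgcnaParameters format used above, e.g.:

--wgcnaParameters power=6,minModuleSize=200,mergeCutHeight=0.25,minKMEtoStay=0.8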

Number of detected modules

iterativeWGCNA is a refinement of WGCNA.

That means that the result will be, at the very least, what you would get from running WGCNA (with the same WGCNA parameter choices) on your dataset.

In fact, the result of the first iteration (pass1/i1 directory) is exactly what WGCNA's blockwiseModules function would produce with the --wgcnaParameters you specified. Subsequent passes should only add to that result, as they are designed to detect subtle variation occluded by the main signal in the dataset (detected and extracted in the first pass). The only place where iterativeWGCNA can really lose modules is during the final merge stage.
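As a rough check, here is a minimal R sketch of that first-pass equivalent, assuming expr is the same samples x genes matrix given to iterativeWGCNA and using the parameters from the --wgcnaParameters string quoted earlier in this thread:

library(WGCNA)
# expr: samples x genes expression matrix (assumed to match the iterativeWGCNA input)
net <- blockwiseModules(expr,
                        power = 6, corType = "bicor",
                        maxBlockSize = 10000, minModuleSize = 100,
                        numericLabels = TRUE)
table(net$colors)   # module sizes; label 0 corresponds to unassigned genes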

So if you are only detecting 1-3 modules, then I assume the algorithm is converging in the first pass after only one or two iterations. This suggests to me that there is something about your dataset overall -- maybe it is too heterogeneous -- that makes it a poor candidate for WGCNA analysis.

jcabanad commented 2 years ago

Hi,

I have run your software with a dataset that I had already used with WGCNA, with exactly the same parameters. In WGCNA I obtained 29 modules, but with your software only 2. Could you give me some advice to solve this issue?

This is the command I have used:

python run_iterative_wgcna.py -i /home/juditc/ADHD/GWAS_TDAH/TDAH/WGCNA/iterativeWGCNA/data_WGCNA_micro/datExpr_trans.txt --wgcnaParameters power=4,TOMType=unsigned,minModuleSize=30,reassignThreshold=0,mergeCutHeight=0.25,numericLabel=TRUE,pamRespectsDendro=FALSE,maxBlockSize=20000,networkType=signed -o /home/juditc/ADHD/GWAS_TDAH/TDAH/WGCNA/iterativeWGCNA/results_WGCNA_micr2/ -v

Best regards,

Judit

fossilfriend commented 2 years ago

Judit-

Unfortunately, I have to fall back on my standard response that this sort of issue is hard to troubleshoot without seeing the actual data. And this is a bit unusual -- most folks are worried that they get too many modules with iterativeWGCNA, while you are having the opposite concern. Without seeing your data, three possible explanations come to mind:

  1. Your data are noisy and/or there are a lot of "background" genes. Maybe some outliers (problematic samples or contaminant genes) led to oversplitting in WGCNA and to more genes being filtered out in iterativeWGCNA. Standard WGCNA (which keeps all genes) may oversplit the data into modules containing genes that aren't very similar to each other when there are issues w/the data.

  2. Parameter choices are affecting the clustering: e.g., a soft power threshold of 4 seems a rather unusual choice; it's been my experience that most folks go for the default of 6 or something higher based on their data (see the pickSoftThreshold sketch after this list). Your mergeCutHeight is also pretty steep. Why these values? It is possible that the effects of your parameter choices are amplified in iterativeWGCNA as it refines the clustering through multiple iterations.

  3. My guess: some combination of 1 & 2
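Regarding point 2, here is a minimal R sketch of how the soft power threshold is commonly chosen (assuming expr is your samples x genes matrix; networkType matches the signed network in your command):

library(WGCNA)
sft <- pickSoftThreshold(expr,
                         powerVector = c(1:10, seq(12, 20, 2)),
                         networkType = "signed")
sft$fitIndices   # commonly, pick the lowest power where the scale-free fit (SFT.R.sq) reaches ~0.8-0.9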

The following are some suggestions for trying to figure out what is going on:

How many genes were in those 2 modules?

Check your logs / summary files

The first iteration of iterativeWGCNA should give you almost (there is a bit of randomness) the same result as you would get from running WGCNA with the same parameters.
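If it helps, a quick R sketch for tallying modules from a membership file; the path and column names here are assumptions -- adjust them to match the actual layout of your output:

m <- read.delim("pass1/i1/wgcna-membership.txt")   # hypothetical location of the first-iteration membership file
table(m$Module)            # genes per module (column name is an assumption)
length(unique(m$Module))   # total number of module labels, including unclassified genes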

jcabanad commented 2 years ago

Hi,

Thank you very much for your kind and quick answer. I'll try to answer all your questions and try to figure out what is going wrong.

  1. Regarding the first point: using the WGCNA pipeline I have checked that I don't have either gene or sample outliers before network construction. However, it is possible that I have some "background" genes, because I didn't filter the microarray data by gene expression before starting the analysis, so I have 19004 genes at my starting point. I've seen in this tutorial that this is not an essential step, so I decided to keep all genes, but maybe this is something I can improve (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/index.html).

  2. I've used the parameters that the developers suggest in the same tutorial I mentioned above. As far as I've seen in the literature, the mergeCutHeight threshold I used is quite common. Regarding power, I chose this threshold based on the soft-thresholding function in WGCNA.

  3. I've checked the sizes of the identified modules: only 259 genes were assigned to modules, and the remaining 18752 genes aren't assigned to any module. I've double checked that all 19004 genes were included in the analysis, so it is not a problem with reading the file.

  4. I have reviewed wgcna-membership.txt (membership before pruning) -- and I only have 2 modules there as well.

Additionally, I have tried changing some parameters, such as reducing 'minKMEtoStay' and 'minCoreKME' to 0.7 and the minimum number of genes per module from 30 to 20. This increases the number of modules a little, but most genes are still not assigned to any module.

What I find most strange is that these results are very different from those of WGCNA. I'll continue working on this; I'd be very grateful for any advice you can give.

Best,

Judit