BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Too many missing genes #43

Closed wyim-pgl closed 4 years ago

wyim-pgl commented 5 years ago

Hi Basel,

I ran clust with 10686 genes, and all of the genes I am interested in are missing from the clusters. Is there any way to run it without filtering? Also, do you have any plans to add a way to set an optimal K, for example from mclust? Thanks. Won

clust Data/ -r Replicates.txt -n Normalisation.txt -cs 5

| Clust received 2 datasets with 10686 unique genes. After filtering, |
| 10686 genes made it to the clustering step. Clust generated 9 clusters |
| of genes, which in total include 829 genes. The smallest cluster |
| includes 16 genes, the largest cluster includes 270 genes, and the |
| average cluster size is 92 genes. |

BaselAbujamous commented 5 years ago

Hi

Thanks for your question. From many examples that I have seen, some genes of interest genuinely do not co-express with many other genes to form a cluster. Many of these observations suggest that they might be co-operating at the proteomic level rather than the transcriptional regulatory level.

However, to get less strict clusters, you can reduce the value of the -t parameter. By default, -t is 1.0; smaller values (e.g. 0.5, 0.1, or even 0.0) give clusters that contain more genes but are less tight, while larger values (e.g. 2.0, 5.0, or 10.0) give tighter clusters. Try adding -t 0.5 to your command, for example, and see if this solves your problem.
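As a rough sketch of the suggestion above, the snippet below prints (without running) the original command with a few different -t values, so the resulting cluster sizes can be compared. The output directory names (Results_t0.5 and so on) and the helper name clust_cmd are hypothetical, and it assumes -o sets the output directory as in the clust README; adjust to your own setup.

```shell
#!/bin/sh
# Dry-run sketch: build the clust command from the thread for a given
# tightness value. Results_t<value> is a hypothetical output directory name.
clust_cmd() {
  printf 'clust Data/ -r Replicates.txt -n Normalisation.txt -cs 5 -t %s -o Results_t%s\n' "$1" "$1"
}

# Print one command per tightness value; smaller -t means looser, larger clusters.
for t in 0.1 0.5 1.0; do
  clust_cmd "$t"
done
```

Once the printed commands look right, each one can be pasted into the terminal and run in turn, and the summary boxes of the runs compared to see how many genes end up in clusters.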

Regarding forcing clust to use a fixed K value, this is not an option because of how clust operates. I wouldn't call the K value from mclust the "absolute optimum"; rather, it is "optimum according to the criteria of mclust". Similarly, clust aims to identify the optimum K value automatically according to its own criteria. It is hard to claim an absolutely optimum K value a priori unless there is very strong evidence from domain knowledge or the data were synthesised as such. Many other clustering algorithms, such as WGCNA, MCA, and cross-clustering, also identify their "optimum K value" automatically. Which K value is the correct one? The answer is that each of them is correct according to the criteria that were used to find it.

The philosophy of clust is to extract, out of noisy datasets, optimum clusters that are "really co-expressed", that is, clusters whose expression profiles are highly correlated. In my opinion, this should reduce the many false positives you get from clustering algorithms that generate large but loose clusters, which might not look co-expressed on manual inspection.

I hope that this discussion helps you in what you are trying to achieve.

All the best and please feel free to come back to me with any further questions :)

Basel