BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

Unexpected clustering behavior #31

Closed taylorreiter closed 5 years ago

taylorreiter commented 5 years ago

Hello! I'm working with time series gene expression data from a metatranscriptome (~10 species). I have biological duplicate gene expression data from 5 days from each of 15 sites. In this case, I'm trying to use clust to see what similarities there are between all samples, so I treated all sites as biological replicates (30 samples from day 1, 30 samples from day 2, ... , 30 samples from day 5).

If I cluster ~38k genes from about 10 species, I get this first set of clusters. Clusters_profiles.pdf

I was suspicious of cluster 0, so I subsetted my gene expression data to only the genes in the first cluster and re-ran clust. This produced this result. Clusters_profiles.pdf

Why are these profiles grouped together in the first place? Is it possible to make clust more stringent so that these profiles are not grouped together?


Normalization: I chose to normalize my data using edgeR. I ran the following:

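library(edgeR)

# Load the raw count matrix (genes in rows, samples in columns)
counts <- read.csv("outputs/counts/all_counts.csv", row.names = 1)
y <- DGEList(counts = counts)
# Keep genes with CPM > 1 in at least 2 samples
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
head(y$counts)
y$samples
dim(y$counts)
# Compute TMM normalization factors
y <- calcNormFactors(y)

# Note: cpm() is called on the raw `counts` here, not on the filtered and normalized `y`
norm_counts <- cpm(counts, normalized.lib.sizes = FALSE)
head(norm_counts)
write.csv(norm_counts, "sandbox/clust/edgeR_cpm.csv", quote = F)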

I then ran clust like this:

clust -o all_out_edger -n 101 3 4 -r edger-reps.txt edgeR_cpm.csv 

I get similar results when I do not use cpm data, i.e.:

clust -o all_out_ -r reps.txt all_counts.csv 

I am using clust Version v1.8.12

BaselAbujamous commented 5 years ago

Hi

Thanks for using Clust and for your question. Here are a couple of issues related to your question:

1. Normalisation issues: If you don't tell Clust which normalisation techniques it should apply, it will by default automatically detect the best normalisation techniques and apply them. Whenever automatic normalisation is applied by Clust, the first technique to be applied is quantile normalisation, which transforms the distributions of all samples/replicates to be equal to each other. Quantile normalisation makes sense when you submit an entire transcriptome to it, as it is biologically assumed that the entire transcriptome as a whole will always have the same distribution of expression values. So, when you submitted all of the ~38K genes, Clust decided to do quantile normalisation first and then whatever other normalisation techniques your data seemed to need, and this gave you reasonable clusters. However, when you submitted only the C0 subset to Clust, it also applied quantile normalisation to that subset, and this is where the problem started. All C0 genes are down-regulated, which means that all of them have relatively high expression in the first sample (day 1) and low expression in the last sample (day 5). But quantile normalisation will assume that this shift in distribution between the first and the last samples is a data production bias and will "think" it is fixing it by shifting those distributions back to be equal. Therefore, the clusters generated by applying Clust to the C0 genes in this way, with quantile normalisation, are simply artefacts!

Nonetheless, if you still want to apply clustering to this subset of genes, force Clust NOT TO USE quantile normalisation over this subset. The quantile normalisation code is 101, so don't use it over the C0 subset. You can take the C0 already-normalised profiles from the file that Clust generated and dumped in the Results folder within the Processed_Data sub-folder. These will be ready to be reclustered without any further normalisation. So use the parameter -n 0 to tell Clust that no normalisation should be applied to this data. This would generate sub-clusters of C0 in a way that is numerically correct.

2. The tightness parameter: If you would like to ask Clust, in the first place, to produce more stringent clusters, use the tightness parameter -t, which is set to 1.0 by default. If you want tighter (more stringent) clusters, use larger values of -t, for example -t 5 or -t 10. If you want looser (less stringent) clusters, use smaller values, for example -t 0.5 or -t 0.1.
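To make the quantile normalisation point in item 1 concrete, here is a minimal sketch of a generic, textbook quantile normalisation applied to a made-up subset of down-regulated genes. It is not Clust's own code, and Clust's implementation may differ in details such as tie handling. The function name `quantile_normalise` and the expression values are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: a generic quantile normalisation, not Clust's own code.

def quantile_normalise(columns):
    """Force every sample (column) to share one distribution.

    columns: list of samples, each a list of expression values (one per gene).
    Each value is replaced by the mean, across samples, of the values that
    hold the same rank. Ties are broken by gene order, which is enough here.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value over all samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns) for k in range(n_genes)
    ]
    normalised = []
    for col in columns:
        # Gene indices ordered from lowest to highest expression in this sample.
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalised.append(out)
    return normalised


# Hypothetical subset of four down-regulated genes: high on day 1, low on day 5.
day1 = [100.0, 80.0, 60.0, 40.0]
day5 = [10.0, 8.0, 6.0, 4.0]

norm_day1, norm_day5 = quantile_normalise([day1, day5])

print(norm_day1)
print(norm_day5)
```

In this toy subset, every gene drops tenfold between day 1 and day 5, yet after quantile normalisation the two samples carry identical values, so the genuine down-regulation is gone. Any structure found by clustering such data would come from the normalisation rather than from the biology.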

I hope this helps. But please let me know if this still doesn't solve your problems, and please feel free to come back to me with further questions.

P.S. I like the look of the clusters generated for your data. They look nice.

All the best, Basel

taylorreiter commented 5 years ago

Thank you so much for your response! This was really helpful, and I now understand the results I got. I varied the tightness parameter and was able to come away with results I was happier with (e.g. tighter clusters).

I'm curious about another comment you made, however: "Quantile normalisation makes sense when you submit an entire transcriptome to it, as it is biologically assumed that the entire transcriptome as a whole will always have the same distribution of expression values."

Since I am working with a metatranscriptome, and in my case I know that some organisms are more active on day 1 than any other day, do you have an intuition for whether my data might violate the assumption of quantile normalization (that the entire transcriptome as a whole will always have the same distribution of expression values)? If so, do you have a recommendation for which normalization method I should use?

BaselAbujamous commented 5 years ago

Great to hear you are getting some results that you are happier with.

Regarding quantile normalisation, you are right. If you expect the entire transcriptome at some time point to be naturally shifted compared to the entire transcriptome at other time points, quantile normalisation might not be the best thing to apply. You can still apply z-scores (normalisation code 4), and you can still take the log of the data before z-scores if needed (normalisation code 3). There are no other built-in normalisation techniques in Clust that I can think of as being relevant to your situation. Looking forward to seeing your results (probably published), and please come back with any other questions :)
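As a rough illustration of what taking the log and then z-scoring does to a single gene's profile, here is a minimal sketch. It assumes a base-2 log, strictly positive values, and the population standard deviation; Clust's own implementation of codes 3 and 4 may differ on these points. The function name `log_then_zscore` and the numbers are hypothetical.

```python
import math
import statistics


def log_then_zscore(profile):
    """Log2-transform one gene's profile, then z-score it across time points."""
    logged = [math.log2(v) for v in profile]  # assumes strictly positive values
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)  # population SD; an assumption of this sketch
    return [(v - mean) / sd for v in logged]


# Two hypothetical down-regulated genes with the same shape but a tenfold
# difference in overall expression level (days 1 to 5).
gene_a = [64.0, 32.0, 16.0, 8.0, 4.0]
gene_b = [640.0, 320.0, 160.0, 80.0, 40.0]

z_a = log_then_zscore(gene_a)
z_b = log_then_zscore(gene_b)

print([round(v, 3) for v in z_a])
print([round(v, 3) for v in z_b])
```

Because each gene is scaled against its own mean and spread, the two genes end up with nearly identical profiles and the downward trend from day 1 to day 5 is preserved. Unlike quantile normalisation, nothing here forces the samples to share one distribution.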

All the best, Basel

taylorreiter commented 5 years ago

Thank you for all of your help! I look forward to writing up my results :)