Output representative expression profiles of the clusters

apcamargo commented 5 years ago

Hi Basel,

In many cases, it's very useful to use a prototypical expression profile of the clusters in downstream analysis (by measuring its correlation to an external variable, for instance). In WGCNA, the module eigengenes are usually used for this purpose.

It would be useful if Clust could output some kind of representation of the expression profile of each cluster. It could be the eigengene, median expression for each sample, trimmed mean etc.

What do you think?

BaselAbujamous commented 5 years ago

Hi Antônio,

I agree. I usually use mean expression. All the data needed to generate this is already available in the output, but it would be nice to provide it ready-made in a separate TSV file. I will consider this in future versions. I will keep this issue open until then.
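
As a rough illustration of what such a file could contain, here is a minimal sketch that writes one mean profile per cluster as TSV. It is not Clust's actual output format: the function name `mean_profiles_tsv`, the `Cluster` header and the dictionary-based inputs are all assumptions made for the example.

```python
import csv
import io
from statistics import fmean


def mean_profiles_tsv(expression, clusters, samples):
    """Return a TSV string with one row per cluster and one column per sample.

    expression: dict mapping gene ID -> list of values (one per sample)
    clusters:   dict mapping cluster name -> list of gene IDs
    samples:    list of sample names, in the same order as the values
    (All three structures are hypothetical, chosen for this sketch.)
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Cluster"] + list(samples))
    for name, genes in clusters.items():
        rows = [expression[gene] for gene in genes]
        # zip(*rows) walks the cluster sample by sample (column by column).
        profile = [fmean(column) for column in zip(*rows)]
        writer.writerow([name] + [f"{value:.4f}" for value in profile])
    return buffer.getvalue()


expression = {"g1": [1.0, 2.0], "g2": [3.0, 4.0], "g3": [5.0, 5.0]}
clusters = {"C1": ["g1", "g2"], "C2": ["g3"]}
print(mean_profiles_tsv(expression, clusters, ["s1", "s2"]))
```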

P.S. Thanks a lot for your recent edits (fixing requirements in README and changing transparency of the plots). I have merged them and they will be part of the next version of the pip-installed package. I liked the idea. Thanks for the contribution.

Basel

apcamargo commented 5 years ago

Do you think the mean expression in each condition is a good option? I imagine that there won't be many outlier values (as the expression needs to be at least similar to the cluster profile), but I feel that the average isn't robust enough.

Using trimmed means or medians seems better to me (I might be mistaken). I don't know if the eigengene is robust to outliers, but I think we can investigate it.

(You're welcome! I really appreciate the effort you put into Clust and I'm willing to help you from a user perspective.)

BaselAbujamous commented 5 years ago

Maybe the trimmed mean makes sense. As the algorithm aims to take out any outliers anyway, the trimmed mean and the ordinary mean should be similar. To be on the safe side, I would use the trimmed mean approach as you suggested.

Your help is much appreciated indeed, whether through ideas or direct edits.

apcamargo commented 5 years ago

I did a quick experiment here. I got the values of the C1, C2 and C3 clusters from the D1 dataset and computed representative profiles using four methods: eigengene, mean, trimmed mean and median. I then calculated the sum of the absolute differences between the representative profiles and the true values.
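
For anyone who wants to repeat this kind of comparison, a minimal sketch follows. It is not the code behind the attached PDF: the helper names (`trimmed_mean`, `profile`, `abs_difference`), the toy values and the 20% trim are assumptions made for illustration, and the eigengene is left out here because it needs an SVD.

```python
from statistics import fmean, median


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` (< 0.5) of the values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    return fmean(ordered[k:len(ordered) - k])


def profile(cluster, summary):
    """Apply `summary` to each sample (column) of a genes x samples table."""
    return [summary(column) for column in zip(*cluster)]


def abs_difference(cluster, representative):
    """Sum of |value - representative| over all genes and samples."""
    return sum(
        abs(value - rep)
        for row in cluster
        for value, rep in zip(row, representative)
    )


# Toy cluster (rows = genes, columns = samples); values are made up.
cluster = [
    [1.0, 2.0, 3.0],
    [1.2, 2.1, 2.9],
    [0.9, 1.8, 3.2],
    [1.1, 2.2, 3.1],
    [3.0, 2.0, 3.0],  # one gene with an outlying value in the first sample
]

summaries = [
    ("mean", fmean),
    ("trimmed mean", lambda values: trimmed_mean(values, 0.2)),
    ("median", median),
]
for name, summary in summaries:
    representative = profile(cluster, summary)
    print(f"{name:>12}: {abs_difference(cluster, representative):.2f}")
```

A lower score means the representative profile sits closer to the individual gene profiles.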

It seems that taking the median was the best strategy (median > trimmed mean > mean > eigengene). I may be computing the eigengene wrong, though.

We could test if that remains true with the D2 and D3 datasets.

clust_test_representative_profiles.pdf

What do you mean by "take out outliers"? Does Clust explicitly remove outliers, or do you mean that a gene with an outlier value in a given sample simply wouldn't be clustered during the k-means step?

apcamargo commented 5 years ago

It seems I was computing the eigengenes wrong after all. It looks like eigengenes are by far the best way to build a representative expression profile for the clusters.

eigengene >>>> median > trimmed mean > mean

clust_test_representative_profiles_v2.pdf

apcamargo commented 5 years ago

I did a Python implementation of the eigengene computation (and some plots comparing it to the medians, trimmed means and means).
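
For readers without the PDF, here is a minimal sketch of one common way to compute an eigengene with NumPy: standardize each gene across samples and take the first right singular vector of the resulting genes x samples matrix. This is an assumption about the general approach, not the code from the attachment, and the function name `eigengene` and the toy data are hypothetical.

```python
import numpy as np


def eigengene(cluster):
    """First principal component of a genes x samples matrix (sign not fixed)."""
    x = np.asarray(cluster, dtype=float)
    # Standardize each gene (row) to zero mean and unit variance across samples.
    centered = x - x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, ddof=1, keepdims=True)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    z = centered / std
    # Rows of vt are the right singular vectors, with one value per sample.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return vt[0]


# Toy cluster: three genes following the same pattern at different scales.
pattern = np.array([1.0, 2.0, 3.0, 4.0])
cluster = np.vstack([2 * pattern + 1, 0.5 * pattern, 3 * pattern - 2])
print(eigengene(cluster))
```

The returned vector has unit length, so it describes the shape of the cluster profile rather than its expression scale.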

Python_eigengenes.pdf

BaselAbujamous commented 5 years ago

This is some great effort, Antônio! Thanks a lot!

I can see your point, and I believe I will incorporate that in the next version of Clust! I may test it over some other datasets as well (with higher dimensions, maybe).

Your input is much appreciated and will definitely make using Clust a better experience for users!

Thanks again! Basel

apcamargo commented 5 years ago

You're welcome!

I tested the eigengene on one of my datasets and it performed better again.

There's one important thing that I left out of those PDFs. The eigengene may be computed with the opposite sign relative to the true expression pattern of the coexpression module (that's why I put a minus sign in front of the SVD function). I sent an email to one of WGCNA's developers and he told me that their function "automatically adjusts the sign so that the resulting module eigengene has a positive correlation with the mean gene expression values of the module".

Here's their code:

        {
          if (verbose>4) printFlush(paste(spaces,
                          " .. aligning module eigengene with average expression."))
          corAve = cor(averExpr[,i], PrinComps[,i], use = "p");
          if (!is.finite(corAve)) corAve = 0;
          if (corAve<0) PrinComps[,i] = -PrinComps[,i]
        }

This should be really easy to implement in Python for Clust. If you want to, I can work on a PR.
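
A Python version of the same check might look roughly like this. It is only a sketch, not the code of any PR: the function name `align_with_average` is hypothetical, the plain per-sample mean of the cluster stands in for `averExpr`, and missing values (which `use = "p"` handles in R) are not dealt with.

```python
import numpy as np


def align_with_average(eigengene, cluster):
    """Flip the eigengene if it correlates negatively with the cluster average.

    eigengene: 1-D array with one value per sample
    cluster:   genes x samples matrix the eigengene was computed from
    """
    eigengene = np.asarray(eigengene, dtype=float)
    average = np.asarray(cluster, dtype=float).mean(axis=0)  # mean per sample
    # A constant vector has zero variance, so the correlation comes out as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        cor = np.corrcoef(average, eigengene)[0, 1]
    if not np.isfinite(cor):
        cor = 0.0  # same fallback as the R code: leave the sign untouched
    return -eigengene if cor < 0 else eigengene


cluster = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
flipped = [0.7, 0.0, -0.7]  # points the wrong way relative to the cluster
print(align_with_average(flipped, cluster))
```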

BaselAbujamous commented 5 years ago

Sorry for the late response. I would be very happy for you to work on a PR! Thanks.

taylorreiter commented 5 years ago

Hello! Will this be implemented in clust soon? I see #20, and am wondering if it is possible to have this functionality integrated.

BaselAbujamous commented 4 years ago

Sorry for the very late response here. I have just merged the edits by @apcamargo that add this capability. Thanks a lot, @apcamargo.

apcamargo commented 4 years ago

Thanks @BaselAbujamous!

BaselAbujamous / clust

Output representative expression profiles of the clusters #16