BaselAbujamous / clust

Automatic and optimised consensus clustering of one or more heterogeneous datasets

pre-filtering for clust #15

Closed benyoung93 closed 5 years ago

benyoung93 commented 5 years ago

Hi Basel

Sorry, another really quick question about clust.

This is about pre-processing before inputting data to clust. For WGCNA (which I was using previously), I applied quite a harsh filter of cpm > 1 in 90% of my samples. I did this based on queries online, in papers, etc.

I was wondering, for clust, does such harsh pre-filtering need to be done? What I am currently leaning towards is cpm > 1 in each condition, because clust provides much tighter clusters with fewer genes (honestly, with the cpm > 1 in 90% of samples filter, the difference between WGCNA and clust is quite staggering).

Again, this is quite a simple question, which I understand is more on me to work out, but I was wondering what your thoughts would be, since you obviously have more in-depth knowledge of the filtering clust undertakes, and since clust has a more advanced algorithm which may be able to tolerate problems with low variance.

Also, I apologise if this is the wrong site for these rather simple questions. I am more than happy to ask them elsewhere if you prefer. I know GitHub is more for the programming and bug aspect.
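For readers unfamiliar with the filter described above, here is a minimal sketch of a "cpm > 1 in 90% of samples" rule. It is an illustration only, not code from clust or WGCNA: the function names and the `{gene: [counts per sample]}` layout are assumptions, and CPM is computed as plain counts per million of each sample's total, without any library-size normalisation factors.

```python
def cpm(counts):
    """Convert {gene: [raw counts per sample]} to counts per million (hypothetical helper)."""
    n_samples = len(next(iter(counts.values())))
    # Library size of each sample = sum of that sample's counts over all genes.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / lib_sizes[j] if lib_sizes[j] else 0.0
               for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_by_sample_fraction(cpm_table, threshold=1.0, min_fraction=0.9):
    """Keep genes whose CPM exceeds `threshold` in at least `min_fraction` of samples."""
    kept = {}
    for gene, row in cpm_table.items():
        passing = sum(1 for v in row if v > threshold)
        if passing / len(row) >= min_fraction:
            kept[gene] = row
    return kept


if __name__ == "__main__":
    table = {
        "A": [5.0] * 10,               # passes in 10/10 samples
        "B": [5.0] * 9 + [0.5],        # passes in 9/10 samples
        "C": [5.0] * 8 + [0.5, 0.5],   # passes in 8/10 samples
    }
    print(sorted(filter_by_sample_fraction(table)))
```

Lowering `min_fraction` gives a more lenient filter, which is the direction the rest of this thread argues for when clust is the downstream tool.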

I look forward to your response.

Ben

BaselAbujamous commented 5 years ago

Hi Ben

Thanks for asking this question. I guess GitHub is the place because it is about use cases of Clust.

Clust filters out genes as part of its process. However, if there are genes that definitely need to be filtered out, it might make sense to get rid of them before clustering. I wouldn't recommend filtering before Clust as harshly as you have been doing with WGCNA, as this might reduce the richness of the input.

However, it is recommended to filter out flat genes (genes that are zero all the time). This is Clust's default behaviour unless you switch it off with the --no-fil-flat option.

Also, genes that are really low in expression in all samples should be removed as their variations might simply be noise. This is important because the z-score calculations during normalisation will stretch out such noise. However, if the gene is genuinely expressed in some samples, z-score normalisation will not amplify the noise in the other samples.
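The point about z-scores can be illustrated with a small sketch. This is not clust's normalisation code; it assumes a plain per-gene z-score using the population standard deviation, and the two example profiles are made up.

```python
from statistics import mean, pstdev


def zscore(profile):
    """Per-gene z-score: subtract the mean, divide by the (population) standard deviation."""
    mu = mean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        # A perfectly flat gene has no variation to scale.
        return [0.0 for _ in profile]
    return [(v - mu) / sigma for v in profile]


# Hypothetical gene that is low everywhere: its variation is probably just noise.
noisy_low = [0.1, 0.3, 0.2, 0.4]
# Hypothetical gene with the same low values in two samples but real expression in two others.
expressed = [0.1, 0.3, 50.0, 60.0]

z_low = zscore(noisy_low)
z_expr = zscore(expressed)

if __name__ == "__main__":
    # Gap between the first two samples (raw gap is 0.2 in both genes).
    print("low-everywhere gene:", round(abs(z_low[1] - z_low[0]), 3))
    print("genuinely expressed gene:", round(abs(z_expr[1] - z_expr[0]), 3))
```

In the low-everywhere gene the 0.2 gap is stretched to well over one standard deviation, while in the genuinely expressed gene the same gap stays close to zero because the standard deviation is dominated by the real signal.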

You have two options of how to apply these filtering steps:

  1. to apply filtering yourself before submitting data to clust.
  2. to submit your raw data to clust and use the filtering options of clust (see https://github.com/BaselAbujamous/clust for details).

For example, the following will filter out genes that do not pass the expression value of 1 in at least 2 samples (conditions).

clust ... -fil-v 1 -fil-c 2 -fil-d 1
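If you prefer option 1 (filtering yourself), the rule expressed by this command can be approximated as below. This is a sketch of the rule as described in this comment, not clust's implementation: the function name is hypothetical, "passing" is assumed to mean greater than or equal to the threshold, and the -fil-d dataset count is not modelled because there is only one table.

```python
def filter_absolute(expr, min_value=1.0, min_conditions=2):
    """Keep genes whose value reaches `min_value` in at least `min_conditions` conditions.

    `expr` maps gene name -> list of expression values, one per condition.
    """
    kept = {}
    for gene, row in expr.items():
        n_passing = sum(1 for v in row if v >= min_value)
        if n_passing >= min_conditions:
            kept[gene] = row
    return kept


if __name__ == "__main__":
    expr = {
        "g1": [0.2, 1.5, 3.0],  # reaches 1 in two conditions, kept
        "g2": [0.0, 0.4, 2.0],  # reaches 1 in one condition only, dropped
        "g3": [0.0, 0.0, 0.0],  # flat zero gene, dropped
    }
    print(list(filter_absolute(expr)))
```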

Below is another way of doing it, which will filter out genes that do not pass the 25th percentile of gene expression values in at least 3 samples (conditions).

clust ... --fil-perc -fil-v 25 -fil-c 3 -fil-d 1
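A percentile-based version of the same rule might look like the sketch below. How clust computes the percentile internally is not stated in this thread, so the following are assumptions: the cutoff is taken over all values in the table pooled together, with linear interpolation between ranks, and "passing" means greater than or equal to the cutoff. The function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) of `values`, with linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def filter_by_percentile(expr, q=25, min_conditions=3):
    """Keep genes that reach the pooled q-th percentile in at least `min_conditions` conditions."""
    pooled = [v for row in expr.values() for v in row]
    cutoff = percentile(pooled, q)
    return {
        gene: row
        for gene, row in expr.items()
        if sum(1 for v in row if v >= cutoff) >= min_conditions
    }


if __name__ == "__main__":
    expr = {
        "low": [1, 2, 1, 2],
        "mid": [3, 9, 10, 2],
        "high": [20, 30, 40, 50],
    }
    print(list(filter_by_percentile(expr, q=25, min_conditions=3)))
```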

Always use -fil-d 1 with these options if you are using clust to analyse a single dataset.

Please feel free to discuss this further with me here.

All the best Basel

BaselAbujamous commented 5 years ago

By the way ...

If you let clust filter out and normalise your data, the filtered and normalised dataset will be given back to you as a tab-delimited file as part of the results.
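Loading such a tab-delimited file for downstream work can be sketched as follows. The exact file name and column layout of clust's output are not given in this thread, so the layout here (a header row, gene identifiers in the first column, one numeric column per sample) is an assumption, and the demo data are made up.

```python
import csv
import io


def read_expression_tsv(handle):
    """Read a tab-delimited expression table from an open text handle.

    Returns (sample_names, {gene: [values]}). Assumes a header row and
    gene identifiers in the first column.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [float(x) for x in row[1:]]
    return samples, table


if __name__ == "__main__":
    # In practice, pass open("path/to/processed_file.tsv", newline="") instead;
    # that path is a placeholder, not a real clust output name.
    demo = io.StringIO("Genes\tS1\tS2\ng1\t-1.0\t1.0\ng2\t0.5\t-0.5\n")
    samples, table = read_expression_tsv(demo)
    print(samples, table)
```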

benyoung93 commented 5 years ago

Hi Basel

Yes, thank you, that helped a lot.

I had been playing around with the built-in clust filtering versus my own cpm filtering. I do get some interesting differences between the cpm and built-in clust methods but, ultimately, the resulting GO enrichment and pathway analyses do not differ that substantially (which surprised me a little, given the massive changes I saw with cpm filtering and WGCNA).

As of right now, I do not know which method I am going to use. My ultimate goal is to obtain clusters, and then perform GO analysis (using topGO, if you are interested) and a pathway analysis incorporating padj and log2 fold change in IPA. I know how to do all of this; it is just a matter of following 'best practices' for the data.

Thank you for the response and the information on the filtering. It backed up what I was already feeling about this, and what the paper and the GitHub page mention as well. I only asked because I value your opinion and expertise on this.

If you have any queries for me please let me know.

Ben