Duplicate clusters - GitHub issues

carolynzy commented 3 years ago

Hi, I have a problem when using compare_clusters.pl to produce the pangenome matrix from the combined COG and OMCL results. Specifically, there are two clusters, 4450 and 2318, each of which contains a single genome/sample, and their protein sequences are identical. Both clusters exist in the COG and OMCL results folders. They were considered duplicates, and cluster 4450 was skipped when I ran compare_clusters.pl to produce the pangenome matrix. The command I used is as follows:

compare_clusters.pl -o intersection_COG_OMCL -m -d 304_f0_0taxa_algCOGe0,304_f0_0taxa_algOMCLe0

The relevant log output is below:

# /home/carolyn/get_homologues-x86_64-20210828/compare_clusters.pl -d 304_f0_0taxa_algCOGe0,304_f0_0taxa_algOMCLe0 -o intersection_COG_OMCL -n 0 -m 1 -t 0 -I -r 0 -s 0 -x 0 -T 0

# number of input cluster directories = 2

# parsing clusters in 304_f0_0taxa_algCOGe0 ...

# cluster_list in place, will parse it (304_f0_0taxa_algCOGe0.cluster_list)

# WARNING: skipping cluster 4450_PE-PGRS_family_prote...faa , seems to duplicate 2318_PE-PGRS_family_prote...faa

#WARNING: skipping cluster 5089_hypothetical_protein.faa , seems to duplicate 443_hypothetical_protein.faa

However, although the sequences are identical in these two clusters, the samples/genomes are different. So in the pangenome matrix, only one sample (from 2318) shows the presence of this protein, while the sample in 4450 shows its absence, as a result of 4450 being skipped.
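To make the effect concrete, here is a minimal Python sketch of a presence/absence matrix. It is not code from compare_clusters.pl; the genome names `genomeA` and `genomeB` and the helper `pangenome_matrix` are hypothetical and chosen only for illustration. It compares what happens when cluster 4450 is skipped with what happens when its member is counted together with cluster 2318.

```python
def pangenome_matrix(clusters, genomes):
    """Return {genome: {cluster: 0 or 1}} for the given clusters.

    clusters maps a cluster name to the set of genomes it contains.
    """
    return {
        genome: {name: int(genome in members) for name, members in clusters.items()}
        for genome in genomes
    }


# Hypothetical genome names; the thread does not give the real ones.
genomes = ["genomeA", "genomeB"]

# Cluster 4450 (genomeB) skipped as a duplicate of 2318 (genomeA).
skipped = pangenome_matrix({"2318": {"genomeA"}}, genomes)

# Both singletons counted under one cluster.
combined = pangenome_matrix({"2318": {"genomeA", "genomeB"}}, genomes)

print(skipped)
print(combined)
```

In the first matrix `genomeB` is marked absent for the protein even though it carries an identical sequence, which is the inconsistency described above.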

Maybe I'm confused but I think both samples should have this protein, which disagrees with the pangenome matrix result. Would you please clarify this? Thank you very much!

eead-csic-compbio commented 3 years ago

Hi, this is a complex situation which was explained in the original manual:

This is issued by compare_clusters.pl when it finds, usually singleton, clusters with identical sequences produced by the
COG or OMCL algorithms. This can happen when such clusters contain short sequences, or perhaps with composition 
biases, that yield few or even no BLAST hits when compared to all other sequences in a given setup. As these kinds of 
clusters can confound posterior analysis they are currently ignored by compare_clusters.pl.

You are right, this is confusing. Are you suggesting we put these singletons together in the same cluster as a correction? What do you think @pvinuesa ?

carolynzy commented 3 years ago

@eead-csic-compbio Thank you for the clarification! I agree that these clusters should be treated more carefully. My concern is that if only one cluster is kept in the pangenome matrix and the other one is excluded, I may wrongly conclude that only one sample contains this sequence. This could be a problem when analyzing differences in pangenome composition between groups, or when calculating the similarity between samples (though the impact could be very small). To make the results more consistent, would it be better to keep or exclude both clusters together? Maybe we could exclude all duplicated clusters of this type and write them to a separate file? I have very little experience with pangenome analysis, so I am not sure whether this is practical or reasonable. Thank you for helping me work this out.

eead-csic-compbio commented 3 years ago

I guess having a separate file with duplicated clusters is a possibility. Should we also leave those clusters out of the pangenome matrices by default? Do you agree @pvinuesa ?

vinuesa commented 3 years ago

This is an interesting and non-trivial discussion. In my experience, this situation is often encountered with singleton ISs (insertion sequences). I agree with @carolynzy's view that it is not reasonable to exclude one of the clusters. I like @eead-csic-compbio's proposal to list those cases in a separate file for the users to inspect in detail, if required. Removing such clusters from the pangenome matrix would be a very conservative approach, but is probably better than leaving out one of the duplicate clusters. We need to improve the strategy to handle these cases. Thanks @carolynzy for pointing out this issue.

carolynzy commented 3 years ago

@vinuesa You are welcome! I would be happy and eager to use the updated version.

eead-csic-compbio commented 3 years ago

Looking at the code, I recalled that a unique cluster key is built by sorting and joining the sequences after removing the first 3 and last 3 residues to avoid start and stop codon issues:

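# drop the first 3 and last 3 residues of each sequence in the cluster
@choppedseqs = map {substr($_,3,-3)} @clusterseqs;
# sort the trimmed sequences and join them into a single key
$clusterkey = join(' ',(sort(@choppedseqs)));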

For singletons, which in my experience account for most of these cases, there will be only one sequence. This means we can safely merge "duplicated" clusters into a larger cluster, while the original ones would be removed. This could be an option in compare_clusters.pl, but it worries me a little that this adds new clusters not produced directly by get_homologues. Any comments @pvinuesa?
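The merging idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the change that went into compare_clusters.pl: the function names `cluster_key` and `merge_duplicates`, the genome names, and the sequences are all hypothetical. The key mirrors the Perl snippet above (trim 3 residues from each end, sort, join), and clusters sharing a key are pooled under the first cluster name seen, with the duplicate groups returned separately so they could be written to a report file.

```python
from collections import defaultdict


def cluster_key(seqs):
    """Trim 3 residues from each end of every sequence, then sort and join."""
    chopped = [s[3:-3] for s in seqs]
    return " ".join(sorted(chopped))


def merge_duplicates(clusters):
    """Pool clusters that share the same key.

    clusters maps a cluster name to a list of (genome, sequence) pairs.
    Returns (merged, duplicates): merged has one entry per distinct key,
    duplicates lists the groups of cluster names that were pooled.
    """
    names_by_key = defaultdict(list)
    for name, members in clusters.items():
        key = cluster_key([seq for _, seq in members])
        names_by_key[key].append(name)

    merged, duplicates = {}, []
    for names in names_by_key.values():
        # Keep the first cluster name and append members of the others.
        merged[names[0]] = [m for n in names for m in clusters[n]]
        if len(names) > 1:
            duplicates.append(names)
    return merged, duplicates


# Made-up singleton clusters: 2318 and 4450 hold identical sequences
# from different (hypothetical) genomes; 443 is unrelated.
clusters = {
    "2318": [("genomeA", "MKTAYIAKQRQISFVKSHFSRQ")],
    "4450": [("genomeB", "MKTAYIAKQRQISFVKSHFSRQ")],
    "443": [("genomeA", "MSDNELLKRAGGTWWPQ")],
}

merged, duplicates = merge_duplicates(clusters)
print(duplicates)
print(sorted(genome for genome, _ in merged["2318"]))
```

With singletons the pooled cluster simply lists every genome carrying the sequence. For multi-member clusters with equal keys the same pooling would repeat each sequence, so a real implementation might restrict merging to singletons.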

eead-csic-compbio commented 3 years ago

Hi @carolynzy @vinuesa, I have updated compare_clusters.pl; you can try it out with $ git pull. Let me know how that goes. Bruno

vinuesa commented 3 years ago

Hi Bruno, thank you @eead-csic-compbio very much! I'll check it out later today on the pIncC dataset and let you know. Cheers

eead-csic-compbio / get_homologues

Duplicate clusters #83