eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Unable to create venn intersection BDBH, COGS, OMCL #108

Closed TommyH-Tran closed 1 year ago

TommyH-Tran commented 1 year ago

I cannot create an intersection using ./compare_clusters.pl. I have checked inside the folders, and the .faa and .fna gene cluster files are present. The output then says to review the duplicated.cluster_list file, but that file is completely blank.

Here is the output:


# ./compare_clusters.pl -d /Users/klemonlab/Downloads/LG180_f0_19taxa_algBDBH_e0_C90_,/Users/klemonlab/Downloads/LG180_f0_0taxa_algCOG_e0_C90_,/Users/klemonlab/Downloads/LG180_f0_0taxa_algOMCL_e0_C90_ -o /Users/klemonlab/Desktop/THT/THT_NACH/intersection_19 -n 0 -m 0 -t 19 -I  -r 0 -s 0 -x 0 -T 0

# output directory: /Users/klemonlab/Desktop/THT/THT_NACH/intersection_19
# WARNING: output directory /Users/klemonlab/Desktop/THT/THT_NACH/intersection_19 already exists, note that you might be mixing clusters from previous runs

# number of input cluster directories = 3

# parsing clusters in /Users/klemonlab/Downloads/LG180_f0_19taxa_algBDBH_e0_C90_ ...
# cluster_list in place, will parse it (/Users/klemonlab/Downloads/LG180_f0_19taxa_algBDBH_e0_C90_.cluster_list)
# number of clusters = 270 duplicated = 0
# parsing clusters in /Users/klemonlab/Downloads/LG180_f0_0taxa_algCOG_e0_C90_ ...
# cluster_list in place, will parse it (/Users/klemonlab/Downloads/LG180_f0_0taxa_algCOG_e0_C90_.cluster_list)
# number of clusters = 0 duplicated = 0
# parsing clusters in /Users/klemonlab/Downloads/LG180_f0_0taxa_algOMCL_e0_C90_ ...
# cluster_list in place, will parse it (/Users/klemonlab/Downloads/LG180_f0_0taxa_algOMCL_e0_C90_.cluster_list)
# number of clusters = 0 duplicated = 0

# duplicated list: /Users/klemonlab/Desktop/THT/THT_NACH/intersection_19/duplicated.cluster_list (please review)

# intersection size = 0 clusters

# ERROR: cannot proceed with null intersection

eead-csic-compbio commented 1 year ago

Hi @TommyH-Tran , can you please share the LG180_f0_0taxa_algCOG_e0_C90_.cluster_list and LG180_f0_0taxa_algOMCL_e0_C90_.cluster_list files? It would also help to see the output of the ls command at LG180_f0_0taxa_algCOG_e0_C90_ and LG180_f0_0taxa_algOMCL_e0_C90_ respectively, thanks, Bruno

TommyH-Tran commented 1 year ago

I ran cd into both of those directories and then ls, and the gene cluster files show up in the terminal.

[Screenshots: ScreenShot 2023-06-12 at 03 33 21, ScreenShot 2023-06-12 at 03 33 49]

TommyH-Tran commented 1 year ago

Here are the cluster_list files: LG180_f0_0taxa_algCOG_e0C90.cluster_list.zip LG180_f0_0taxa_algOMCL_e0C90.cluster_list.zip LG180_f0_19taxa_algBDBH_e0C90.cluster_list.zip

eead-csic-compbio commented 1 year ago

Sorry @TommyH-Tran , I cannot see what's wrong so far. Would it be possible for you to send me the folders LG180_f0_0taxa_algCOG_e0_C90_ , LG180_f0_0taxa_algOMCL_e0_C90_ and LG180_f0_19taxa_algBDBH_e0_C90_ compressed? Thanks, Bruno

TommyH-Tran commented 1 year ago

Here are the folders: LG180_f0_0taxa_algCOG_e0C90.zip LG180_f0_0taxa_algOMCL_e0C90.zip LG180_f0_19taxa_algBDBH_e0C90.zip

eead-csic-compbio commented 1 year ago

Thanks @TommyH-Tran , it seems the reason for this output is the optional argument -t 19, which requires clusters in the intersection to contain exactly 19 sequences (single-copy) from 19 taxa. As you can see below, there are none among the OMCL / COG clusters:

grep -c "size=19 taxa=19" LG180_f0_19taxa_algBDBH_e0_C90_.cluster_list 
270
grep -c "size=19 taxa=19" LG180_f0_0taxa_algCOG_e0_C90_.cluster_list
0
grep -c "size=19 taxa=19" LG180_f0_0taxa_algOMCL_e0_C90_.cluster_list 
0

If you remove it, the resulting intersection contains 144 clusters. Hope this helps, Bruno
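
The three grep checks above can be wrapped in a small loop so that every cluster_list file is tested in one go. This is only a sketch: the function name count_single_copy is made up for illustration, and it assumes that cluster header lines contain the text "size=N taxa=N", exactly as matched by the grep commands above.

```shell
#!/bin/sh
# Hypothetical helper: for each cluster_list file, count the clusters that hold
# exactly N sequences from N taxa (candidate single-copy clusters).
# Assumption: cluster header lines contain "size=N taxa=N".
count_single_copy() {
    n="$1"; shift
    for list in "$@"; do
        # grep -c prints 0 and returns a non-zero status when nothing matches,
        # so "|| true" keeps the loop going for empty results
        count=$(grep -c "size=$n taxa=$n" "$list" || true)
        printf '%s\t%s\n' "$list" "$count"
    done
}

# Example call (paths as in this thread):
# count_single_copy 19 LG180_f0_*taxa_alg*_e0_C90_.cluster_list
```

A count of 0 for any of the files means that requiring -t 19 in compare_clusters.pl will necessarily give an empty intersection.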

TommyH-Tran commented 1 year ago

Yes, I want the intersection of single-copy clusters from 19 taxa from each of the three algorithms. How is it possible that in the pangenome runs (-t 0) no single-copy clusters are identified among the COG and OMCL clusters? This is how I have usually done it and it has worked.

Are the 144 clusters the single-copy clusters shared among all three? Or will it be fixed if I run COG and OMCL with the -t 19 option and then try to create the intersection?

eead-csic-compbio commented 1 year ago

Hi @TommyH-Tran , 144 are shared clusters, not single-copy. Options I can see:

TommyH-Tran commented 1 year ago

Could I not run COG and OMCL like this to get single-copy clusters and then run ./compare_clusters.pl? Or are you suggesting that I instead add -S and -e on top of those two runs?

./get_homologues.pl -d "/Users/klemonlab/Desktop/THT/THT_NACH/NACH_gbk_19" -n 8 -t 19 -C 90 -G

./get_homologues.pl -d "/Users/klemonlab/Desktop/THT/THT_NACH/NACH_gbk_19" -n 8 -t 19 -C 90 -M

eead-csic-compbio commented 1 year ago

This is what I suggest, if 90% identity is reasonable for your analysis:

./get_homologues.pl -d "/Users/klemonlab/Desktop/THT/THT_NACH/NACH_gbk_19" -n 8 -t 19 -C 90 -G -S 90 -e

./get_homologues.pl -d "/Users/klemonlab/Desktop/THT/THT_NACH/NACH_gbk_19" -n 8 -t 19 -C 90 -M -S 90 -e

TommyH-Tran commented 1 year ago

I tried using what you suggested and still get 0 clusters. Is 90% identity too rigid?

eead-csic-compbio commented 1 year ago

It's probably too lenient. If you want single-copy clusters you need to separate the divergent copies, so you should increase it even more to see if that works, Bruno

TommyH-Tran commented 1 year ago

Sorry for the delay. I set the limit to 99 and it still gave me back 0 clusters using the flags you suggested:

./get_homologues.pl -d "/Users/klemonlab/Desktop/THT/THT_NACH/NACH_gbk_20" -n 8 -t 20 -C 99 -G -S 99 -e

brunocontrerasmoreira commented 1 year ago

At this point I guess you should inspect the pangenome matrix to see whether it is always the same genome that has values > 1 in all clusters, or whether all genomes behave like that.
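
One possible way to do that inspection from the command line is sketched below. It assumes a tab-separated matrix, such as the pangenome_matrix_t0.tab that compare_clusters.pl should write with -m, in which the first row holds cluster names and each following row starts with a genome name followed by per-cluster gene counts. The function name multicopy_per_genome is hypothetical, and the layout should be checked against your own file before trusting the numbers.

```shell
#!/bin/sh
# Hypothetical helper: for each genome (row) of a pangenome matrix, count how
# many clusters contain more than one copy from that genome.
# Assumed layout (tab-separated): header row of cluster names, then one row per
# genome with the genome name in column 1 and gene counts in the other columns.
multicopy_per_genome() {
    awk -F '\t' 'NR > 1 {
        n = 0
        for (i = 2; i <= NF; i++) if ($i + 0 > 1) n++
        printf "%s\t%d\n", $1, n
    }' "$1"
}

# Example call:
# multicopy_per_genome pangenome_matrix_t0.tab | sort -t "$(printf '\t')" -k2,2nr
```

If one genome shows a much higher count than the rest, it is the likely reason why no cluster qualifies as single-copy; if all genomes have similar counts, the multi-copy pattern is general.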