eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis
Other
110 stars 26 forks source link

Question: Is there a way to identify which genomes correspond to core-genome sample numbers? #107

Closed TommyH-Tran closed 1 year ago

TommyH-Tran commented 1 year ago

Hello,

I am referring to section 4.9.4 estimation of the core/pan-genome size by sampling genomes. I created the estimated core genome size graph for 4 species. However, there appears to be core genome size splitting and I wanted to know if we are able to identify the genome(s) and the size prediction they are getting when being sampled. core_split

Tom

eead-csic-compbio commented 1 year ago

You are right, it seems that there genomes than cause a drop on the core size. You can investigate that by using get_homologues.pl -I, as explainedin the manual:

The genome composition analyses presented so far are actually random sampling experiments. It is thus worth mentioning that the user can control the order in which genomes are sampled during these simulations, by enforcing a list of genomes with option -I, already introduced in section 4.8, or by ... In the first case only one sampling is performed and therefore the standard deviation of the core and pan values is zero.

In other words, with -I the genomes are added in the order of the corresponding text file and you can easily see the genomes you are after. See here for an example include file.

Hope this helps, Bruno

TommyH-Tran commented 1 year ago

Hmmm, okay I will try this out and let you know my results soon. Thanks for getting back to me.

Tom

TommyH-Tran commented 1 year ago

Hi Bruno,

I re-ran it like this: ./get_homologues.pl -d "/Users/klemonlab/Desktop/THT_CorPGA/CorPGA_Cps/Cps_gbk_43" -c -n 8 -M -C 90 -I /Users/klemonlab/Downloads/cps_rarefaction_list.txt

Here were the results:

# file=Cps_gbk_43_homologues/core_genome_cps_rarefaction_list.txt_algOMCL_C90.tab
genomes mean    stddev  |   samples
0   2142    0   |   2142    
1   1640    0   |   1640    
2   1584    0   |   1584    
3   1559    0   |   1559    
4   1549    0   |   1549    
5   1540    0   |   1540    
6   1534    0   |   1534    
7   1524    0   |   1524    
8   1520    0   |   1520    
9   1518    0   |   1518    
10  1514    0   |   1514    
11  1505    0   |   1505    
12  1504    0   |   1504    
13  1499    0   |   1499    
14  1496    0   |   1496    
15  1493    0   |   1493    
16  1484    0   |   1484    
17  1482    0   |   1482    
18  1474    0   |   1474    
19  1467    0   |   1467    
20  1459    0   |   1459    
21  1451    0   |   1451    
22  1450    0   |   1450    
23  1438    0   |   1438    
24  1434    0   |   1434    
25  1424    0   |   1424    
26  1421    0   |   1421    
27  1415    0   |   1415    
28  1414    0   |   1414    
29  1408    0   |   1408    
30  1406    0   |   1406    
31  1403    0   |   1403    
32  1401    0   |   1401    
33  1399    0   |   1399    
34  1398    0   |   1398    
35  1397    0   |   1397    
36  1396    0   |   1396    
37  1393    0   |   1393    
38  1393    0   |   1393    
39  1392    0   |   1392    
40  1387    0   |   1387    
41  1385    0   |   1385    
42  1376    0   |   1376    

So, i know the order and the stddev became 0 as expected. However, I do not understand how I can identify which genome(s) are causing the splitting? Is it the second one cause it has the most drastic drop from 2142 to 1640? Or do I need to plot? Thanks

Tom

eead-csic-compbio commented 1 year ago

The numbers 0 to 42 correspond to the genomes in the file /Users/klemonlab/Downloads/cps_rarefaction_list.txt

The second genome in that file is causing the drop in core genes apparently

TommyH-Tran commented 1 year ago

I will take that genome out and re-run it and see if the plot still splits or not, will keep you up updated..

TommyH-Tran commented 1 year ago

Looks like it definitely helped, as you can see i took it out and now it's 42 genomes total instead of 43. There still seems to be another one with a slightly higher estimated core genome size. I tried the same thing a second time but couldn't identify the other one. But for now that seems good enough.

Taking the second genome out on the txt file, total genomes (42): core_genome_algOMCL_C90 tab_core_both

Taking another genome out thinking it is the outlier, total genomes (41): core_genome_algOMCL_C90 tab_core_both

Tom

TommyH-Tran commented 1 year ago

Do you think there is another way to identify the other outlier? Here was the results when I was trying to identify "that" second outlier. I removed the second one on the list cause it seemed to have the biggest drop. But seems like it wasn't that genome from the bottom image.

# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=Cps_gbk_42_rarefaction_homologues/core_genome_cps_42_rarefaction.txt_algOMCL_C90.tab
genomes mean    stddev  |   samples
0   2088    0   |   2088    
1   1801    0   |   1801    
2   1767    0   |   1767    
3   1753    0   |   1753    
4   1742    0   |   1742    
5   1727    0   |   1727    
6   1723    0   |   1723    
7   1721    0   |   1721    
8   1712    0   |   1712    
9   1707    0   |   1707    
10  1699    0   |   1699    
11  1698    0   |   1698    
12  1693    0   |   1693    
13  1689    0   |   1689    
14  1683    0   |   1683    
15  1672    0   |   1672    
16  1670    0   |   1670    
17  1658    0   |   1658    
18  1649    0   |   1649    
19  1640    0   |   1640    
20  1632    0   |   1632    
21  1630    0   |   1630    
22  1617    0   |   1617    
23  1612    0   |   1612    
24  1601    0   |   1601    
25  1598    0   |   1598    
26  1592    0   |   1592    
27  1591    0   |   1591    
28  1585    0   |   1585    
29  1583    0   |   1583    
30  1579    0   |   1579    
31  1575    0   |   1575    
32  1573    0   |   1573    
33  1571    0   |   1571    
34  1570    0   |   1570    
35  1569    0   |   1569    
36  1565    0   |   1565    
37  1565    0   |   1565    
38  1564    0   |   1564    
39  1558    0   |   1558    
40  1556    0   |   1556    
41  1546    0   |   1546    

Tom

eead-csic-compbio commented 1 year ago

Hi @TommyH-Tran , it's not obvious to me whether there's a remaining outlier. With something like this you can quickly check how much genes are lost after adding a new genomes:

perl -lane 'print $F[1]-$prev; $prev=$F[1]' Cps_gbk_42_rarefaction_homologues/core_genome_cps_42_rarefaction.txt_algOMCL_C90.tab

It seems the largest drop is after adding the second genome, but that's to be expected I guess.
Please also note that outliers can actually be relevant and correctly annotated genomes in some cases, I guess the only reason you could leave them out would be assemblies and/or annotation of low quality, Bruno

TommyH-Tran commented 1 year ago

Is there any type of table file get_homologues outputs of presence and absence of core and accessory genes across the genomes?

Tom

eead-csic-compbio commented 1 year ago

Not sure I understand your question, but you might want to check file pangenome_matrix_genes_t0.tr.tab , produced by compare_clusters.pl. It looks like this:

cut -f 1-5 pangenome_matrix_genes_t0.tr.tab | head
source:sample_intersection  E_coli_ST131_plasmid_pKC394.gb  E_coli_plasmid_pMUR050.gb   IncN_plasmid_R46.gb K_oxytoca_plasmid_pKOX105.gb
7_transposase_A.faa ID:ADH29967.1,  -   -   -
8_tnpA.faa  ID:ADH30013.1,ID:ADH30014.1,    -   -   -
9_mphA.faa  ID:ADH29968.1,  -   -   -
10_mrx.faa  ID:ADH29969.1,  -   -   -
11_blaCTX-M-1.faa   ID:ADH29970.1,  -   -   -
13_hypothetical_protein.faa ID:ADH29971.1,  -   -   -
34_methyl-accepting_che...faa   ID:ADH30015.1,  -   -   -
35_tnpA.faa ID:ADH29992.1,  ID:AAT01860.1,  -   -
36_beta-lactamase_CTX-M...faa   ID:ADH29993.1,  -   -   ID:ADH29557.1,

There you can see exactly which genes are included in each cluster/row. With one liners it is possible to select core or soft-core rows for instance.