Closed TommyH-Tran closed 1 year ago
You are right, it seems that there genomes than cause a drop on the core size. You can investigate that by using get_homologues.pl -I, as explainedin the manual:
The genome composition analyses presented so far are actually random sampling experiments. It is thus worth mentioning that the user can control the order in which genomes are sampled during these simulations, by enforcing a list of genomes with option -I, already introduced in section 4.8, or by ... In the first case only one sampling is performed and therefore the standard deviation of the core and pan values is zero.
In other words, with -I the genomes are added in the order of the corresponding text file and you can easily see the genomes you are after. See here for an example include file.
Hope this helps, Bruno
Hmmm, okay I will try this out and let you know my results soon. Thanks for getting back to me.
Tom
Hi Bruno,
I re-ran it like this:
./get_homologues.pl -d "/Users/klemonlab/Desktop/THT_CorPGA/CorPGA_Cps/Cps_gbk_43" -c -n 8 -M -C 90 -I /Users/klemonlab/Downloads/cps_rarefaction_list.txt
Here were the results:
# file=Cps_gbk_43_homologues/core_genome_cps_rarefaction_list.txt_algOMCL_C90.tab
genomes mean stddev | samples
0 2142 0 | 2142
1 1640 0 | 1640
2 1584 0 | 1584
3 1559 0 | 1559
4 1549 0 | 1549
5 1540 0 | 1540
6 1534 0 | 1534
7 1524 0 | 1524
8 1520 0 | 1520
9 1518 0 | 1518
10 1514 0 | 1514
11 1505 0 | 1505
12 1504 0 | 1504
13 1499 0 | 1499
14 1496 0 | 1496
15 1493 0 | 1493
16 1484 0 | 1484
17 1482 0 | 1482
18 1474 0 | 1474
19 1467 0 | 1467
20 1459 0 | 1459
21 1451 0 | 1451
22 1450 0 | 1450
23 1438 0 | 1438
24 1434 0 | 1434
25 1424 0 | 1424
26 1421 0 | 1421
27 1415 0 | 1415
28 1414 0 | 1414
29 1408 0 | 1408
30 1406 0 | 1406
31 1403 0 | 1403
32 1401 0 | 1401
33 1399 0 | 1399
34 1398 0 | 1398
35 1397 0 | 1397
36 1396 0 | 1396
37 1393 0 | 1393
38 1393 0 | 1393
39 1392 0 | 1392
40 1387 0 | 1387
41 1385 0 | 1385
42 1376 0 | 1376
So, i know the order and the stddev became 0 as expected. However, I do not understand how I can identify which genome(s) are causing the splitting? Is it the second one cause it has the most drastic drop from 2142 to 1640? Or do I need to plot? Thanks
Tom
The numbers 0 to 42 correspond to the genomes in the file /Users/klemonlab/Downloads/cps_rarefaction_list.txt
The second genome in that file is causing the drop in core genes apparently
I will take that genome out and re-run it and see if the plot still splits or not, will keep you up updated..
Looks like it definitely helped, as you can see i took it out and now it's 42 genomes total instead of 43. There still seems to be another one with a slightly higher estimated core genome size. I tried the same thing a second time but couldn't identify the other one. But for now that seems good enough.
Taking the second genome out on the txt file, total genomes (42):
Taking another genome out thinking it is the outlier, total genomes (41):
Tom
Do you think there is another way to identify the other outlier? Here was the results when I was trying to identify "that" second outlier. I removed the second one on the list cause it seemed to have the biggest drop. But seems like it wasn't that genome from the bottom image.
# core-genome (number of genes, can be plotted with plot_pancore_matrix.pl)
# file=Cps_gbk_42_rarefaction_homologues/core_genome_cps_42_rarefaction.txt_algOMCL_C90.tab
genomes mean stddev | samples
0 2088 0 | 2088
1 1801 0 | 1801
2 1767 0 | 1767
3 1753 0 | 1753
4 1742 0 | 1742
5 1727 0 | 1727
6 1723 0 | 1723
7 1721 0 | 1721
8 1712 0 | 1712
9 1707 0 | 1707
10 1699 0 | 1699
11 1698 0 | 1698
12 1693 0 | 1693
13 1689 0 | 1689
14 1683 0 | 1683
15 1672 0 | 1672
16 1670 0 | 1670
17 1658 0 | 1658
18 1649 0 | 1649
19 1640 0 | 1640
20 1632 0 | 1632
21 1630 0 | 1630
22 1617 0 | 1617
23 1612 0 | 1612
24 1601 0 | 1601
25 1598 0 | 1598
26 1592 0 | 1592
27 1591 0 | 1591
28 1585 0 | 1585
29 1583 0 | 1583
30 1579 0 | 1579
31 1575 0 | 1575
32 1573 0 | 1573
33 1571 0 | 1571
34 1570 0 | 1570
35 1569 0 | 1569
36 1565 0 | 1565
37 1565 0 | 1565
38 1564 0 | 1564
39 1558 0 | 1558
40 1556 0 | 1556
41 1546 0 | 1546
Tom
Hi @TommyH-Tran , it's not obvious to me whether there's a remaining outlier. With something like this you can quickly check how much genes are lost after adding a new genomes:
perl -lane 'print $F[1]-$prev; $prev=$F[1]' Cps_gbk_42_rarefaction_homologues/core_genome_cps_42_rarefaction.txt_algOMCL_C90.tab
It seems the largest drop is after adding the second genome, but that's to be expected I guess.
Please also note that outliers can actually be relevant and correctly annotated genomes in some cases, I guess the only reason you could leave them out would be assemblies and/or annotation of low quality,
Bruno
Is there any type of table file get_homologues outputs of presence and absence of core and accessory genes across the genomes?
Tom
Not sure I understand your question, but you might want to check file pangenome_matrix_genes_t0.tr.tab , produced by compare_clusters.pl. It looks like this:
cut -f 1-5 pangenome_matrix_genes_t0.tr.tab | head
source:sample_intersection E_coli_ST131_plasmid_pKC394.gb E_coli_plasmid_pMUR050.gb IncN_plasmid_R46.gb K_oxytoca_plasmid_pKOX105.gb
7_transposase_A.faa ID:ADH29967.1, - - -
8_tnpA.faa ID:ADH30013.1,ID:ADH30014.1, - - -
9_mphA.faa ID:ADH29968.1, - - -
10_mrx.faa ID:ADH29969.1, - - -
11_blaCTX-M-1.faa ID:ADH29970.1, - - -
13_hypothetical_protein.faa ID:ADH29971.1, - - -
34_methyl-accepting_che...faa ID:ADH30015.1, - - -
35_tnpA.faa ID:ADH29992.1, ID:AAT01860.1, - -
36_beta-lactamase_CTX-M...faa ID:ADH29993.1, - - ID:ADH29557.1,
There you can see exactly which genes are included in each cluster/row. With one liners it is possible to select core or soft-core rows for instance.
Hello,
I am referring to section 4.9.4 estimation of the core/pan-genome size by sampling genomes. I created the estimated core genome size graph for 4 species. However, there appears to be core genome size splitting and I wanted to know if we are able to identify the genome(s) and the size prediction they are getting when being sampled.
Tom