Core genome size abruptly dropping

fermza commented 3 years ago

We are seeing an abrupt decrease in core genome size as we increase the number of analyzed genomes. I understand that some decrease is expected, but my results are confusing. Let me describe in more detail what I did:

Retrieved multiple (> 150) genomes from Halomonas genus (bacteria). We selected genomes within a range of sizes and number of annotated genes, as a measure of correctly annotated genomes (always comparing intra Halomonas genus).
Used a subset of 38 genomes of "good" quality, i.e. annotated as "Finished" in the IMG database.
Ran GET_HOMOLOGUES with BDBH, Pfam, COG and OMCL, and ran the core and pangenome analysis. Very quickly, this yielded:

Commands:

get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -D
get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -G -D -t 0 -c
get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -M -D -t 0 -c

Note that there are ~250 clusters in the core genome and ~1000 clusters in the soft core genome.
Next, the idea was to start adding genomes to the analysis, so we added 14 Halomonas genomes. As we understand it, new genomes can be added without re-running everything for the original set of 38 genomes (GET_HOMOLOGUES detects what has already been computed and only runs the "new" comparisons). So we added these 14 genomes to the folder and repeated all the analyses. Keep in mind that we used the same reference genome, which is also the smallest in the dataset. Once that was done, we repeated the core and pangenome analysis, and here is the issue:

The statistical analysis agrees with this observation (the core size seems to tend to 0 very quickly):

I did this with some different datasets, and the results are similar, with the core genome size dropping abruptly as I add more genomes.

So the question is, how is this possible? I understand the core genome gets smaller as we add more genomes to the analysis, but I find it hard to accept that the core genome of genomes from the same genus would be so small! So, what I think is:
1) I'm doing something wrong (for example, is adding new genomes a good idea?). Since I need to reach a total of ~150 genomes, I am worried I won't be able to run them all at once computationally, hence the step-wise approach.
2) I'm interpreting something incorrectly.
3) The genomes I'm using are not good.
4) There's a bug in the software.

So, I'm asking for your input to see if you have any insights on how I should approach this issue to find possible solutions. Any guidance and help you can provide will be greatly appreciated! Best, Fernando

brunocontrerasmoreira commented 3 years ago

Hi @fermza , in the first analysis with n=38 genomes, is the core you obtain a reasonable size? How many genes are annotated per genome?

In your second analysis, after adding Halomonas genomes, did you run the new genomes with -D as well? It is perfectly fine to add genomes to a previously computed set (unless there's a new bug :-). Anyway, in order to rule out problems in the second analysis, you could:

check how many orthologues are printed to the log/stdout when pairs of genomes are compared by makeOrthologues
check the BLASTP outfiles in the _homologues/ folder
use the script checkBDBHs.pl to see the % identity among sequences of both species; not all genera are equally similar

Any more ideas @vinuesa ?

Bruno

fermza commented 3 years ago

Hi Bruno, thanks a bunch for your quick reply. I did run the second analysis with -D as well, as I did with other similar tests I ran. It is difficult to assess whether the number of clusters in the core genome for the 38-genome analysis is reasonable. I originally thought it was too small, considering a publication we saw for 34 Comamonas genomes, with a core genome size of ~1100 clusters (doi: 10.3389/fmicb.2018.03096). Of course a direct comparison may not be that easy, or maybe even correct (that's why I didn't mention it), but I certainly would've expected more clusters. Anyway, you've given me some quite interesting insights. I will take a look into that and get back with an update. Thanks again!

Best, Fernando


brunocontrerasmoreira commented 3 years ago

I agree it's difficult to assess, but one thing you can do is compute different cores from different genome subsets to measure to what extent the core size is affected by incomplete or poorly annotated strains. The large soft-core you get might be indicating exactly that. Bruno
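One simple form of the subset comparison suggested above is a leave-one-out scan: recompute the core with each genome dropped in turn and see which omission makes the core grow the most. The Python sketch below is only an illustration under stated assumptions. It works on a hypothetical presence/absence mapping (cluster name to the set of genomes containing it) with toy data; how that mapping is built from the actual cluster output is left open, and the function names are made up for this example.

```python
def core_size(matrix, genomes, min_fraction=1.0):
    """Count clusters present in at least min_fraction of the given genomes.

    min_fraction=1.0 gives the strict core; a lower value gives a relaxed,
    soft-core-like count (the threshold here is an assumption of this sketch).
    """
    needed = min_fraction * len(genomes)
    return sum(
        1 for present_in in matrix.values()
        if len(present_in & genomes) >= needed
    )


def leave_one_out(matrix, genomes):
    """Strict core size after dropping each genome in turn, largest first."""
    sizes = {g: core_size(matrix, genomes - {g}) for g in genomes}
    # Sort by descending core size, then by genome name for a stable order.
    return sorted(sizes.items(), key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical): cluster name -> set of genomes containing it.
matrix = {
    "c1": {"A", "B", "C", "D"},
    "c2": {"A", "B", "C"},
    "c3": {"A", "B", "C"},
    "c4": {"A", "B", "D"},
}
genomes = {"A", "B", "C", "D"}

print("strict core:", core_size(matrix, genomes))
for genome, size in leave_one_out(matrix, genomes):
    print(f"without {genome}: core = {size}")
```

In this toy set, genome D lacks two clusters that every other genome has, so removing it gives the largest core. A genome that stands out this way in real data would be a candidate for an incomplete or poorly annotated strain.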

vinuesa commented 3 years ago

Hi Fernando, have you checked the sizes of the input genomes? If you have, for example, plasmids as single or separate GenBank files, they would cause the problem you report. Make sure that all replicons of a genome are contained in a single GenBank file.
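A quick way to run this size check is to count the LOCUS records in each GenBank file and add up their lengths; a file holding only a plasmid will be far smaller than the rest. The Python sketch below is a rough illustration, not part of GET_HOMOLOGUES. It assumes standard LOCUS header lines, and the `*.gbk` file pattern and the 50%-of-median threshold are arbitrary choices for this example.

```python
import re
from pathlib import Path

# Matches standard GenBank header lines such as:
# LOCUS       NAME   4000000 bp    DNA     circular BCT 01-JAN-2020
LOCUS_RE = re.compile(r"^LOCUS\s+\S+\s+(\d+)\s+bp", re.MULTILINE)


def replicon_lengths(genbank_text):
    """Return the sequence length of every LOCUS record found in the text."""
    return [int(n) for n in LOCUS_RE.findall(genbank_text)]


def summarize(folder, pattern="*.gbk"):
    """Map each GenBank file name to (number of records, total bp)."""
    summary = {}
    for path in sorted(Path(folder).glob(pattern)):
        lengths = replicon_lengths(path.read_text(errors="replace"))
        summary[path.name] = (len(lengths), sum(lengths))
    return summary


def suspicious(summary, min_fraction=0.5):
    """Names of files whose total size is below min_fraction of the median."""
    totals = sorted(total for _, total in summary.values())
    if not totals:
        return []
    median = totals[len(totals) // 2]
    return [
        name for name, (_, total) in summary.items()
        if total < min_fraction * median
    ]


# Small inline demo with a made-up two-replicon file.
sample = (
    "LOCUS       CHR1   4000000 bp    DNA     circular BCT 01-JAN-2020\n"
    "ORIGIN\n//\n"
    "LOCUS       PLSM1   150000 bp    DNA     circular BCT 01-JAN-2020\n"
    "ORIGIN\n//\n"
)
print(replicon_lengths(sample))
```

To apply it to a real folder, something like `suspicious(summarize("IMG_genomes_38G-set1"))` would list the files worth inspecting by hand, provided the inputs match the assumed file pattern.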

eead-csic-compbio / get_homologues

Core genome size abruptly dropping #81