eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis

Core genome size abruptly dropping #81

Closed by fermza 2 years ago

fermza commented 3 years ago

We are having an issue in which the core genome analysis shows an abrupt decrease in core genome size as we increase the number of analyzed genomes. I understand there is a logic to this, but my results are confusing. Let me describe in more detail what I did:

Commands:

get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -D
get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -G -D -t 0 -c
get_homologues.pl -d IMG_genomes_38G-set1 -n 6 -M -D -t 0 -c

[Five images attached to the original issue; not preserved here.]

I did this with some different datasets, and the results are similar, with the core genome size dropping abruptly as I add more genomes.

So the question is: how is this possible? I understand the core genome will get smaller as we add more genomes to the analysis, but I find it hard to accept that the core genome of genomes from the same genus would be so small. I can think of four possibilities:

  • I'm doing something wrong (for example, is adding new genomes step by step a good idea?). Since I need to reach a total of ~150 genomes, I am worried I won't be able to do it computationally all at once, hence the step-wise approach.
  • I'm interpreting something incorrectly.
  • The genomes I'm using are not good.
  • There's a bug in the software.

So I'm asking for your input, in case you have any insights on how I should approach this issue to find possible solutions. Any guidance and help you can provide will be greatly appreciated!

Best, Fernando

brunocontrerasmoreira commented 3 years ago

Hi @fermza , in the first analysis with n=38 genomes, is the core you obtain a reasonable size? How many genes are annotated per genome?

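In your second analysis, after adding Halomonas genomes, did you run the new genomes with -D as well? It is perfectly fine to add genomes to a previously computed set (unless there's a new bug :-) ). Anyway, in order to rule out problems in the second analysis, you could:

  • check how many orthologues are printed to the log/stdout when pairs of genomes are compared by makeOrthologues
  • check the BLASTP outfiles in the _homologues/ folder
  • use the script checkBDBHs.pl to see the % identity among sequences of both species; not all genera are equally similar

Any more ideas @vinuesa ?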

Bruno

fermza commented 3 years ago

Hi Bruno, thanks a bunch for your quick reply. I did run the second analysis with -D as well, as I did with other similar tests I ran. It is difficult to assess whether the number of clusters in the core genome for the 38-genome analysis is reasonable. I originally thought it was too small, considering a publication we saw for 34 Comamonas genomes with a core genome size of ~1100 clusters (doi: 10.3389/fmicb.2018.03096). Of course, a direct comparison may not be that easy, or even correct (that's why I didn't mention it), but I certainly would have expected more clusters. Anyway, you've given me some quite interesting insights. I will look into them and get back with an update. Thanks again!

Best, Fernando

On Mon, Sep 6, 2021 at 1:02 PM brunocontrerasmoreira wrote:

Hi @fermza , in the first analysis with n=38 genomes, is the core you obtain a reasonable size? How many genes are annotated per genome?

In your second analysis, after adding Halomonas genomes, did you run the new genomes with -D as well? It is perfectly fine to add genomes to a previously computed set (unless there's a new bug :-) ). Anyway, in order to rule out problems in the second analysis, you could:

  • check how many orthologues are printed to the log/stdout when pairs of genomes are compared by makeOrthologues
  • check the BLASTP outfiles in the _homologues/ folder
  • use the script checkBDBHs.pl to see the % identity among sequences of both species; not all genera are equally similar

Any more ideas @vinuesa ?

Bruno


brunocontrerasmoreira commented 3 years ago

I agree it's difficult to assess, but one thing you can do is compute different cores from different genome subsets, to measure to what extent the core size is affected by incomplete or poorly annotated strains. The large soft-core you get may well be indicating that.

Bruno
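The subset idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the presence/absence data is a hand-made toy dictionary (cluster name to the set of genomes it occurs in), not the output of any GET_HOMOLOGUES script, and the function names `core_size` and `leave_one_out` are hypothetical. The sketch recomputes the strict core after dropping each genome in turn; a genome whose removal makes the core grow a lot is a candidate for being incomplete or poorly annotated.

```python
from typing import Dict, Iterable, Set


def core_size(presence: Dict[str, Set[str]],
              genomes: Iterable[str],
              min_fraction: float = 1.0) -> int:
    """Count clusters found in at least min_fraction of the given genomes.

    min_fraction=1.0 gives the strict core; a lower value (for example
    0.95) gives a soft-core style count.
    """
    subset = set(genomes)
    needed = min_fraction * len(subset)
    return sum(1 for members in presence.values()
               if len(members & subset) >= needed)


def leave_one_out(presence: Dict[str, Set[str]],
                  genomes: Iterable[str]) -> Dict[str, int]:
    """For each genome, report how much the strict core grows without it."""
    genomes = list(genomes)
    full = core_size(presence, genomes)
    return {g: core_size(presence, [x for x in genomes if x != g]) - full
            for g in genomes}


# Toy data (hypothetical): cluster -> genomes in which it was found.
toy_presence = {
    "c1": {"A", "B", "C", "D"},
    "c2": {"A", "B", "C"},
    "c3": {"A", "B", "C"},
    "c4": {"A", "B", "D"},
    "c5": {"A"},
}
toy_genomes = ["A", "B", "C", "D"]

if __name__ == "__main__":
    print("strict core:", core_size(toy_presence, toy_genomes))
    print("core gain when each genome is dropped:",
          leave_one_out(toy_presence, toy_genomes))
```

In the toy data, genome D is missing two clusters that every other genome has, so dropping it recovers the most core clusters. With real data the dictionary would have to be built from whatever presence/absence table is available.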

vinuesa commented 3 years ago

Hi Fernando, have you checked the sizes of the input genomes? If you have for example plasmids as single or separate GenBank files, they would cause the problem you report. Make sure that all replicons of a genome are contained in a single GenBank file.
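A quick way to run this check is to summarise each GenBank file before running the analysis. The sketch below is a rough illustration, not part of GET_HOMOLOGUES: it uses plain text matching on `LOCUS` lines and `CDS` feature lines rather than a full GenBank parser, the function names `summarize_genbank` and `flag_small` are hypothetical, and the 0.5 ratio threshold is an arbitrary assumption. A file holding only a plasmid would show up as one small record with very few CDS features compared to the other inputs.

```python
import re
from statistics import median
from typing import Dict, List


def summarize_genbank(text: str) -> Dict[str, int]:
    """Count records, declared total length (bp) and CDS features in GenBank text."""
    records = 0
    total_bp = 0
    cds = 0
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            records += 1
            # The LOCUS line declares the sequence length as "<number> bp".
            match = re.search(r"(\d+)\s+bp", line)
            if match:
                total_bp += int(match.group(1))
        elif re.match(r"^ {5}CDS\s", line):
            # Feature keys start after five spaces in the feature table.
            cds += 1
    return {"records": records, "total_bp": total_bp, "cds": cds}


def flag_small(summaries: Dict[str, Dict[str, int]],
               ratio: float = 0.5) -> List[str]:
    """Return names of inputs whose CDS count is below ratio * median CDS count."""
    typical = median(s["cds"] for s in summaries.values())
    return sorted(name for name, s in summaries.items()
                  if s["cds"] < ratio * typical)


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage (assumed): python check_inputs.py genome1.gbk genome2.gbk ...
    if len(sys.argv) > 1:
        results = {p: summarize_genbank(Path(p).read_text())
                   for p in sys.argv[1:]}
        for name, summary in results.items():
            print(name, summary)
        print("suspiciously small inputs:", flag_small(results))
```

Files flagged this way are worth opening by hand to see whether a replicon was saved separately from the rest of its genome.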