eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Different results when annotating a large protein set at once vs splitting the input into multiple files #444

Open cschu opened 1 year ago

cschu commented 1 year ago

Hi,

I use eggnog-mapper to annotate large metagenomic datasets, e.g. the GMGC (GMGC10.95nr.faa.gz, over 300 million sequences).

In a recent experiment, I found that the annotation results differ when running emapper against the whole set (which has an 18-day runtime and has incurred the wrath of many a cluster admin, but I digress...) versus when I split the input into about 15,000 batches of roughly 20,000 sequences each.
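
(For reference, batches like these can be produced from the gzipped FASTA with plain awk -- a minimal sketch, any FASTA splitter will do:)

zcat GMGC10.95nr.faa.gz | awk '
  /^>/ { if (n % 20000 == 0) { if (f) close(f); f = sprintf("batch_%05d.faa", n / 20000) }
         n++ }
  { print > f }'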

In the following, I'll refer to the full-set-at-once annotation as "full" and the batch annotation as "batch":

The emapper version is 2.1.7; I am not sure about the database version (is there a way to find that out? I remember earlier versions wrote the (wrong) database version into the output).

The command I run is

emapper.py -i <protein.faa> --data_dir </path/to/eggnog_db> --output <prefix> -m diamond --cpu {64: full, 8: batch} --dbmem --temp_dir </path/to/tmp-dir>
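
i.e., spelled out, the two variants look like this (paths and batch file names are placeholders):

# "full": one run over the entire input
emapper.py -i GMGC10.95nr.faa --data_dir /path/to/eggnog_db --output full \
    -m diamond --cpu 64 --dbmem --temp_dir /path/to/tmp-dir

# "batch": one such run per ~20,000-sequence chunk
emapper.py -i batch_00000.faa --data_dir /path/to/eggnog_db --output batch_00000 \
    -m diamond --cpu 8 --dbmem --temp_dir /path/to/tmp-dir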

Results:

*) When comparing the annotation lines for the same gene in the two .annotations files, these differ in various columns, e.g. (full first, batch second):

GMGC10.033_471_658.UNKNOWN      1120971.AUCA01000049_gene802    1.09e-07        49.7    COG2002@1|root,COG2002@2|Bacteria       2|Bacteria      K       toxin-antitoxin pair type II binding    -       -       -       ko:K06284       -       -       -       -       ko00000>
GMGC10.033_471_658.UNKNOWN      1163671.JAGI01000002_gene3538   2.18e-07        50.8    COG2002@1|root,COG2002@2|Bacteria,1V02A@1239|Firmicutes,248XR@186801|Clostridia,36E8G@31979|Clostridiaceae      186801|Clostridia       K       stage V sporulation protein T   spoVT  >

It looks to me as if about 54% of those differently annotated genes nevertheless have an identical description (column 8) -- based on a quick, naive string comparison of column 8.
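
(A minimal sketch of that comparison, taking "differently annotated" to mean a different seed ortholog (column 2); full.annotations and batch.annotations are placeholder file names:)

# join the two .annotations files on the query name, keep seed ortholog and
# description, and count differing genes whose descriptions still agree
join -t $'\t' \
  <(grep -v '^#' full.annotations  | cut -f1,2,8 | sort -k1,1) \
  <(grep -v '^#' batch.annotations | cut -f1,2,8 | sort -k1,1) \
| awk -F'\t' '$2 != $4 { total++; if ($3 == $5) same++ }
              END { printf "%d/%d differing genes with identical description\n", same, total }'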

I am a bit surprised by these results, but maybe this is expected? If input size made a difference via some kind of e-value cutoff (as you would see in e.g. BLAST when using databases of different sizes), then I'd expect to see either more or fewer hits in full vs. batch, but not differences in both directions. But since -- I assume -- emapper checks the input sequences against the eggNOG database, I would not expect the input size to make any difference when the database stays the same. Is there something I haven't thought of?

If this behaviour is expected and splitting the input causes different annotation results, what would be the recommendation for annotating large input sets?

Thanks, Christian

Cantalapiedra commented 1 year ago

Hi @cschu ,

Do you also find these different results in the ".seed_orthologs" files?

Best, Carlos

cschu commented 1 year ago

Hi @Cantalapiedra,

Yes, e.g.

GMGC10.033_471_658.UNKNOWN
full: 1120971.AUCA01000049_gene802  1.09e-07    49.7    1   33  3   35  63.6    100.0   40.2
batch: 1163671.JAGI01000002_gene3538    2.18e-07    50.8    1   33  1   33  69.7    100.0   18.1

In the case of the annotations that are present in both outputs but differ, they also differ in the seed orthologs. There are also a number of cases (10,726) with the same seed ortholog but different e-values and other columns, e.g.

GMGC10.000_033_222.UVRD
full: 1120988.AXWV01000032_gene1875 1.6e-195    589.0   1   794 1   932 39.7    97.1    98.1
batch: 1120988.AXWV01000032_gene1875    1.13e-175   538.0   1   632 1   677 43.9    77.3    71.3

Concerning the annotations that are present in only one of the two sets, I generally do not find those in the seed orthologs of the other set: only 56 of the batch set are found in the full-set seed orthologs, and only 19 of the full set are found in the batch-set seed orthologs, e.g.

GMGC10.066_171_355.UNKNOWN (no final annotation in the batch result)
in full: 1283076.M1ICX5_9CAUD   3.69e-27    103.0   2   119 4   121 42.4    99.2    97.5
in batch: 382359.Q0QZ36_BPSYS   1.26e-27    105.0   2   119 4   120 45.8    99.2    97.5
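
(For reference, a sketch of these comparisons, assuming the standard 11-column .seed_orthologs layout (query, seed, e-value, score, ...) and full.seed_orthologs / batch.seed_orthologs as placeholder names:)

# queries present in the full run but missing from the batch run
comm -23 <(grep -v '^#' full.seed_orthologs  | cut -f1 | sort -u) \
         <(grep -v '^#' batch.seed_orthologs | cut -f1 | sort -u) \
  > full_only_queries.txt

# same seed ortholog in both runs (fields 2 and 12 after the join),
# but different e-values (fields 3 and 13)
join -t $'\t' \
  <(grep -v '^#' full.seed_orthologs  | sort -k1,1) \
  <(grep -v '^#' batch.seed_orthologs | sort -k1,1) \
| awk -F'\t' '$2 == $12 && $3 != $13' > same_seed_diff_evalue.tsv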

Best, Christian

Cantalapiedra commented 1 year ago

Hi @cschu ,

Thank you for your answer. Then, if I understand correctly, this is an issue of diamond not guaranteeing to find the best alignment and best hit? It also seems not to always extend the alignment properly.

cschu commented 1 year ago

Hi @Cantalapiedra,

Thanks for checking this. Is there any plan to address this issue?

If the diamond results are not reliable in terms of reproducibility, this directly translates to the results of eggnog-mapper -- at least for large datasets. Of course, not everyone will annotate datasets as large as the GMGC, so this may be a limited use case. Nevertheless, I also see a problem in the set differences going in both directions:

If only one of the input sets displayed the current behaviour (i.e. contained gene annotations that are not present in the other set), then at least a recommendation could be made to use only full or only batch sets. In the current state, using the diamond option appears too risky in terms of completeness. Maybe a warning in the docs would be appropriate?

I guess I will check how emapper behaves using HMMER.

Best, Christian

Cantalapiedra commented 1 year ago

Hi @cschu ,

I guess we should maybe ask the Diamond authors for their opinion, and to confirm whether this is actually an issue with Diamond.

Besides Diamond and HMMER, you also have the option to use MMseqs2 if you wish. I do not know whether it has the same or a similar issue as Diamond, though.
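
(For reference, the search stage is selected with -m; a sketch, noting that hmmer mode additionally needs a target database chosen via -d, e.g. a taxon name:)

# MMseqs2 instead of diamond for the search stage
emapper.py -i proteins.faa -m mmseqs --data_dir /path/to/eggnog_db --output with_mmseqs

# HMMER mode, searching e.g. the Bacteria database
emapper.py -i proteins.faa -m hmmer -d Bacteria --data_dir /path/to/eggnog_db --output with_hmmer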

Best, Carlos

cschu commented 1 year ago

Hi @Cantalapiedra ,

@fullama has pointed out that they had a similar issue when developing gunc (https://github.com/grp-bork/gunc), and that it is likely to do with the --algo option, which emapper sets to auto by default (which then leads to not setting this option at all in the diamond subprocess). Without --algo specified, diamond auto-detects what it thinks is the optimal value, and this depends on the input set size. The diamond docs also state:

Note that while the two algorithms are configured to provide roughly the same sensitivity for the respective modes, results will not be exactly identical to each other.

and

The algorithm used will be displayed at program startup.

(This information seems to get lost because emapper captures the subprocess stdout/stderr, unfortunately, so I cannot verify whether the two (or rather 15k + 1) runs indeed used different settings, but it seems a reasonable assumption.)

(see https://github.com/bbuchfink/diamond_docs/blob/master/3%20Command%20line%20options.MD).

I am now re-running my tests with --algo pinned to 0 via the emapper option --dmnd_algo 0 (I don't have the time to build a smaller example) and will let you know (in about a month...).
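
(I.e., the re-runs look like this; diamond's --algo 0 selects the double-indexed algorithm, so the choice no longer depends on the input size:)

emapper.py -i batch_00000.faa --data_dir /path/to/eggnog_db --output batch_00000 \
    -m diamond --cpu 8 --dbmem --temp_dir /path/to/tmp-dir --dmnd_algo 0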

Best, Christian

Cantalapiedra commented 1 year ago

Hi @cschu ,

If I understand correctly, the difference in results between your runs is due to the size of the data set making diamond choose a different --algo parameter?

Note that, if I am not mistaken, emapper is not setting --algo by default, so it should be running just with the default diamond --algo option (which is actually the auto option).

I think it is a good idea to try with --algo 0. Did you manage to check the results?

Thank you very much for these insights!

Best, Carlos

cschu commented 1 year ago

Hi @Cantalapiedra ,

If I understand correctly, the difference in results between your runs is due to the size of the data set making diamond choose a different --algo parameter?

Yes, that is correct.

Note that, if I am not mistaken, emapper is not setting --algo by default, so it should be running just with the default diamond --algo option (which is actually the auto option).

Also correct (at least that's how it looks in the code).

Did you manage to check the results?

Due to limited cluster capacity, the computations are still running (cannot believe another month has passed already...). See you next month :D

Best, Christian

Cantalapiedra commented 1 year ago

Hi @cschu ,

Thank you for your answers. Yes, days, months, years, pass very fast ;)

Are you running these analyses on a cluster? If so, it may be worth separating the search (--no_annot) and annotation (-m no_search --annotate_hits_table emapper.seed_orthologs --dbmem) stages, depending on your infrastructure: the search stage benefits from using more CPUs per job rather than from parallelizing jobs, whereas the annotation stage with --dbmem may benefit more from running more jobs in parallel (edit: although the parallelization of the annotation stage may be limited by the memory required to load the DB, of course, depending on how much you have available). A sketch of the two-stage setup is below.
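
(A minimal sketch with placeholder paths and prefixes; stage 1 writes the hits table as <prefix>.emapper.seed_orthologs:)

# stage 1: search only, one job with many CPUs
emapper.py -i proteins.faa -m diamond --no_annot --cpu 64 \
    --data_dir /path/to/eggnog_db --output search_run

# stage 2: annotation only, re-using the hits table from stage 1; several such
# jobs can run in parallel, each loading the annotation DB into memory (--dbmem)
emapper.py -m no_search --annotate_hits_table search_run.emapper.seed_orthologs \
    --dbmem --cpu 8 --data_dir /path/to/eggnog_db --output annot_run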

Best, Carlos