Taxonomy changed when input contigs are different

xwu35 commented 1 year ago

Hi,

I used geNomad to identify viruses in addition to VirSorter2 and Cenote-taker2. Subsequently, I used geNomad's annotate module to assign taxonomy to all viral contigs, including those identified by other tools. I noticed that three of them received completely different taxonomic assignment when the input contigs were altered.

For instance, one of them was initially categprozed as Algavirales (Varidnaviria › Bamfordvirae › Nucleocytoviricota › Megaviricetes ) when the input contigs consisted of all assembled contigs (approximately 70,000), but it was then classified as Caudoviricetes (Duplodnaviria › Heunggongvirae › Uroviricota) when only viral contigs were used as the input (around 5000). Is this expected? I would think that taxonomy annotation should be more stable across different input contigs.

Many thanks.

apcamargo commented 1 year ago

This shouldn't happen. The only thing that would lead to a difference in the taxonomy is a variance in the annotation process, but changing the input should not alter the annotation of a given sequence.

Can you provide the rows of both _genes.tsv files that contain the genes of this example you mentioned?

xwu35 commented 1 year ago

Thanks for the quick response. Please see below:

all assembled contigs as input: contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.002410.VV 8.55e-08 58 0 0 1 11844 Phycodnaviridae NA NA PF16903 Major capsid protein N-terminus contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.171343.VP 3.504e-16 80 0 0 0 12031 Mimiviridae NA NA PF18406 Ferredoxin-like domain in Api92-like protein contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.068590.VV 0.0001592 48 0 0 1 2561 Caudoviricetes NA NA PF13871;PF11242 C-terminal domain on Strawberry notch homologue

only viral contigs as input: contig_11421_1 2 1918 1917 1 0.529 11 None GENOMAD.088220.VV 7.588e-11 68 0 0 0 2561 Caudoviricetes NA NA NA NA contig_38880_1 3 242 240 1 0.483 11 None NA NA NA 0 0 0 1 NA NA NA NA NA contig_38880_2 302 967 666 1 0.434 11 None GENOMAD.055798.VV 8.984e-50 178 0 0 0 2561 Caudoviricetes NA NA PF19174 NA contig_38880_3 1043 1222 180 1 0.383 11 AGGA NA NA NA 0 0 0 1 NA NA NA NA NA contig_802_1 1 126 126 -1 0.548 11 GGAG/GAGG NA NA NA 0 0 0 1 NA NA NA NA NA contig_802_2 153 3035 2883 -1 0.537 11 None GENOMAD.056032.VV 1.96e-05 51 0 0 0 11844 Phycodnaviridae NA NA PF00176 ERCC4-related helicase

apcamargo commented 1 year ago

Did you use the same geNomad version for both runs? Which version was it? What version of the database did you use?

Not related to this, are these all the genes in your contigs? Be extra careful when working with such short sequences.

xwu35 commented 1 year ago

Yes, I used the same geNomad version 1.5.2 and database version 1.2.

We had to use contigs >= 1kb since the majority of the assembled contigs were short than 3 kb.

apcamargo commented 1 year ago

Indeed, it seems that the size of the input does change the annotation in my testing. I didn't expect that since the E-value depends on the database size, not query size.

My guess is that this is a MMseqs2 thing. @milot-mirdita, do you know what could be causing this? geNomad basically uses the proteins encoded by the input sequences as query and searches a profile database. I then take the best hit of each query protein. You can find the commands here.

Maybe this is because I'm using --max-rejected? If the order of the alignments of a given query is different depending on the number of queries, --max-rejected would stop the alignment computation at different points.

apcamargo commented 1 year ago

@xwu35 I did some more investigation and I found the cause of this issue.

Short story: the annotation should be very slightly more reliable when your input has less sequences (in your case, the input with only viral sequences). In any case, I wouldn't really recommend using the taxonomy of very short sequences.

Long story: It seems that the order of the sequences in the input matter when using --max-rejected. I don't really understand why, because I'd assume that the counter resets for each query. The problem is that if I stop using --max-rejected the runtime will increase significantly, so I'll look into other options to solve this.

xwu35 commented 1 year ago

Thanks for looking into it.

Will this issue make the virus identification step unreliable since part of the identification method is to find aligned viral marker genes?

apcamargo commented 1 year ago

The effects will be minimal, you shouldn't worry. This issue affects very few proteins within a sample and only alignments with high E-value are susceptible to variance.

xwu35 commented 1 year ago

Thanks, it is good to know.

apcamargo commented 1 year ago

I just pushed a change to the way the MMseqs2 searches are performed that will mitigate this issue

apcamargo / genomad

Taxonomy changed when input contigs are different #33