Non-detailed taxonomy classification. Possibility of IMG/VR database integration

poursalavati commented 1 year ago

First of all, thank you very much for developing this tool. Using this tool was a great experience.

I have four questions,

In 95% of cases, the taxonomy results only go as far as the "order" layer. Only in some cases "family" is also reported. Can this situation be improved? Is it possible to reach other layers including genera or even species? Or at least all cases can include family?

The second question, Most of the results (over 90%) are placed in the Unclassified group. Is there a way to classify them?

The third question, In this address https://zenodo.org/record/7084650, you have also placed three other databases. What is their use? Is it possible to use this as an example: genomad_hmm_v1.1.tar.gz

The fourth question, Is it possible to use the AMG/VR 4 database (as you mentioned in its paper) along with this pipeline (or as a database for this pipeline)? To achieve a more accurate taxonomy as well as fewer Unclassified.

Thank you very much in advance, NP

apcamargo commented 1 year ago

Hi @poursalavati

First of all, thank you very much for developing this tool. Using this tool was a great experience.

Thanks you! I really appreciate hearing this from users :)

In 95% of cases, the taxonomy results only go as far as the "order" layer. Only in some cases "family" is also reported. Can this situation be improved? Is it possible to reach other layers including genera or even species? Or at least all cases can include family?

geNomad can only classify up to the family level. If your genomes are mostly getting assigned to the order rank, it means that you probably don't have enough evidence to assign them to a specific family. Keep in mind that most viruses in environmental data don't belong to known species/genera/families.

Assuming that you suspect your genomes might belong to known genera, my recommendation is to cluster your genomes to RefSeq/GenBank genomes using the method described here. If your genomes cluster together with a reference, you can assign them to that genus.

Most of the results (over 90%) are placed in the Unclassified group. Is there a way to classify them?

If they are left unclassified, it means that no taxonomically-informed marker hit the proteins encoded in that sequence. So, there's no way you can assign taxonomy to those sequences. This number is pretty high, though. Usually most sequences do get a taxonomy. Can you share the data with me so I can double check what is going on?

In this address zenodo.org/record/7084650, you have also placed three other databases. What is their use? Is it possible to use this as an example: genomad_hmm_v1.1.tar.gz

genomad_msa_v1.1.tar.gz: Multiple sequence alignments of the geNomad markers. They are used to generate the MMseqs2 database. I make them available because people might want to look at the alignments or because they might be useful in the future for something else. They are not explicitly used by geNomad.
genomad_hmm_v1.1.tar.gz: HMMs generated from the alignments. They are not used by geNomad, but people might want to use them to annotate proteins using HMMER, which is more sensitivity (and slower) than MMseqs2.

Is it possible to use the AMG/VR 4 database (as you mentioned in its paper) along with this pipeline (or as a database for this pipeline)? To achieve a more accurate taxonomy as well as fewer Unclassified.

You can download the IMG/VR data here and cluster you genomad with IMG/VR's with the code I mentioned above.

poursalavati commented 1 year ago

Thanks a lot for the information @apcamargo,

Can you share the data with me so I can double check what is going on?

Please let me know which file you prefer to share.

my recommendation is to cluster your genomes

I was thinking to find a way to efficiently bin the contigs after assembly (before using geNomad, because the average length now is almost short, around 500 or sometimes less), I think it will affect the un-classification of these contigs using geNomad. isn't it?

There is another point I facing now, the number of CPUs and threads when using end-to-end is very high and it's not controlled using the -t option. seems the MMseqs2 is not affected using this option and fills-up the CPU completely.

Best, NP

apcamargo commented 1 year ago

Please let me know which file you prefer to share.

Can you send me the sequences? Otherwise, you can just send me the _genes.tsv and _taxonomy.tsv files in _annotate directory.

I was thinking to find a way to efficiently bin the contigs after assembly (before using geNomad, because the average length now is almost short, around 500 or sometimes less), I think it will affect the un-classification of these contigs using geNomad. isn't it?

I don't think I understood that. geNomad doesn't take binning into account. Sequences are classified independently.

There is another point I facing now, the number of CPUs and threads when using end-to-end is very high and it's not controlled using the -t option. seems the MMseqs2 is not affected using this option and fills-up the CPU completely.

MMseqs2 should respect the number of CPUs you set. I call the search command using the user-provided parameter, if MMseqs2 is using more threads than what you asked it to use, I don't think I can do anything on my side 😕

poursalavati commented 1 year ago

Here you are. These are examples of one of my data sets:

final.contigs_taxonomy.txt final.contigs_genes.txt

I don't think I understood that. geNomad doesn't take binning into account. Sequences are classified independently.

I mean, maybe the reason for this large portion of unclassified could be related to the contig length, so if I bin them and have a longer contig (consensus) maybe improve the result.

MMseqs2 should respect the number of CPUs you set. I call the search command using the user-provided parameter, if MMseqs2 is using more threads than what you asked it to use, I don't think I can do anything on my side

Thanks, I set the -t on 40, but now it's using all CPUs and filling up until 56!

apcamargo commented 1 year ago

These are examples of one of my data sets:

It seems that the unclassified contigs are the very short ones, without any genes assigned to a marker. Were these classified as virus?

I mean, maybe the reason for this large portion of unclassified could be related to the contig length, so if I bin them and have a longer contig (consensus) maybe improve the result.

Ahh, ok. It makes sense, but you would have to do it manually since geNomad doesn't support binning data yet.

Thanks, I set the -t on 40, but now it's using all CPUs and filling up until 56!

That's strange. Did you check if it is MMseqs2 that is causing this? I know that the neural network classification uses all the available cores, but MMseqs2 should respect your settings.

poursalavati commented 1 year ago

It seems that the unclassified contigs are the very short ones, without any genes assigned to a marker. Were these classified as virus?

Yes as my input material and datasets are viral enriched data. But as you also mentioned, the contig lengths are problematic.

Ahh, ok. It makes sense, but you would have to do it manually since geNomad doesn't support binning data yet.

Yeah, I'm looking for an efficient binning process. if you have any Idea it's more than welcome.

That's strange. Did you check if it is MMseqs2 that is causing this? I know that the neural network classification uses all the available cores, but MMseqs2 should respect your settings.

Oh, So maybe it's related to neural network step, Is there any suggestion to limit it or have control over it?

apcamargo commented 1 year ago

Yeah, I'm looking for an efficient binning process. if you have any Idea it's more than welcome.

Binning doesn't work well on short scaffolds because the tetranucleotide frequencies are noisy. You can try Vamb and reduce the weight of the TNF information.

Oh, So maybe it's related to neural network step, Is there any suggestion to limit it or have control over it?

Not right now. You can disable the neural network step using --disable-nn-classification.

poursalavati commented 1 year ago

Binning doesn't work well on short scaffolds because the tetranucleotide frequencies are noisy. You can try Vamb and reduce the weight of the TNF information.

Thank you for your suggestion,

Not right now. You can disable the neural network step using --disable-nn-classification

Does the deactivation of this step have a noticeable effect on taxonomy and classification?

I also have a suggestion for the final summary file; the final output file can contain the name of each contig and the names of related predicted genes along with their sequences (na and aa).

Because now I'm working with these outputs and this is how I reproduced the final file, If needed for other users:

seqkit faidx X_virus_proteins.faa cut X_virus_proteins.faa.fai -f1 | sed 's/_[^_]*$//' > ids seqkit fx2tab X_virus_proteins.faa > tab-aa paste ids tab-aa > id-tab-aa seqkit fx2tab X_virus.fna > tab-na join <(LC_ALL=C sort -t $'\t' -k 1b,1 id-tab-aa) <(LC_ALL=C sort -t $'\t' -k 1b,1 tab-na) > tab-3 join <(LC_ALL=C sort -t $'\t' -k 1b,1 tab-3) <(LC_ALL=C sort -t $'\t' -k 1b,1 X_virus_summary.tsv) > X_Final.tsv

Best, NP

apcamargo commented 1 year ago

Does the deactivation of this step have a noticeable effect on taxonomy and classification?

It won't affect the taxonomy, but will affect the classification. The neural network classification (sequence branch in the figure below) is especially important to classify sequences with few or no markers. When the scores of the neural network and the marker-based classifier are aggregated, the weight of the neural network score is maximum when you have few or no markers (box A2 below).

So, those short sequences that you have might not get classified as a virus. I don't think that this is a huge problem, as I wouldn't trust short sequences without any markers anyway.

I might add a filter in geNomad to prevent very short sequences without any markers from being classified as virus.

I also have a suggestion for the final summary file; the final output file can contain the name of each contig and the names of related predicted genes along with their sequences (na and aa).

I like the idea, but I feel this output could confuse most people as it is not standard. I prefer to provide users with simple outputs and they can work from there if they need (as you did!)

apcamargo commented 1 year ago

I'll close this issue @poursalavati. If you got any other questions, just let me know and I can reopen it

apcamargo / genomad

Non-detailed taxonomy classification. Possibility of IMG/VR database integration #9