EBI-Metagenomics / emg-viral-pipeline

VIRify: detection of phages and eukaryotic viruses from metagenomic and metatranscriptomic assemblies
Apache License 2.0

Taxonomic ranks are inverted #113

Open Ales-ibt opened 11 months ago

Ales-ibt commented 11 months ago

Hello there!

I've been testing VIRify v2.0 and realised that the taxonomic annotation in the GFF file has the ranks inverted.

For instance: taxonomy=Entomopoxvirinae;Poxviridae;Chitovirales

Should be: taxonomy=Chitovirales;Poxviridae;Entomopoxvirinae

And ideally, it would be great to have the whole lineage like: taxonomy=Viruses;Bamfordvirae;Nucleocytoviricota;Pokkesviricetes;Chitovirales;Poxviridae;Entomopoxvirinae
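To make the requested change concrete, here is a minimal sketch of reversing the rank order in a `taxonomy=` field. The function name `reorder_taxonomy` is hypothetical and is not taken from the VIRify code base; it only assumes the field looks like the examples above.

```python
def reorder_taxonomy(field: str) -> str:
    """Reverse the rank order of a 'taxonomy=A;B;C' field.

    Hypothetical helper for illustration only; not part of VIRify.
    """
    key, _, value = field.partition("=")
    # Drop empty entries that a trailing ';' would produce.
    ranks = [rank for rank in value.split(";") if rank]
    return f"{key}={';'.join(reversed(ranks))}"


print(reorder_taxonomy("taxonomy=Entomopoxvirinae;Poxviridae;Chitovirales"))
```

Applied to the example above, this turns the lowest-rank-first string into the highest-rank-first one.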

There are also some problems with names such as Caudovirales, which appears in the NCBI taxonomy database as Caudoviricetes.

Thanks in advance!

Ales.

hoelzer commented 11 months ago

Hey, thx @Ales-ibt !

Yes, agreed, inverting the ranks would probably make more sense. Showing the full lineage should also be possible with the NCBI taxonomy file, right @guille0387?

Regarding Caudovirales vs Caudoviricetes: Caudovirales should no longer be in the pipeline because the taxon was discontinued by ICTV. We added the following warning message when running VIRify:

Warning: --meta_version v4 does not include the following discontinued virus taxa 
(according to ICTV) anymore and they have been excluded from the dataset.
- Allolevivirus
- Autographivirinae
- Buttersvirus
- Caudovirales
- Chungbukvirus
- Incheonvirus
- Leviviridae
- Levivirus
- Mandarivirus
- Pbi1virus
- Phicbkvirus
- Radnorvirus
- Sitaravirus
- Vidavervirus
- Myoviridae
- Siphoviridae
- Podoviridae
- Viunavirus
- Orthohepevirus
- Klosneuvirus
- Hendrixvirus
- Rubulavirus
- Avulavirus
- Catovirus
- Nucleorhabdovirus
- Viunavirus
- Gammalipothrixvirus
- Peduovirinae
- Sedoreovirinae

Did you still have Caudovirales in your results? Could you try a fresh installation and, most importantly, re-download the database files? Maybe an old database file was still being used.
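One quick way to check whether a stale database was used is to scan the reported lineages for names from the warning list. The sketch below is only an illustration: `DISCONTINUED` holds a handful of names copied from the warning above (not the full list), and `discontinued_hits` is a hypothetical helper, not a VIRify function.

```python
# A few names copied from the warning message above; extend as needed.
DISCONTINUED = {"Caudovirales", "Myoviridae", "Siphoviridae", "Podoviridae", "Leviviridae"}


def discontinued_hits(lineages):
    """Map each discontinued name to the lineage strings that contain it.

    Assumes each lineage is a ';'-separated string of taxon names.
    """
    hits = {}
    for lineage in lineages:
        for name in lineage.split(";"):
            if name in DISCONTINUED:
                hits.setdefault(name, []).append(lineage)
    return hits


results = [
    "Caudovirales;Siphoviridae",
    "Chitovirales;Poxviridae;Entomopoxvirinae",
]
for name, found_in in discontinued_hits(results).items():
    print(f"{name} still present in: {found_in}")
```

Any hit would suggest that an old database file is still being picked up.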

guille0387 commented 11 months ago

Hi @hoelzer @Ales-ibt

Yes, I think it should be possible to invert the order of the ranks and include the complete lineage.... let me have a look into this and I'll get back to you asap.
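For illustration, building a complete lineage amounts to walking child-to-parent links up to the root and then reversing the result. The sketch below assumes a small in-memory parent map whose entries are taken from the lineage Ales posted; a real implementation would presumably load these links from the NCBI taxonomy files instead. `PARENT` and `full_lineage` are hypothetical names, not the code in the pipeline.

```python
# Child -> parent links, taken from the example lineage in this thread.
PARENT = {
    "Entomopoxvirinae": "Poxviridae",
    "Poxviridae": "Chitovirales",
    "Chitovirales": "Pokkesviricetes",
    "Pokkesviricetes": "Nucleocytoviricota",
    "Nucleocytoviricota": "Bamfordvirae",
    "Bamfordvirae": "Viruses",
    "Viruses": None,
}


def full_lineage(taxon, parent=PARENT):
    """Return the root-to-taxon lineage as a ';'-separated string."""
    lineage = []
    seen = set()
    while taxon is not None:
        if taxon in seen:
            raise ValueError(f"cycle detected at {taxon}")
        seen.add(taxon)
        lineage.append(taxon)
        # Taxa missing from the map are treated as having no parent.
        taxon = parent.get(taxon)
    return ";".join(reversed(lineage))


print("taxonomy=" + full_lineage("Entomopoxvirinae"))
```

Collecting names from the assigned taxon upward and reversing once at the end gives the highest rank first, which is the order requested above.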

guille0387 commented 11 months ago

Hi @hoelzer @Ales-ibt

I created a new branch called out_lineage with modifications in the contig taxonomic assignment script. The output should now reflect the suggestions that Ales made. I tested it with the two mock datasets we used in the paper and it worked, but perhaps Ales would like to try it with her own data? Let me know if you have any issues.

hoelzer commented 11 months ago

Great, thx @guille0387! Looks good to me as well. @Ales-ibt can you give it a try as well? thx!

Ales-ibt commented 11 months ago

Great, I'll run a test and get back to you soon.

Ales-ibt commented 10 months ago

Hello, sorry for taking so long to get back to you. I updated the NCBI database and now I have the correct Caudoviricetes annotation :D. I also tested the pipeline on the out_lineage branch and I can see the complete lineages beautifully sorted in 08-final/taxonomy/*prodigal_annotation_taxonomy.tsv, thank you so much for this. The only detail is that this fix is not reflected in the GFF output file.

Thank you again!

Ales

hoelzer commented 10 months ago

Awesome, thanks for checking, @Ales-ibt !

@guille0387 can you also do the GFF fix? Then we could merge that into dev, @mberacochea.

mberacochea commented 10 months ago

Excellent @guille0387!, thank you for that fix. Let me know if you need a hand fixing the GFF.

guille0387 commented 10 months ago

Hi Martin!

Actually, I might need help with the GFF 😅… I’m not even sure which step of the pipeline generates that file as output… if you could help me out with that it’d be great, or if you could guide me on what to do that’d be great too :)

mberacochea commented 1 month ago

Hey folks,

I'm trying to catch up with the VIRify backlog. There is an excellent PR #84 to add support for VirSorter2, so it's the perfect opportunity to make a new release that also includes this fix.

Cheers

hoelzer commented 4 weeks ago

Hey, yes, agreed, it would be perfect to have another release with VS2 support and some of the currently open issues resolved.

I think everything here was solved:

"I created a new branch called out_lineage with modifications in the contig taxonomic assignment script."

Just not the change of taxonomic rank order in the GFF... Ah, or was this done in #129, @mberacochea? Then this issue should be solved.