bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.03k stars 183 forks source link

Getting taxonomic output from Diamond #256

Closed jbatovska closed 5 years ago

jbatovska commented 5 years ago

Hello,

I am trying to get Diamond output that includes taxonomic information (similar to sscinames and sskingdoms in blast) and have made a database which included the –taxonmap and –taxonnodes parameters. I am using the 102 taxonomic classification output, and this is giving me the query ID, the NCBI taxonomic ID and the evalue.

How do I then turn this into something that will tell me the species/family/kingdom of the match? I have files with thousands of sequences and it would be really handy if there was an output that could tell me the species/family/kingdom of each match, instead of just a number?

Apologies if this has been addressed elsewhere, I just wasn't sure of how to proceed.

Thank you for your help.

bbuchfink commented 5 years ago

Hi,

this feature is not available in Diamond at the moment but I could probably implement it within the next week or so. The info you require is contained in the taxdmp.zip file from NCBI. The names.dmp file can be used to map taxon ids to names, so you could also write some simple python script to perform this mapping.

Best regards,

Benjamin

jbatovska commented 5 years ago

Thanks for the information. If this feature ends up being implemented in Diamond, could you please let me know via this thread? Thank you very much.

bbuchfink commented 5 years ago

Sure, it shouldn't be too long.

alexpiper commented 5 years ago

Hi Benjamin

Any idea on a time-frame for this feature? I'm about to put together a new pipeline implementing DIAMOND and this would greatly simplify things.

Thanks!

bbuchfink commented 5 years ago

Hi Alexander,

I'll try to get it implemented within 1 week (but can't make any promises).

bbuchfink commented 5 years ago

The latest commit now supports the sscinames output field. For makedb you have to use the --taxonnames parameter to specify the path to the names.dmp file from the NCBI taxdmp.zip in addition to the --taxonmap parameter.

Let me know if you need anything else.

alexpiper commented 5 years ago

Thanks for implementing this so quickly! Greatly appreciated.

jbatovska commented 5 years ago

Thank you for implementing this!

elarsonneur commented 5 years ago

Hi @bbuchfink, Thank for implementing this feature. Do you plan for a new release soon ?

bbuchfink commented 5 years ago

Yes, there should be a new release within 1-2 weeks.

Gurlaz commented 3 years ago

The latest commit now supports the sscinames output field. For makedb you have to use the --taxonnames parameter to specify the path to the names.dmp file from the NCBI taxdmp.zip in addition to the --taxonmap parameter.

Let me know if you need anything else.

Hello, this thread helped me with a similar issue. Thank you! I am having trouble using --taxonnames parameter while making the database now and getting an error saying it is an Invalid option. --taxonmap and --taxonnodes is working fine.

bbuchfink commented 3 years ago

What version of diamond are you using? Check with diamond --version.

Gurlaz commented 3 years ago

diamond --version

It is diamond version 0.9.14

bbuchfink commented 3 years ago

You will need to use a later version then, the option was not supported back then.

yangsungig commented 1 year ago

Could I output with '-f 102' format, and with 'sscinames' as a column, instead of 'NCBI taxonomy ID'.

bbuchfink commented 1 year ago

Could I output with '-f 102' format, and with 'sscinames' as a column, instead of 'NCBI taxonomy ID'.

No that is not supported at the moment.

Mathildebd commented 1 year ago

I am also looking for a way to obtain the taxonomic name of the Diamond blast hits, is there really no way? Alternatively, did anyone write a script which can return the scientific name from a list of 'NCBI taxonomy ID'? I have thousands of sequences.... Would be greatful for help with this, I thought I fixed it by building the nr database with the --taxonmap and --taxonnodes but follow this discussion here it seems it has been a feature that worked, but is no longer.....?

bbuchfink commented 1 year ago

I am also looking for a way to obtain the taxonomic name of the Diamond blast hits, is there really no way? Alternatively, did anyone write a script which can return the scientific name from a list of 'NCBI taxonomy ID'? I have thousands of sequences.... Would be greatful for help with this, I thought I fixed it by building the nr database with the --taxonmap and --taxonnodes but follow this discussion here it seems it has been a feature that worked, but is no longer.....?

Do you need this for the tabular output format (-f 6) or the taxonomy output format?

Mathildebd commented 1 year ago

Sorry for late reply, I was out of office for a few days. I tried now this setting:

-f 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames stitle --top 2

which gave me an output like this:

seq1 MBZ5604617.1 75.0 76 19 0 2 229 133 208 2.04e-29 120 2081523 2081523 MBZ5604617.1 insulinase family protein [Acidobacteriia bacterium]

I am currently only interested in the last information of taxon ID in the [], but much rather I would like to have the full taxonomic string from domain, phyla etc. etc.... Is that possible?

bbuchfink commented 1 year ago

You need to build the database with the --taxonnames option. Due to a bug in the current version, you should build it using v2.0.15. Then the output fields sscinames, sskingdoms (super kingdom), skingdoms and sphylums should be available. Other ranks are not supported, although it would be possible to add this in a future version.

Mathildebd commented 1 year ago

I worked perfectly - thank you!

katievigil commented 11 months ago

Hi can anyone get the full taxonomy list in their output?

I would like to get families and subfamilies if possible.

qseqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle
b10d4f95-363e-42b1-9b35-1eca05f80366 YP_004894429.1 40.2 102 61 4.08E-17 78.6 1094892 1094892 0 0 0 YP_004894429.1 clamp loader of DNA polymerase [Megavirus chiliensis]

Here is my code:

diamond blastx -d /lustre/project/taw/kvigil/Reference/viralprotein.dmnd -q barcode04.fastq.gz -o ONR030223barcode04.tsv --ultra-sensitive -f 6 qseqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle

Do I need to add anything extra for long-read nanopore sequences when I execute Diamond?

Mathildebd commented 11 months ago

@lucyintheskyzzz As I uderstand getting the lower taxonomic ranks (from phyla down to species) is not supported by Diamond (yet?). I would be really interrested in this too, I am currently trying ot use MEGAN (meganizer) to obtain this output, but it's not working for me and I have waited a few weeks now for online support for the issues I am having with that approach (Megan community)......

bbuchfink commented 11 months ago

Hi can anyone get the full taxonomy list in their output?

I would like to get families and subfamilies if possible.

qseqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle b10d4f95-363e-42b1-9b35-1eca05f80366 YP_004894429.1 40.2 102 61 4.08E-17 78.6 1094892 1094892 0 0 0 YP_004894429.1 clamp loader of DNA polymerase [Megavirus chiliensis] Here is my code:

diamond blastx -d /lustre/project/taw/kvigil/Reference/viralprotein.dmnd -q barcode04.fastq.gz -o ONR030223barcode04.tsv --ultra-sensitive -f 6 qseqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle

Do I need to add anything extra for long-read nanopore sequences when I execute Diamond?

I'm not sure what you mean, is the problem that the sskingdoms, skingdoms and sphylums fields are 0 or do you need other fields?

katievigil commented 11 months ago

Hi @bbuchfink I am actually getting the kingdoms and phylums, but I would like to get class, family, subfamily, genus and species if possible?- The data with zeros are viruses that have not been classified yet. Usually for metagenomic sequencing publications with viruses, people use the viral family to create figures.

@Mathildebd yes I have asked on the megan community many times how to export a .tsv file similar to Diamond and I still can't figure it out. Everything exports as a .txt file in MEGAN and when I switch it to .csv all the data is still in one column on excel. The output from Diamond is what I have been looking for, I just need class, family , subfamily, genus and species if possible. This would be amazing!

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

length | mismatch | evalue | bitscore | staxids | sscinames | sskingdoms | skingdoms | sphylums | stitle -- | -- | -- | -- | -- | -- | -- | -- | -- | -- 37 | 19 | 1.09E-05 | 45.8 | 2267653 | Erwinia phage Wellington | Viruses | Heunggongvirae | Uroviricota | YP_009806609.1 hypothetical protein HOT70_gp176 [Erwinia phage Wellington]

katievigil commented 11 months ago

@Mathildebd were you able to get a table with family, subfamily, genus and specie from the blastx hits on Diamond or Diamond+Megan? I have samples with >10,000 viral hits and it is taking forever to add all the families manually. Thanks!

Mathildebd commented 11 months ago

Hi again. So, looking at an earlier answer from @bbuchfink in this treads (to me back in Apr 19) he writes:

[besides super kingdom, kingdom and phyla] Other ranks are not supported, although it would be possible to add this in a future version."

So unless it was implemented I think it isn't possible. Yesterday I recieve help from Daniel Huson (megan) and you can follow the discussion here (megan community). I hope I will have it working wihtin a week or so. Good luck :)

Mathildebd commented 10 months ago

Hi @bbuchfink ,

Any chance those lower rank taxonomies will be available for the Diamond LCA in future updates? (and is there an ETA on that?) Since I am still struggling with the megan application.... And the whole work flow would be so much easier if I could just stick to Diamond.

katievigil commented 10 months ago

@Mathildebd I agree! Here is a temporary solution using Python (see github issue beloq)! I am meeting with the Megan group tomorrow on zoom at 9am Central Standard Time (USA) If you like to join? They are going to create a script on how to export a .csv file just like diamond but with all the taxonomy.

Please find the link to our meeting — Friday, 9 am — attached below.

Computomics Conference is inviting you to a scheduled Zoom meeting.

Join Zoom Meeting https://us06web.zoom.us/j/87981410443?pwd=lJN2MEqzYkX0uaqDcgwRG59aLBeNwX.1 Meeting ID: 879 8141 0443 Passcode: 583967

Below is the python script I wrote that exports the taxonomy: https://github.com/bbuchfink/diamond/issues/719#issuecomment-1786371748

Mathildebd commented 10 months ago

Thank you for inviting me ! Unfortunately I can't make the time work (it would be 3pm here and I have to pick up my son). I am very curious to hear if you find a feasible solution. Just FYI I am not particularly looking at virus taxonomy but "everything" (mainly bacteria).

katievigil commented 10 months ago

@Mathildebd are you able to meganize your daa file without an error?

Mathildebd commented 10 months ago

yes, I believe so. had to increase the memory usage and then it worked fine.

bbuchfink commented 10 months ago

Hi @bbuchfink ,

Any chance those lower rank taxonomies will be available for the Diamond LCA in future updates? (and is there an ETA on that?) Since I am still struggling with the megan application.... And the whole work flow would be so much easier if I could just stick to Diamond.

I hope I'll be able to implement it in the next weeks.

Mathildebd commented 10 months ago

@bbuchfink That would be absolutely amazing ! Thank you !

katievigil commented 10 months ago

@bbuchfink @Mathildebd wow amazing news! I got my .daa files to meganize, then I went into megan and exported the taxonomic names using readName_to_taxonPathKPCOFGS and I lost alot of reads doing the megan export. I was able to do the VLOOKUP on excel, but the export from Megan had only 1473 rows, whereas my Diamond .tsv file has 28922 rows, so most of the data got lost after doing the MEGAN export function : readName_to_taxonPathKPCOFGS.

This will help me ALOT if I can just use diamond to create my .tsv file, then I am done! <3 AMAZING!

katievigil commented 10 months ago

Hi @Mathildebd if you are familiar with R, this taxonomizr R package worked great to scan the NCBI taxonomy ID's from the Diamond .tsv output and adds the whole taxonomic path into a nice .csv file. You should try it out if you like while we wait for the Diamond update. https://cran.r-project.org/web/packages/taxonomizr/readme/README.html

This is faster than using a python Bio Entrez script.

Mathildebd commented 9 months ago

Hi @bbuchfink, any updates regarding this implementation?

bbuchfink commented 9 months ago

Sorry no, didn't find the time for it yet.

bbuchfink commented 7 months ago

The latest release now has the option --include-lineage to include the taxonomic lineage in the output for the taxonomic classification format. I assume this is what you needed, but feel free to ask if there is still anything missing.

katievigil commented 7 months ago

@bbuchfink Thanks so much! I will try it out and let you know what I get!

katievigil commented 7 months ago

Hi I updated diamond and I got this error when I went to use --include-lineage:

diamond blastx -d /lustre/project/taw/kvigil/Reference/viralprotein.dmnd -q oysterspooled030223.fastq.gz -o ONR030223oysterspooled.tsv --ultra-sensitive --include-lineage -f 6 q seqid sseqid pident length mismatch evalue bitscore staxids sscinames sskingdoms skingdoms sphylums stitle Error: Invalid option: include-lineage

bbuchfink commented 7 months ago

Please double check that you are running the latest version i.e. diamond version. Also note that the --include-lineage option is only available for the taxonomic format (-f 102). The same for the tabular format is still something I would have to implement.

Mathildebd commented 7 months ago

that is great - thank you @bbuchfink. In the meantime I found a well-working solution using taxopy: https://github.com/apcamargo/taxopy

katievigil commented 4 months ago

@bbuchfink Hi I am going to run the latest diamond with the updated taxonomy, does this look correct?

(diamond_env) [kvigil@qbc2 diamond]$ diamond --version diamond version 2.1.9

Run DIAMOND on each input fastq.gz file

for file in /work/kvigil/hecatomb/onr.raw.data; do base=$(basename "$file" .fastq.gz) output_file="${output_directory}/${base}.tsv" diamond blastx -d /work/kvigil/diamond/viralprotein.050124.dmnd -q "$file" -o "$output_file" --sensitive --long-reads --include-lineage -f 102 done