ArifaKhanLab / RVDB

A reference viral database (RVDB)
https://rvdb.dbi.udel.edu/

Full taxonomy #6

Open srosales712 opened 4 years ago

srosales712 commented 4 years ago

Hi, is there a way to get the full taxonomy for all the sequences in the RVDB? I used NCBI's Entrez Programming Utilities to extract taxonomy from the accession numbers in my TBlastx results, but only a fraction of them were associated with a taxon ID. Any suggestions on how to get the complete taxonomy lineage?

Thanks, Stephanie

bosborne commented 4 years ago

Stephanie,

Which accession numbers didn’t link to a Taxonomy id?

Don’t need all accessions, just some representatives. I ask because I’m running into a similar issue using Entrez.

Another way to address this same issue: does this file have the links you’re looking for?

ftp://ftp.ncbi.nih.gov//pub/taxonomy/accession2taxid/prot.accession2taxid.gz

Brian O.

srosales712 commented 4 years ago

Hi Brian, it looks like Entrez is annotating all my accession numbers after all. The command I ran collapses duplicate accession numbers rather than producing output for each row. So although I had over a thousand entries, only 300 of the accession numbers were unique, and the program output 300 tax IDs.

Here is the command I ran, just in case anyone else runs into this problem or if you have a nicer solution:

cat accesion.txt | epost -db nuccore | esummary -db nuccore | xtract -pattern DocumentSummary -element Caption,TaxId > taxid.txt

Thanks, Stephanie
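
One way to get a taxon ID back onto every row is to treat the deduplicated output as a lookup table and join it to the full accession list. The Python sketch below assumes taxid.txt holds tab-separated Caption and TaxId columns, as the xtract command above would write them, and that Caption is the accession without its version suffix. The function names are hypothetical.

```python
def parse_taxid_table(lines):
    """Build {accession: taxid} from 'Caption<TAB>TaxId' lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        table[fields[0]] = fields[1]
    return table


def annotate_rows(accessions, table):
    """Return one (accession, taxid) pair per input row, duplicates included.

    The version suffix (".1", ".2", ...) is dropped before the lookup because
    Caption is assumed to be unversioned. Unmatched accessions get None.
    """
    annotated = []
    for raw in accessions:
        acc = raw.strip()
        if not acc:
            continue  # ignore empty rows in the accession list
        base = acc.split(".")[0]
        annotated.append((acc, table.get(base)))
    return annotated
```

Passing the open taxid.txt file to parse_taxid_table and the open accession list to annotate_rows yields as many output rows as there were input rows. Any row paired with None is an accession that is absent from the lookup table and worth checking by hand.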

a7032018 commented 4 years ago

It is probably because most of the viruses haven't been "taxonomically classified". You can simply pull out some of the entries lacking a TaxID and check whether they really have no taxonomy ID by searching GenBank with their AccID.

bosborne commented 4 years ago

I don’t think that’s it. NCBI is happy to put sequence entries in unclassified “clades” like this:

https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Tree&id=35255&lvl=3&lin=f&keep=1&srchmode=1&unlock

I believe they “classify” everything, but I’d like to see an example of a protein without a Taxonomy id, if you have one. I’m not an NCBI Taxonomy expert, I just use Entrez.

But actually I’m not even talking about viruses, I’m talking about entries like this:

https://www.ncbi.nlm.nih.gov/protein/HHZ01689.1

It has a taxon id, but Entrez will not return it, given the accession, as part of a batch eLink query.

ArifaKhanLab commented 4 years ago

Hi Stephanie, I would suggest fetching the TaxID the way Brian mentioned, by mapping your AccID against prot.accession2taxid.gz, which is available on the NCBI FTP site.
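
The mapping suggested here can be sketched in Python by streaming the compressed file and keeping only the accessions of interest. This assumes the accession2taxid layout of a header line followed by four tab-separated columns (accession, accession.version, taxid, gi). `load_taxids` is a hypothetical helper name, and because the file is large the sketch reads it line by line rather than loading it whole.

```python
import gzip


def load_taxids(path, wanted):
    """Return {accession.version: taxid} for the accessions in `wanted`.

    `wanted` is an iterable of versioned accessions, e.g. taken from BLAST hits.
    """
    wanted = set(wanted)
    found = {}
    with gzip.open(path, "rt") as fh:
        next(fh, None)  # skip the header line
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed lines
            acc_ver, taxid = fields[1], fields[2]
            if acc_ver in wanted:
                found[acc_ver] = taxid
                if len(found) == len(wanted):
                    break  # every requested accession has been resolved
    return found
```

Accessions missing from the returned dictionary were not present in the file. If the hits carry nucleotide accessions, as the nuccore command earlier in the thread suggests, the nucl_gb.accession2taxid.gz file in the same FTP directory would presumably be the one to read instead.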