DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

converting seqId to readable names #176

Open rchikhi opened 4 years ago

rchikhi commented 4 years ago

Hi,

When running Centrifuge against the nt database, I couldn't find a way to convert the returned seqIds to their corresponding sequence names. Unless I missed something, here is a one-liner (using Biopython) that queries the Entrez database to do this:

echo $seqId | python -c "import sys; from Bio import Entrez, SeqIO; Entrez.email = 'A.N.Other@example.com'; print(SeqIO.read(Entrez.efetch(db='nucleotide', id=Entrez.read(Entrez.esearch(db='nuccore', term=sys.stdin.read()))['IdList'][0], rettype='gb', retmode='text'), 'genbank').description)"

With seqId=CU928148.1, it returns:

Escherichia coli str. UMN026 plasmid p1ESCUM, complete genome

Best, Rayan

mourisl commented 4 years ago

You can use centrifuge-inspect with the --conversion-table option to generate the conversion table from seqId to taxonomy ID. You can then use centrifuge-inspect with the --name-table option (or other methods) to map the taxonomy ID to the taxonomy name. Is this what you needed? Thanks.
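For illustration, the two-step lookup described above could be joined in a few lines of Python. This is only a sketch: it assumes both tables are plain tab-separated text with the key in the first column (seqId then taxonomy ID for the conversion table, taxonomy ID then name for the name table), which should be checked against the actual centrifuge-inspect output. The function names and the sample rows are hypothetical.

```python
def read_two_column_table(text):
    """Parse tab-separated text into a dict: first column -> second column."""
    table = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:  # skip blank or malformed lines
            table[fields[0]] = fields[1]
    return table


def seqid_to_taxname(conversion_text, name_text):
    """Map each seqId to a (taxonomy ID, taxonomy name) pair.

    Assumed layouts (not verified against real output):
      conversion table: seqId <TAB> taxonomy ID
      name table:       taxonomy ID <TAB> taxonomy name
    """
    seq_to_tax = read_two_column_table(conversion_text)
    tax_to_name = read_two_column_table(name_text)
    return {
        seq_id: (tax_id, tax_to_name.get(tax_id, "unknown"))
        for seq_id, tax_id in seq_to_tax.items()
    }


if __name__ == "__main__":
    # Illustrative rows only, not real centrifuge-inspect output.
    conversion = "CU928148.1\t585056\n"
    names = "585056\tEscherichia coli UMN026\n"
    for seq_id, (tax_id, name) in seqid_to_taxname(conversion, names).items():
        print(seq_id, tax_id, name, sep="\t")
```

In practice the two input strings would come from files holding the saved output of the two centrifuge-inspect runs. Note that the result is a taxonomy name, not the name of the individual sequence.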

rchikhi commented 4 years ago

Hi, thanks for the response. I believe the method you suggest doesn't really give me the sequence name (it would only give the species name from the taxonomy, right?). I'd like to get the sequence name, e.g. to determine whether that sequence is a plasmid or not. Best, Rayan
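To make the plasmid use case concrete: once a sequence description is available (for example from the Entrez query earlier in this thread), a rough check could be a simple keyword match. This is a heuristic sketch, not something Centrifuge provides; the function name is hypothetical, and it assumes the description mentions the word "plasmid" whenever the sequence is one.

```python
def looks_like_plasmid(description):
    """Heuristic: True if the sequence description mentions 'plasmid'.

    Assumes plasmid records say so in their description line,
    which will not hold for every record.
    """
    return "plasmid" in description.lower()


if __name__ == "__main__":
    desc = "Escherichia coli str. UMN026 plasmid p1ESCUM, complete genome"
    print(looks_like_plasmid(desc))
```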