bbuchfink / diamond

Accelerated BLAST compatible local sequence aligner.
GNU General Public License v3.0
1.07k stars 182 forks source link

option to obtain a single sequence from diamond database #271

Closed pcantalupo closed 4 years ago

pcantalupo commented 5 years ago

Hello, I'd like to extract a single sequence from a Diamond database. Is this possible? It would be analogous to this command in BLAST+

blastdbcmd -db ref_viruses_rep_genomes -entry NC_001543

Thank you

bbuchfink commented 5 years ago

The only way at the moment is to use diamond getseq -d dbfile to get all sequences as FASTA, then filter the one you want. It is not very efficient though.

terrycojones commented 5 years ago

Hi @pcantalupo. If you're using Python, I have some code you can use to do this.

If you install https://github.com/acorg/dark-matter/ there's a command in the bin directory called make-fasta-database.py. Run that with the FASTA you give to DIAMOND and it will create you an sqlite3 database file. You can then use the SqliteIndex class in dark/fasta.py (see https://github.com/acorg/dark-matter/blob/master/dark/fasta.py#L229) in your code:

from __future__ import print_statement
from dark.fasta import SqliteIndex

sequences = SqliteIndex('database-file.sqlite3')

# sequences can be used like a dictionary and will do the database lookup behind the scenes.
s = sequences['sequence-id']

# s is an instance of dark.reads.Read, so:
print(s.sequence, s.id, len(s))
print(s.toString('fasta'), end='')

Note that the sequence ids in the dark-matter code do not get truncated at the first space, so the sequence-id bit in the above has to be the full sequence id, as given in the FASTA file. The make-fasta-database.py command can be given multiple files and can read gzip and BGZF format. The sequences are not stored in the database, their offsets are, so a lookup results in an open of the original file and a seek to get the sequence. That means you can't rename the original input FASTA files (though you can put them in another directory). It would be a little faster to put the sequences into the database, but I didn't want to do that as it would double storage.

Hope that helps.