Closed pcantalupo closed 4 years ago
The only way at the moment is to use diamond getseq -d dbfile
to get all sequences as FASTA, then filter the one you want. It is not very efficient though.
Hi @pcantalupo. If you're using Python, I have some code you can use to do this.
If you install https://github.com/acorg/dark-matter/ there's a command in the bin
directory called make-fasta-database.py
. Run that with the FASTA you give to DIAMOND and it will create you an sqlite3 database file. You can then use the SqliteIndex
class in dark/fasta.py
(see https://github.com/acorg/dark-matter/blob/master/dark/fasta.py#L229) in your code:
from __future__ import print_statement
from dark.fasta import SqliteIndex
sequences = SqliteIndex('database-file.sqlite3')
# sequences can be used like a dictionary and will do the database lookup behind the scenes.
s = sequences['sequence-id']
# s is an instance of dark.reads.Read, so:
print(s.sequence, s.id, len(s))
print(s.toString('fasta'), end='')
Note that the sequence ids in the dark-matter code do not get truncated at the first space, so the sequence-id
bit in the above has to be the full sequence id, as given in the FASTA file. The make-fasta-database.py
command can be given multiple files and can read gzip
and BGZF
format. The sequences are not stored in the database, their offsets are, so a lookup results in an open of the original file and a seek to get the sequence. That means you can't rename the original input FASTA files (though you can put them in another directory). It would be a little faster to put the sequences into the database, but I didn't want to do that as it would double storage.
Hope that helps.
Hello, I'd like to extract a single sequence from a Diamond database. Is this possible? It would be analogous to this command in BLAST+
blastdbcmd -db ref_viruses_rep_genomes -entry NC_001543
Thank you