Open averagehat opened 8 years ago
It seems to be a convention to join the db and id fields with | (as in db|id). Perhaps we can use a regexp to parse out the metadata if needed.
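A minimal sketch of the regexp idea, assuming deflines of the form `gi|66816243|ref|XP_642131.1| title text`. The function name `parse_defline` and the returned dictionary layout are hypothetical, not part of Seqr. Databases that use three fields (for example `sp|P54670.1|CAF1_DICDI`) are not handled specially here: the trailing name ends up at the start of the title.

```python
import re

# Matches one "db|id" pair, e.g. "gi|66816243" or "ref|XP_642131.1".
# Assumes the identifier itself contains no "|" and no whitespace.
PAIR_RE = re.compile(r"([A-Za-z]+)\|([^|\s]+)")


def parse_defline(defline):
    """Split a single FASTA defline into its db|id pairs and free-text title."""
    text = defline.lstrip(">")
    ids = {}
    pos = 0
    while True:
        match = PAIR_RE.match(text, pos)
        if match is None:
            break
        ids[match.group(1)] = match.group(2)
        pos = match.end()
        # Skip the "|" that separates this pair from the next one (or the title).
        if text.startswith("|", pos):
            pos += 1
    return {"ids": ids, "title": text[pos:].strip()}
```

Calling `parse_defline` on a defline gives a dictionary with an `ids` mapping (database tag to identifier) and a `title` string, which could then be merged into whatever document format the index expects.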
Hi Mike,
Yes, it would be very helpful to have such a program. This was one of the "reach" goals for the hackathon and not that difficult to do…
Best, Lewis
From: Mike Panciera [mailto:notifications@github.com] Sent: Friday, August 21, 2015 11:25 AM To: DCGenomics/seqr seqr@noreply.github.com Subject: [seqr] Blast Databse to JSON/Solr index (#23)
We can imagine that someone (say, me) wants to dump their existing blast database (say nr/nt) into something Seqr-compatible.
blastdbcmd can dump FASTA entries like so:
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
@lianyi Do you have anything for this?
One can specify the output format of blastdbcmd, so maybe the thing to do is have it output a TSV file, which Seqr could accept as an alternative to JSON.
I will try this and see how it works out.
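If the TSV route works, converting it to JSON is straightforward. The sketch below assumes blastdbcmd wrote one tab-separated row per entry with the columns in the order accession, gi, taxid, length, title. The column names and the function `tsv_to_json_lines` are hypothetical and are not Seqr's actual schema.

```python
import json

# Hypothetical column order; it must match the order of the specifiers
# given to -outfmt. These field names are NOT Seqr's real schema.
COLUMNS = ["accession", "gi", "taxid", "length", "title"]


def tsv_to_json_lines(tsv_handle, out_handle, columns=COLUMNS):
    """Write one JSON object per TSV row; return the number of rows written."""
    count = 0
    for line in tsv_handle:
        row = line.rstrip("\r\n").split("\t")
        if len(row) != len(columns):
            continue  # skip malformed rows rather than guess at their layout
        doc = dict(zip(columns, row))
        if "length" in doc:
            doc["length"] = int(doc["length"])
        json.dump(doc, out_handle)
        out_handle.write("\n")
        count += 1
    return count
```

Writing one JSON object per line keeps memory use flat for a database the size of nr, since no row has to be held after it is written.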
Let me know if y'all want to interface with Tom Madden, head of BLAST.
Cheers!
Ben
I discovered that running blastdbcmd with -outfmt options, e.g.
blastdbcmd -db databases/ncbi/blast/nr/nr -entry all -outfmt "%s,%a,%g,%o,%i,%t,%l,%h,%T,%X,%e,%L,%C,%S,%N,%B,%K,%P" -target_only
takes a (prohibitively?) long time to run (and can't be parallelized simply, as far as I know). Dumping to FASTA format is much faster, but it seems you lose some of the metadata.
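Since the FASTA dump is the fast path, one option is to stream it and recover what metadata the deflines carry. This is a rough sketch, assuming nr-style merged deflines are joined with " >" as in the example earlier in this thread. The names `read_fasta`, `split_merged_defline`, and `fasta_to_docs`, and the document layout, are hypothetical.

```python
def read_fasta(handle):
    """Yield (defline, sequence) tuples from a FASTA stream, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_merged_defline(defline):
    """Split a merged defline into its individual entries (assumes ' >' joins them)."""
    return [part.strip() for part in defline.split(" >") if part.strip()]


def fasta_to_docs(handle):
    """Yield one dictionary per sequence, keeping every defline that shares it."""
    for defline, sequence in read_fasta(handle):
        yield {"deflines": split_merged_defline(defline), "sequence": sequence}
```

Because everything is a generator, the dump can be piped through without loading the whole database into memory. Taxonomy fields and similar data that are not present in the defline would still be missing.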
Mike,
It could be that some of the fields you are requesting are slow while others are fast. BLAST stores its data in multiple files.
Best, Lewis
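One way to check which fields are slow is to time each format specifier separately on a small set of entries. The sketch below is only an illustration: the function `time_outfmt_fields` and its `run` parameter are hypothetical, and on a small sample the process start-up cost may dominate, so the numbers are only a rough guide.

```python
import subprocess
import time


def time_outfmt_fields(db, entries, fields, run=subprocess.run):
    """Run blastdbcmd once per format specifier and return seconds taken per field.

    `run` is injectable so the function can be exercised without BLAST installed.
    """
    timings = {}
    for field in fields:
        cmd = ["blastdbcmd", "-db", db, "-entry", ",".join(entries),
               "-outfmt", field, "-target_only"]
        start = time.perf_counter()
        run(cmd, stdout=subprocess.DEVNULL, check=True)  # discard the output
        timings[field] = time.perf_counter() - start
    return timings
```

For example, `time_outfmt_fields("databases/ncbi/blast/nr/nr", ["XP_642131.1"], ["%a", "%T", "%S"])` would return a dictionary mapping each specifier to its elapsed time, which could then be sorted to find the expensive ones.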
We can imagine that someone (say, me) wants to dump their existing blast database (say nr/nt) into something Seqr-compatible.
blastdbcmd can dump FASTA entries like so:
We have an index command but it doesn't know about the metadata between the | characters. @lianyi Do you have anything for this?