NCBI-Hackathons / seqr

Creative Commons Zero v1.0 Universal

Blast Database to JSON/Solr index #23

Open averagehat opened 8 years ago

averagehat commented 8 years ago

We can imagine that someone (say, me) wants to dump their existing blast database (say nr/nt) into something Seqr-compatible.

blastdbcmd can dump FASTA entries like so:

>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK

We have an index command but it doesn't know about the metadata between the |. @lianyi Do you have anything for this?

lianyi commented 8 years ago

It seems to be a convention to concatenate the db|id pairs with |. Perhaps we can use a regexp to parse out the metadata if needed.
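A minimal sketch of that regexp idea, assuming every entry in a merged defline follows the `gi|<number>|<db>|<accession>|<optional name> <description>` shape seen in the example above, and that entries are separated by ` >`. The function name `parse_defline` and the dictionary keys are made up for illustration; they are not part of Seqr or BLAST.

```python
import re

# Assumed entry shape: gi|<number>|<db>|<accession>|<optional name> <description>
ENTRY_RE = re.compile(
    r"gi\|(?P<gi>\d+)\|(?P<db>[a-z]+)\|(?P<accession>[^|\s]*)\|"
    r"(?P<name>\S*)\s*(?P<description>.*)"
)

def parse_defline(defline):
    """Split a merged FASTA defline into one dict per 'gi|...' entry."""
    entries = []
    # Drop the leading '>' and split on the ' >' that separates merged entries.
    for chunk in defline.lstrip(">").split(" >"):
        match = ENTRY_RE.match(chunk.strip())
        if match:  # silently skip anything that does not fit the assumed shape
            entries.append(match.groupdict())
    return entries
```

Deflines that use other identifier layouts, or descriptions that themselves contain ` >`, would need a more careful pattern.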

lewisg-ncbi commented 8 years ago

Hi Mike,

Yes, it would be very helpful to have such a program. This was one of the "reach" goals for the hackathon and not that difficult to do…

Best, Lewis

From: Mike Panciera [mailto:notifications@github.com] Sent: Friday, August 21, 2015 11:25 AM To: DCGenomics/seqr seqr@noreply.github.com Subject: [seqr] Blast Databse to JSON/Solr index (#23)


averagehat commented 8 years ago

One can specify the output format of blastdbcmd, so maybe the thing to do is have it output a TSV file, which Seqr could accept as an alternative to JSON.

I will try this and see how it works out
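As a rough sketch of what the TSV route could look like on the consuming side: the snippet below turns tab-separated rows into one JSON document per row. The column order in `COLUMNS` and the function name `tsv_to_json_lines` are assumptions for illustration only; the real order depends on how the dump was written, and the field names Seqr's index expects are not specified in this thread.

```python
import json

# Hypothetical column order; it must match the order the dump was written in.
COLUMNS = ["accession", "gi", "title", "sequence"]

def tsv_to_json_lines(lines, columns=COLUMNS):
    """Yield one JSON document (as a string) per tab-separated input line."""
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) != len(columns):
            continue  # skip malformed rows rather than guessing at their layout
        yield json.dumps(dict(zip(columns, row)))
```

Because it works on any iterable of lines, it can be fed an open file handle and streamed without loading the whole dump into memory.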

DCGenomics commented 8 years ago

Let me know if y'all want to interface with Tom Madden, head of BLAST.

Cheers!

Ben

On Aug 26, 2015 12:32 PM, "Mike Panciera" notifications@github.com wrote:


averagehat commented 8 years ago

I discovered that running blastdbcmd with -outfmt options, e.g.

blastdbcmd -db databases/ncbi/blast/nr/nr -entry all -outfmt "%s,%a,%g,%o,%i,%t,%l,%h,%T,%X,%e,%L,%C,%S,%N,%B,%K,%P" -target_only

takes a (prohibitively?) long time to run (and can't be parallelized simply, as far as I know). Dumping to FASTA format is much faster, but it seems you lose some of the metadata.

lewisg-ncbi commented 8 years ago

Mike,

It could be that some of the fields you are requesting are slow while others are fast. BLAST stores its data in multiple files.
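One way to check this, sketched under assumptions: run blastdbcmd once per format field on a small sample of entries and compare wall-clock times. The helper names `build_command` and `time_fields` are hypothetical, and the sample identifiers are whatever the caller passes in. It only uses the `-db`, `-entry`, `-outfmt`, and `-target_only` options already shown in the command above, with `-entry` given a comma-separated list of identifiers.

```python
import subprocess
import time

# The format fields from the command earlier in the thread.
FIELDS = "%s,%a,%g,%o,%i,%t,%l,%h,%T,%X,%e,%L,%C,%S,%N,%B,%K,%P".split(",")

def build_command(db, entries, field):
    """Build a blastdbcmd call that requests a single output field."""
    return ["blastdbcmd", "-db", db, "-entry", ",".join(entries),
            "-outfmt", field, "-target_only"]

def time_fields(db, entries, fields=FIELDS, run=subprocess.run):
    """Return {field: seconds} for one blastdbcmd run per field."""
    timings = {}
    for field in fields:
        start = time.perf_counter()
        # Output is discarded; only the elapsed time is of interest here.
        run(build_command(db, entries, field),
            stdout=subprocess.DEVNULL, check=True)
        timings[field] = time.perf_counter() - start
    return timings
```

With a very small sample, process start-up will dominate the timings, so a few thousand identifiers is probably a more telling test. The `run` parameter exists only so the function can be exercised without blastdbcmd installed.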

Best, Lewis

From: Mike Panciera [mailto:notifications@github.com] Sent: Wednesday, September 02, 2015 5:44 PM To: NCBI-Hackathons/seqr seqr@noreply.github.com Cc: Geer, Lewis (NIH/NLM/NCBI) [E] lewisg@ncbi.nlm.nih.gov Subject: Re: [seqr] Blast Databse to JSON/Solr index (#23)
