blahah / transrate

Understand your transcriptome assembly
http://hibberdlab.com/transrate

Add support for BLAST DB aliases #140

Open standage opened 9 years ago

standage commented 9 years ago

I have the NCBI non-redundant protein database downloaded on my server. When I want to make species- or clade-specific databases (e.g. all the Hymenoptera, or all the Hexapoda), instead of downloading those sequences again in FASTA format, I simply download the GI numbers from NCBI and use blastdb_aliastool to create a database alias for that data set. No new sequences are downloaded, and searches against the db alias are restricted to just those GI numbers. This approach saves a lot of space and time.
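
For readers unfamiliar with this workflow, here is a rough sketch of how such an alias might be created from Ruby. The helper name, file names, and title are made up for illustration. The flags are the ones blastdb_aliastool documented for GI lists; newer BLAST+ releases favour accession lists via -seqidlist, so check the help output of your installed version.

```ruby
# Hypothetical helper (not part of transrate): build the argument list for
# blastdb_aliastool so that a GI list restricts an existing protein database.
def alias_tool_args(source_db, gi_list, alias_name, title)
  [
    "blastdb_aliastool",
    "-db", source_db,      # existing database, e.g. a local copy of nr
    "-dbtype", "prot",     # protein database, so the alias file ends in .pal
    "-gilist", gi_list,    # file of GI numbers downloaded from NCBI
    "-out", alias_name,    # name of the alias to create
    "-title", title
  ]
end

# Example values are assumptions, not taken from the issue.
args = alias_tool_args("nr", "hymenoptera.gi", "nr_hymenoptera", "nr restricted to Hymenoptera")

# To actually create the alias (requires BLAST+ on the PATH):
#   system(*args)
```

Passing the arguments as an array to system avoids shell quoting problems with the title.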

When I try to run transrate against a db alias, I get the message "Reference fasta file does not exist."

Feature request: add a --dbalias flag to indicate that the value passed to the --reference option is not a FASTA file to be indexed, but a db alias. The presence of the alias can be tested by looking for ${reference}.pal, and ${reference} can be passed directly to the BLAST command as its -db option. The change should be fairly simple, and I would implement this myself and submit a PR, except that I have no experience with Ruby or Transrate's internals. :(
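
A minimal sketch of what this check could look like in Ruby, assuming the proposed (not yet existing) --dbalias flag has been set. Both method names are hypothetical and nothing here reflects Transrate's actual internals. The check also accepts .nal, the extension BLAST uses for nucleotide aliases, alongside the .pal file mentioned above.

```ruby
# Hypothetical helper: true if `reference` names a BLAST database alias.
# Protein aliases are stored as <name>.pal, nucleotide aliases as <name>.nal.
def db_alias?(reference)
  File.exist?("#{reference}.pal") || File.exist?("#{reference}.nal")
end

# Hypothetical helper: build a BLAST+ argument list that passes the alias
# name straight to -db, skipping any FASTA indexing step.
def blast_args(program, query, reference)
  unless db_alias?(reference)
    raise ArgumentError, "no BLAST db alias found for #{reference}"
  end
  [program, "-query", query, "-db", reference]
end
```

With an alias named nr_hymenoptera in the working directory, blast_args("blastx", "contigs.fa", "nr_hymenoptera") would return the arguments for a search restricted to that alias.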

blahah commented 9 years ago

Thanks for the suggestion, this looks like a good idea. We actually use the reference FASTA in two ways:

  1. in transrate (here and here), to create a hash for fast lookup of a reference sequence's details (useful when calculating coverage)
  2. in CRB-BLAST (here), to check what kind of sequences are in the reference (protein or nucleotide)

Both of these can be worked around, so we'll dig into it.
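
One possible shape for those workarounds, sketched under assumptions rather than taken from transrate or CRB-BLAST: sequence lengths could be read from the database itself (for example from the output of blastdbcmd -db REFERENCE -entry all -outfmt "%a %l"), and the sequence type could be inferred from the alias file extension. Both method names are hypothetical.

```ruby
# Hypothetical sketch, not transrate's or CRB-BLAST's actual code.

# Parse lines of "id length" (whitespace-separated) into a hash of
# id => length, giving a lookup similar to one built from a FASTA file.
def parse_id_lengths(text)
  text.each_line.with_object({}) do |line, lengths|
    id, length = line.split
    next if id.nil? || length.nil?   # skip blank or malformed lines
    lengths[id] = Integer(length)
  end
end

# Infer the sequence type from the alias file: .pal marks a protein
# alias, .nal a nucleotide alias. Returns nil if neither file exists.
def alias_molecule_type(reference)
  return :prot if File.exist?("#{reference}.pal")
  return :nucl if File.exist?("#{reference}.nal")
  nil
end
```

Note that the hash here only holds lengths; if coverage calculation needs more than that per sequence, the blastdbcmd output format would have to be extended accordingly.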

cc @cboursnell