cboursnell / crb-blast

Conditional Reciprocal Best Blast

Remove target name mangling #7

Closed camillescott closed 8 years ago

camillescott commented 9 years ago

The target name appears to be mangled during output: it is split on pipes and only the first token is used. This breaks on many databases; for example, UniProt sequences are formatted db|accession|entry_name (or something to that effect), and with this system all you get is the db token (which is often just "tr" or "sp" or something equally uninformative).
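To illustrate the symptom, here is a minimal sketch. It assumes the output step does the equivalent of splitting on "|" and keeping the first token, as described above; `first_pipe_token` is a hypothetical stand-in, not the actual crb-blast code.

```ruby
# Hypothetical stand-in for the behaviour described in this issue;
# this is NOT the actual crb-blast implementation.
def first_pipe_token(name)
  # Split on pipes and keep only the first token.
  name.split("|").first
end

# UniProt-style target names (taken from this thread).
targets = [
  "tr|C3Y8C9|C3Y8C9_BRAFL",
  "sp|O47426|ATP6_BRAFL",
  "sp|C4A0D9|BAP1_BRAFL"
]

mangled = targets.map { |t| first_pipe_token(t) }
puts mangled.inspect # => ["tr", "sp", "sp"]
```

Under that assumption, three distinct targets collapse to two uninformative labels, so hits can no longer be traced back to their sequences.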

blahah commented 8 years ago

We take the entry_id because BLAST mangles names itself, and if we don't do it ourselves we can't match up the results.

However, we're doing it wrong here. We should be using the BioRuby FASTA defline parser to parse out the entry_id:

Bio::FastaDefline.new(somestr).entry_id

It's still not perfect because of the knock-on effects of the great NCBI FASTA format controversy, but it handles the commonest cases.

blahah commented 8 years ago

@camillescott do you have some example deflines we can use for a regression test?

camillescott commented 8 years ago

Really any non-NCBI defline with a pipe in it. Here are some uniprot ones from one of my databases:

tr|C3Y8C9|C3Y8C9_BRAFL
sp|O47426|ATP6_BRAFL
sp|C4A0D9|BAP1_BRAFL
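A regression check built on these deflines might look roughly like the sketch below. `target_id` is a hypothetical helper, not part of crb-blast or BioRuby. It assumes that keeping the whole first whitespace-delimited word of the defline is acceptable, which may not hold if BLAST rewrites the name itself.

```ruby
# Hypothetical helper (not part of crb-blast or BioRuby): drop a leading ">"
# and keep the entire first whitespace-delimited word, pipes included.
def target_id(defline)
  defline.sub(/\A>/, "").split.first
end

# Non-NCBI deflines containing pipes, from the comment above.
deflines = [
  "tr|C3Y8C9|C3Y8C9_BRAFL",
  "sp|O47426|ATP6_BRAFL",
  "sp|C4A0D9|BAP1_BRAFL"
]

ids = deflines.map { |d| target_id(d) }

# The regression being guarded against: distinct targets must not collapse
# to the same identifier, and the accession must survive.
raise "identifiers collide" unless ids.uniq.length == deflines.length
raise "accession lost" unless ids.first.include?("C3Y8C9")

puts ids
```

The two checks are the point here, not the helper. Whatever parser is eventually used, it should give each of these targets its own identifier that still carries the accession.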