Closed camillescott closed 8 years ago
We take the entry_id because BLAST mangles names itself, and if we don't do it ourselves we can't match up the results.
However, we're doing it wrong here. We should be using the BioRuby FASTA defline parser to parse out the entry_id:
Bio::FastaDefline.new(somestr).entry_id
It's still not perfect because of the knock-on effects of the great NCBI FASTA format controversy, but it handles the commonest cases.
@camillescott do you have some example deflines we can use for a regression test?
Really any non-NCBI defline with a pipe in it. Here are some uniprot ones from one of my databases:
tr|C3Y8C9|C3Y8C9_BRAFL
sp|O47426|ATP6_BRAFL
sp|C4A0D9|BAP1_BRAFL
The target name appears to be mangled during output, where it is split on pipes and only the first token is used. This breaks on many databases; for example, UniProt sequences are formatted <db>|<accession>|<name> (or something to that effect), and with this system all you get is the first field (which is often just "tr" or "sp" or something equally uninformative).
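For anyone wanting to reproduce this without BLAST, here is a minimal plain-Ruby sketch of the behaviour described above, using the three deflines quoted in this thread. The helper names (first_pipe_token, first_word) are hypothetical and are not taken from the project's code. first_word is only a simple baseline for comparison, not the BioRuby parser suggested in the thread.

```ruby
# Deflines quoted in this thread (UniProt-style, pipe-delimited).
DEFLINES = [
  "tr|C3Y8C9|C3Y8C9_BRAFL",
  "sp|O47426|ATP6_BRAFL",
  "sp|C4A0D9|BAP1_BRAFL"
].freeze

# Hypothetical stand-in for the reported behaviour:
# split on pipes and keep only the first token.
def first_pipe_token(defline)
  defline.split("|").first
end

# Hypothetical baseline (an assumption, not the project's fix):
# keep everything up to the first whitespace, pipes included.
def first_word(defline)
  defline.split(/\s+/, 2).first
end

MANGLED = DEFLINES.map { |d| first_pipe_token(d) }
KEPT    = DEFLINES.map { |d| first_word(d) }

puts "pipe-split ids: #{MANGLED.inspect}"
puts "whole-word ids: #{KEPT.inspect}"
puts "distinct after pipe split: #{MANGLED.uniq.size} of #{DEFLINES.size}"
```

The two sp entries collapse to the same name after the pipe split, which is why results can no longer be matched back to their targets. A regression test could assert that whatever parser is chosen keeps these three names distinct.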