Closed macmanes closed 9 years ago
I could fix this in vsearch if it does not conflict with usearch compatibility.
What was the vsearch command line used to generate 20M.ec.P2.score.vsearch.Trinity.fasta?
Have you previously generated this file using usearch and obtained results without the semicolon?
Another vsearch user recently asked me to put the semicolon in at the end of the header line, but it might be in a different context. See vsearch issue 36.
vsearch command:
vsearch --fasta_width 0 --threads 20 --id .99 \
--cluster_fast 20M.ec.P2.score.Trinity.fa --consout vsearch.fa
I then added the long contigs (>15000 bp) to vsearch.fa to make the final assembly file, as per https://github.com/torognes/vsearch/issues/52. Let's see what @Blahah has to say. If this is a transrate FASTA parsing issue that can be easily remedied, then maybe that is the most parsimonious fix?
FWIW, blast seems to ignore that trailing semicolon. Could be that it causes problems in various applications.
blastn tabular output
centroid=c16726_g1_i10;seqs=1 centroid=c16726_g1_i8;seqs=1 100.00 2029 0 0 7838 9866 7889 9917 0.0 3747
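To make the mismatch concrete, here is a minimal Python sketch that parses the tabular line above and compares the query ID against a FASTA header. It assumes the default 12-column tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which matches the number of fields shown. The `fasta_id` value is hypothetical: it is the BLAST query ID with a trailing semicolon appended, as a stand-in for what the assembly header might look like. `parse_hit` is an illustrative helper, not part of any tool discussed here.

```python
# Column names of the default BLAST tabular layout (assumed to apply here).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_hit(line):
    """Split one whitespace-delimited tabular line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(BLAST_COLUMNS):
        raise ValueError(f"expected {len(BLAST_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(BLAST_COLUMNS, fields))


line = (
    "centroid=c16726_g1_i10;seqs=1 centroid=c16726_g1_i8;seqs=1 "
    "100.00 2029 0 0 7838 9866 7889 9917 0.0 3747"
)
hit = parse_hit(line)

# Hypothetical FASTA header ID with the trailing semicolon still attached.
fasta_id = "centroid=c16726_g1_i10;seqs=1;"

print(hit["qseqid"] == fasta_id)              # False: exact comparison fails
print(hit["qseqid"] == fasta_id.rstrip(";"))  # True: matches once the ';' is stripped
```

An exact string comparison fails only because of the final character, which is consistent with a contig being reported as missing even though it is present in the assembly.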
It seems like usearch includes that trailing semicolon in a similar command so I do not want to remove it in vsearch.
The problem appears to be that BLAST is stripping the trailing semicolon in its output, whereas the BioRuby FASTA defline parser doesn't strip it. So when we're matching up BLAST hits with contigs, they don't match. I don't think this is VSEARCH's fault (or transrate's); it's BLAST that's doing the weird thing.
However, since BLAST is a lumbering leviathan that we are unlikely to be able to change, we could add a workaround where we strip any trailing punctuation from FASTA deflines. The only concern with that would be if a FASTA were to have entries that had identical deflines except for the trailing punctuation, but that's pretty far-fetched.
Something like sed -i 's/;$//' filename
should work, assuming we're only targeting semicolons!
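For anyone who wants the same cleanup with a guard against the far-fetched collision case mentioned above, here is a rough Python sketch. It is not how transrate implements the workaround. `normalize_fasta_headers` is a hypothetical helper that removes a single trailing semicolon from header lines only and raises an error if two headers become identical after stripping.

```python
def normalize_fasta_headers(lines):
    """Strip one trailing ';' from FASTA header lines; leave sequence lines alone.

    Raises ValueError if two headers are identical after stripping.
    """
    seen = set()
    cleaned = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            header = line[1:]
            if header.endswith(";"):
                header = header[:-1]
            if header in seen:
                raise ValueError(f"duplicate header after stripping: {header}")
            seen.add(header)
            cleaned.append(">" + header)
        else:
            cleaned.append(line)
    return cleaned


# Small in-memory example; the sequence is made up for illustration.
records = [">centroid=c15935_g1_i1;seqs=2;", "ACGTACGT"]
print(normalize_fasta_headers(records))
# ['>centroid=c15935_g1_i1;seqs=2', 'ACGTACGT']
```

Unlike the sed one-liner, this only touches lines that start with `>`, and it stops instead of silently merging two entries that differ only by the trailing semicolon.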
This will be in the next beta @macmanes - should be out in the next day or so
Transrate beta1, tripping up during analysis of comparative metrics, on vsearch headers.
Contig and read-based metrics run fine, but the comparative metrics section is broken:
It's complaining about the contig
centroid=c15935_g1_i1;seqs=2
not being in the assembly. It's there, but the header in the assembly is: (note trailing semicolon)
Not sure whether this problem could be addressed in vsearch by @torognes (remove trailing ;) or by you guys, or if this is a blast fasta parsing issue.