Open macmanes opened 9 years ago
I'm not sure yet about the best solution to this problem.
The problem stems from the fact that Trinity has moved to a header format that is very non-standard in practice, but that the NCBI claims is standard and uses in its own headers. NCBI databases like GenBank use the format >db_identifier|sequence_identifier, while almost every other system uses >sequence_identifier|other description.
Transrate uses the BioRuby FASTA parser, which has a pretty sensible design: it interprets everything before the first | or space as the sequence identifier, except in special cases for the NCBI databases.
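To make the rule concrete, here is a minimal shell sketch that approximates it (this is not the BioRuby code itself, and it ignores the NCBI special cases). The helper name `fasta_id` and the sample headers are made up for illustration.

```shell
# Hypothetical helper: approximate "everything before the first | or space
# is the sequence identifier". NCBI special cases are not handled.
fasta_id() {
  # Drop the leading '>' and then everything from the first '|' or space on.
  printf '%s\n' "$1" | sed -e 's/^>//' -e 's/[| ].*$//'
}

# Invented headers in the TR|c_g_i style.
fasta_id '>TR20|c0_g1_i1 len=300'   # prints TR20
fasta_id '>TR20|c0_g2_i1 len=250'   # prints TR20 again, so the two IDs collide
```

Under this rule, two distinct transcripts whose headers share the same prefix before the | end up with the same identifier.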
So the question is: do we extend the parser within Transrate to add a new special case for Trinity? Or modify the upstream BioRuby? Or do we ask the Trinity people not to follow this route (which seems to me likely to break a lot of tools)?
cc @cboursnell
The Trinity devs have agreed to change their output FASTA entry ID format to TR_c_g_i rather than TR|c_g_i, which will solve the problem from their next release onwards (see https://github.com/trinityrnaseq/trinityrnaseq/issues/21).
We still have the issue that any assemblies produced using the affected versions of Trinity will cause a parsing problem. Two options come to mind:
Just ran into this problem and solved it temporarily with a quick sed command. I think either fix suggested for the long term would work.
You mentioned that NCBI's syntax is non-standard in practice, although that has not been my experience. I've been referred to ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html (see Fasta Defline Format section) several times as the authoritative spec on the topic, and I've seen its use (especially the gnl|DbName|Accession format) quite a bit out in the wild.
I've now added some extra information to the error message that explains why Trinity might be the problem, and how to fix it if so.
The provided sed command can do some very strange things (underscores at the beginning of the fasta header and at the beginning of the assembled transcript). I found using:
sed -e 's/|/_/g'
worked a bit better.
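A possible variation, offered as a sketch rather than the command Transrate recommends: restricting the substitution to header lines (those starting with >) leaves sequence lines untouched by construction. The helper name `fix_headers` and the sample record are made up for illustration.

```shell
# Hypothetical helper: replace every '|' with '_' on FASTA header lines only.
# The /^>/ address limits the substitution to lines that start with '>'.
fix_headers() {
  sed -e '/^>/ s/|/_/g'
}

# Invented record in the TR|c_g_i style.
printf '>TR20|c0_g1_i1 len=4\nACGT\n' | fix_headers
```

On a real assembly this would read from the FASTA file and write to a new one, for example `fix_headers < assembly.fa > assembly.fixed.fa`, where both file names are placeholders.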
@peterdfields thanks - yes, the command in v1.0.1 only works on Mac OS X. In v1.0.2 we give separate Linux and OS X commands. Which version are you using?
v1.0.1. Time to update!
There has been a change in the https://github.com/trinityrnaseq/trinityrnaseq FASTA headers that does not work well with the Transrate FASTA parser. Specifically, with Transrate considering only the part of the header before the first space or |, TR20 is considered non-unique in the below example.
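As a rough way to check whether an assembly is affected, the following sketch lists identifiers that appear more than once after each header is cut at the first | or space. It only approximates the parser's rule, `duplicate_ids` is a hypothetical helper, and the headers are invented for illustration rather than taken from the original report.

```shell
# Hypothetical helper: read FASTA on stdin and print each identifier that
# occurs more than once when headers are cut at the first '|' or space.
duplicate_ids() {
  grep '^>' | sed -e 's/^>//' -e 's/[| ].*$//' | sort | uniq -d
}

# Invented headers: two TR20 transcripts and one TR21 transcript.
printf '>TR20|c0_g1_i1 len=4\nACGT\n>TR20|c0_g2_i1 len=4\nTTGA\n>TR21|c0_g1_i1 len=4\nGGCC\n' | duplicate_ids
```

Any identifier printed by this check is one the parser would treat as non-unique; empty output suggests the headers are unambiguous under this rule.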