blahah / transrate

Understand your transcriptome assembly
http://hibberdlab.com/transrate
Other
100 stars 34 forks source link

new Trinity fasta headers causing problems #127

Open macmanes opened 9 years ago

macmanes commented 9 years ago

There has been a change in the https://github.com/trinityrnaseq/trinityrnaseq fasta headers that does not work well with the Transrate fasta parser. Ppecifically, with Transrate considering only the parts of the header before the 1st space or |, TR20 is considered non-unique in the below example.

 >TR18|c0_g1_i1 len=387 path=[365:0-386] [-1, 365, -2]
 >TR19|c0_g1_i1 len=1400 path=[2755:0-1399] [-1, 2755, -2]
 >TR20|c0_g1_i1 len=2936 path=[6883:0-640 6884:641-660 6910:661-797 6909:798-883 6900:884-904 6901:905-2935] [-1, 6883, 6884, 6910, 6909, 6900, 6901, -2]
 >TR20|c0_g1_i2 len=1072 path=[6888:0-640 6911:641-827 6910:828-964 6879:965-1071] [-1, 6888, 6911, 6910, 6879, -2]
 >TR20|c0_g1_i3 len=2995 path=[6883:0-640 6884:641-660 6910:661-797 6909:798-883 6907:884-943 6904:944-962 6892:963-963 6893:964-2994] [-1, 6883, 6884, 6910, 6909, 6907, 6904, 6892, 6893, -2]
 >TR20|c0_g2_i1 len=2276 path=[6886:0-82 6906:83-243 6881:244-244 6882:245-2275] [-1, 6886, 6906, 6881, 6882, -2]
blahah commented 9 years ago

I'm not sure yet about the best solution to this problem.

The problem stems from the fact that Trinity has moved to using a system that is very non-standard in practise, but that the NCBI claims is standard and uses in its own headers. NCBI databases like genbank use the format >db_identifier|sequence_identifier, while almost every other system uses >sequence_identifier|other description.

Transrate uses the BioRuby FASTA parser, which has a pretty sensible design - it interprets everything before the first | or space as the sequence identifier, except in special cases for the NCBI databases.

So the question is, do we extend the parser within Transrate to add a new special case for Trinity? Or modify the upstream BioRuby? Or do we ask the Trinity people to not follow this route (which seems to me to be likely to break a lot of tools).

cc @cboursnell

blahah commented 9 years ago

The Trinity devs have agreed to change their output FASTA entry ID format to TR_c_g_I rather than TR|c_g_i, which will solve the problem from their next release onwards (see https://github.com/trinityrnaseq/trinityrnaseq/issues/21).

We still have the issue that any assemblies produced using the affected versions of Trinity will cause a parsing problem. Two options come to mind:

  1. We add a work-around (I'm more willing to do this now that we know the Trinity devs are responsive, as it means we are unlikely to have to add endless work-arounds)
  2. We catch the problem and print a helpful error message, perhaps including a unix one-liner that can rename the FASTA entries
standage commented 9 years ago

Just ran into this problem and solved it temporarily with a quick sed command. I think either fix suggested for the long term would work.

You mentioned that NCBI's syntax is non-standard in practice, although that has not been my experience. I've been referred to ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html (see Fasta Defline Format section) several times as the authoritative spec on the topic, and I've seen its use (especially the gnl|DbName|Accession format) quite a bit out in the wild.

blahah commented 9 years ago

I've now added some extra information to the error message that explains why Trinity might be the problem, and how to fix if it so.

peterdfields commented 8 years ago

The provided sed command can do some very strange things (underscores at the beginning of the fasta header and at the beginning of the assembled transcript). I found using:

sed -e 's/|/_/g'

worked a bit better.

blahah commented 8 years ago

@peterdfields thanks - yes, the command in v1.0.1 only works for the Mac version of OSX. In v1.0.2 we give separate linux and OSX commands. Which version are you using?

peterdfields commented 8 years ago

v1.0.1. Time to update!