blahah / transrate

Understand your transcriptome assembly
http://hibberdlab.com/transrate

Merge and cufflinks #154

Closed colindaven closed 9 years ago

colindaven commented 9 years ago

Thanks for this program and the nice docs. It certainly fills a gap.

I am creating a large transcript set from 13 different tissues of a non-model organism. 12 of the 13 sets come from Cufflinks, while the 13th comes from Trinity. How useful is Transrate for Cufflinks-derived datasets?

I'm afraid I couldn't find details on the merge command in the documentation.

More concretely, before the transrate merge I had 459k transcripts, and after it 405k. What constitutes a poor transcript that gets removed? We are particularly interested in keeping non-protein ORFs in the transcript set, since this is a poorly annotated non-model organism.

head stats_pre_transrate.txt
Reads: 459186
Bases: 545698663
Max: 26635
Min: 15
Avg: 1188,4
Median: 880
Mode: 230
Std_Dev: 997,1

head stats_post_transrate.txt
Reads: 405453
Bases: 497342372
Max: 26635
Min: 15
Avg: 1226,6
Median: 930
Mode: 400
Std_Dev: 1004,5

Command: transrate -t 32 --assembly ../src_fasta/10.fa,../src_fasta/11.fa,../src_fasta/12.fa,../src_fasta/13.fa,../src_fasta/1_new.fa,../src_fasta/2.fa,../src_fasta/3.fa,../src_fasta/4.fa,../src_fasta/5.fa,../src_fasta/6.fa,../src_fasta/7.fa,../src_fasta/8.fa,../src_fasta/9.fa --merge-assemblies=transrate1_out.fa

Thanks for your comments. Colin

blahah commented 9 years ago

Hi Colin,

It depends what your Cufflinks procedure was - did you just use the reads and the genome sequence, or did you pass in an existing genome annotation?

In the first case, you should be able to use transrate on the Cufflinks assemblies, as every putative transcript will be derived from the reads. However, in the second case you will need to filter the contigs that match the existing annotation out of the assemblies; there are several possible ways to do this.
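As a rough illustration of one such way, here is a minimal Python sketch. It assumes you have already worked out, with whatever comparison tool you prefer, which contig IDs match the existing annotation. The names `filter_fasta` and `drop_ids` are hypothetical and are not part of transrate or Cufflinks.

```python
def filter_fasta(lines, drop_ids):
    """Yield FASTA lines, skipping every record whose ID is in drop_ids.

    The record ID is taken to be the first whitespace-delimited token
    after '>' on the header line (an assumption about the FASTA headers).
    """
    keep = True  # lines before the first header are passed through
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in drop_ids
        if keep:
            yield line


# Example usage with hypothetical file names:
# drop_ids = {l.strip() for l in open("annotated_ids.txt") if l.strip()}
# with open("1_new.fa") as src, open("1_new.filtered.fa", "w") as dst:
#     dst.writelines(filter_fasta(src, drop_ids))
```

The filter streams the file line by line, so it should cope with large assemblies without loading them into memory.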

The merge command is undocumented and experimental, but is fairly simple. It just combines the contigs from all the input assemblies into a single assembly, then runs transrate as normal to score all the contigs. After scoring it uses the normal optimisation strategy to discard poorly assembled contigs, so that you should be left with the best from every assembly.
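The combining step on its own can be sketched as below. This is not transrate's actual code, just a minimal Python illustration of pooling contigs from several FASTA files into one. Prefixing each contig ID with the source file name is an assumption added here to avoid ID clashes between assemblies, and `merge_fasta` is a hypothetical name. Scoring and the discard step would still be done by transrate itself.

```python
from pathlib import Path


def merge_fasta(paths, out_path):
    """Concatenate FASTA files into out_path and return the record count.

    Each header is rewritten as '>{file_stem}_{original header}' so that
    contigs with the same ID in different assemblies stay distinguishable.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for path in paths:
            prefix = Path(path).stem
            with open(path) as handle:
                for raw in handle:
                    line = raw.rstrip("\r\n")
                    if not line:
                        continue  # drop blank lines
                    if line.startswith(">"):
                        out.write(f">{prefix}_{line[1:]}\n")
                        n_records += 1
                    else:
                        out.write(line + "\n")
    return n_records


# Example usage with hypothetical paths:
# n = merge_fasta(["../src_fasta/1_new.fa", "../src_fasta/2.fa"], "combined.fa")
```

Normalising line endings on write means a file that lacks a trailing newline cannot run into the next file's first header.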

In theory this is better than just combining the 'good' contigs from each assembly, because any contigs that are redundant between assemblies will compete with each other for reads during assignment, and the best one should always win, so you should end up with the best assembled version of each contig retained. However, in practice we have found this doesn't work as well as we hoped (too many contigs are kept), so we have been working on an alternative that seems to work much better. We will have something ready to test in the next couple of weeks.

colindaven commented 9 years ago

Thanks for your answer and the positive outlook.

Certainly transcript merging is a very complex issue but is of critical importance for non-model organisms.

I agree, in most cases too many contigs/transcripts are kept by almost all approaches (e.g. PASA, Cuffmerge).