Closed sr320 closed 1 year ago
It's not terribly important, but I like to know genus/species for my workflow. Do you happen to have that info?
Little neck clam
Trinity assembly stats:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 502826
Total trinity transcripts: 645444
Percent GC: 36.80
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3398
Contig N20: 2073
Contig N30: 1301
Contig N40: 862
Contig N50: 609
Median contig length: 319
Average contig: 516.94
Total assembled bases: 333658646
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 2495
Contig N20: 1337
Contig N30: 829
Contig N40: 588
Contig N50: 455
Median contig length: 300
Average contig: 439.38
Total assembled bases: 220929754
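For anyone reading the stats above, here is a minimal sketch of how an Nx value such as N50 can be computed from contig lengths. It is not the Trinity stats script itself; the function names `nx` and `fasta_lengths` are made up for illustration, and tie handling may differ from what Trinity reports.

```python
def nx(lengths, x=50):
    """Return the Nx statistic: the contig length L such that contigs of
    length >= L together hold at least x% of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


def fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line.strip())
    if length is not None:
        yield length
```

Usage would look something like `nx(list(fasta_lengths("Trinity.fasta")), 50)`, where the file name is a placeholder. The average contig length is simply total assembled bases divided by the number of contigs, e.g. 333658646 / 645444 ≈ 516.94 for the all-transcripts stats.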
I followed up this assembly using Transdecoder to get a better idea of how many of these "genes" are actually "functional" - this is done by using blastp and Pfam alignments, as well as using a Hidden Markov Model to predict open reading frames (ORFs).
Transdecoder identified:
74,533 genes (obviously a massive reduction from what Trinity counts). This number includes both complete and partial genes (i.e. ORFs).
28,451 complete ORFs.
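A rough sketch of how complete vs. partial ORFs could be tallied from a Transdecoder peptide FastA. This assumes each header line carries a `type:<orf_type>` token (e.g. `type:complete`, `type:5prime_partial`), which is how Transdecoder output usually looks but is not confirmed for this particular run. The function name `count_orf_types` is hypothetical. Note that it counts ORF records, not genes, so the totals will not necessarily match the gene count above.

```python
import re
from collections import Counter

# Assumption: headers contain a token such as "type:complete".
ORF_TYPE = re.compile(r"\btype:(\S+)")


def count_orf_types(lines):
    """Count ORF types found on FASTA header lines; other lines are skipped."""
    counts = Counter()
    for line in lines:
        if not line.startswith(">"):
            continue
        match = ORF_TYPE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

With a file on disk, this could be called as `count_orf_types(open(pep_path))`, where `pep_path` points at the peptide FastA in the Transdecoder output directory.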
Links:
Trinity assembly directory:
Trinity FastA:
1b5029cd4dbd5ff55bcf81c8dd62f236
Trinity FastA index:
Trinity stats file:
Transdecoder directory:
Transdecoder GFF3:
Transdecoder BED:
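The 32-character hex string listed under the Trinity FastA looks like an MD5 checksum (that is an assumption based on its length). A minimal sketch for checking a downloaded copy against it, with hypothetical helper names `md5sum` and `verify`:

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5sum(path) == expected.strip().lower()
```

For example, `verify("Trinity.fasta", "1b5029cd4dbd5ff55bcf81c8dd62f236")` should return `True` for an intact download, where `Trinity.fasta` stands in for the local file name.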
@kubu4, do you typically see a big drop-off in mapping rates when using Transdecoder? It's not a direct comparison, but the mapping rate went from 50-92% (depending on tissue sample) with Giles' Trinity transcriptome to 21-34% with the Transdecoder-filtered transcriptome. Of course I would expect mapping to drop when contigs are removed, but this seems like a lot, so I wonder if you have any comparison?
Oh, here is a spreadsheet of mapping rates with various assemblies. The comparison above corresponds to column J (Trinity transcriptome) and column K (Transdecoder filtered transcriptome). You probably have to request permission from Giles to view https://docs.google.com/spreadsheets/d/15AVtCiVotFQotXtILx2pZ3gQswLyTew_S3eBaPM6yKc/edit#gid=0
I've never tried mapping reads back to the Transdecoder results before. However, that difference in mapping rates isn't surprising when you look at the number of "genes" identified by Trinity (502826) and the number of complete/partial ORFs identified by Transdecoder (74533) - that's a reduction of ~85%...
So, the mapping rates you're seeing when comparing the Trinity transcriptome and the Transdecoder results seem like what would be expected.
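For reference, the ~85% figure comes straight from the two counts reported in this thread. A quick check (variable names are just for illustration):

```python
trinity_genes = 502826      # Trinity 'genes' from the assembly stats
transdecoder_genes = 74533  # complete + partial ORF "genes" from Transdecoder

retained = transdecoder_genes / trinity_genes
reduction = 1 - retained

print(f"retained: {retained:.1%}, reduction: {reduction:.1%}")
# prints "retained: 14.8%, reduction: 85.2%"
```

This is a reduction in the number of reference sequences, not in reads, so it does not predict the mapping rate directly; it only shows how much of the assembly is no longer available to map against.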
Ok, thanks. Right now I feel like it's a tradeoff between a strong filter like Transdecoder and mapping rate. I imagine that deeper sequencing might improve that? - but what if it's other stuff (lncRNAs etc) that Transdecoder isn't grabbing. Have you done a search for those in assembly data before? Unrelated-ish, I'm going to include a snapshot of the MEGAN6 output here, just for an idea of what's in there. It's coming up mostly as clam, which is a good thing.
> but what if it's other stuff (lncRNAs etc) that Transdecoder isn't grabbing.
lncRNAs, miRNAs, spurious transcripts, degraded RNA, etc. Transdecoder is only concerned with identifying possible ORFs. That's all. So, there's going to be a lot of "junk" that Transdecoder is going to ignore/rule out. As you noted, increased sequencing depth will almost always improve assemblies (and subsequent annotations).
> Have you done a search for those in assembly data before?
Nope.
2 files (and checksums) available at https://gannet.fish.washington.edu/seashell/wd/ln/