De novo assemble clam PE short reads

sr320 commented 1 year ago

2 files (and checksums) available at https://gannet.fish.washington.edu/seashell/wd/ln/

kubu4 commented 1 year ago

It's not terribly important, but I like to know genus/species for my workflow. Do you happen to have that info?

sr320 commented 1 year ago

Little neck clam

kubu4 commented 1 year ago

Trinity assembly stats:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  502826
Total trinity transcripts:  645444
Percent GC: 36.80

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 3398
    Contig N20: 2073
    Contig N30: 1301
    Contig N40: 862
    Contig N50: 609

    Median contig length: 319
    Average contig: 516.94
    Total assembled bases: 333658646

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 2495
    Contig N20: 1337
    Contig N30: 829
    Contig N40: 588
    Contig N50: 455

    Median contig length: 300
    Average contig: 439.38
    Total assembled bases: 220929754
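For anyone unsure how to read the Nxx values above, here is a minimal sketch of the usual definition: sort contigs from longest to shortest and report the length at which the running total first reaches the given fraction of all assembled bases. The function name `nxx` and the toy lengths are made up for illustration. This is not Trinity's own stats script.

```python
def nxx(lengths, fraction=0.5):
    """Return the contig length L such that contigs of length >= L
    together contain at least `fraction` of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length


# Toy example (hypothetical lengths, not from this assembly).
toy_lengths = [100, 200, 300, 400, 500]
print(nxx(toy_lengths, 0.5))  # prints 400
```

An N50 of 609 with a median of 319 means half of the assembled bases sit in contigs of 609 bp or longer, even though most contigs are much shorter than that.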

I followed up this assembly with Transdecoder to get a better idea of how many of these "genes" are actually "functional". It does this using blastp and Pfam alignments, along with a Hidden Markov Model, to predict open reading frames (ORFs).

Transdecoder identified:

- 74,533 genes (obviously a massive reduction from what Trinity counts). This number includes both complete and partial genes (i.e. ORFs).
- 28,451 complete ORFs.
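A rough sketch of how complete and partial ORFs could be tallied from Transdecoder's peptide FastA headers. It assumes each header carries a `type:` field (e.g. `type:complete`, `type:5prime_partial`); the header lines below are invented for illustration and the function name `count_orf_types` is hypothetical. It counts ORF records, so it would not necessarily reproduce the gene-level numbers above.

```python
import re
from collections import Counter

# Assumed header layout: "... ORF type:complete len:175 (+) ..."
ORF_TYPE = re.compile(r"\btype:(\S+)")


def count_orf_types(header_lines):
    """Count ORF records per 'type:' value found in FastA header lines."""
    counts = Counter()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = ORF_TYPE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented example headers (not taken from the real output files).
example = [
    ">TRINITY_DN1_c0_g1_i1.p1 ORF type:complete len:175 (+)",
    "MSTNPKPQRK",
    ">TRINITY_DN2_c0_g1_i1.p1 ORF type:5prime_partial len:120 (-)",
    ">TRINITY_DN3_c0_g2_i1.p1 ORF type:complete len:300 (+)",
]
print(count_orf_types(example))
```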

Links:

Trinity assembly directory:

https://gannet.fish.washington.edu/Atumefaciens/20230616-lsta-trinity-RNAseq/

Trinity FastA:

https://gannet.fish.washington.edu/Atumefaciens/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta (359MB)
- MD5 checksum: 1b5029cd4dbd5ff55bcf81c8dd62f236
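If you download the FastA, a minimal sketch for checking it against the MD5 above. The local file path is whatever you saved it as; `md5sum` and `verify` are illustrative helper names, not part of any pipeline used here.

```python
import hashlib

# Checksum posted above for the Trinity FastA.
EXPECTED_MD5 = "1b5029cd4dbd5ff55bcf81c8dd62f236"


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected digest."""
    return md5sum(path) == expected.strip().lower()


# Usage (assumes the file was saved under its original name):
# verify("lsta-de_novo-transcriptome_v1.0.fasta")
```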

Trinity FastA index:

https://gannet.fish.washington.edu/Atumefaciens/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta.fai
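The index is handy if you only need contig lengths and don't want to parse the full FastA. A `.fai` file is tab-delimited with the sequence name in the first column and its length in the second. A small sketch (the function name and the example lines are made up):

```python
def read_fai_lengths(lines):
    """Map sequence name -> length from the lines of a .fai index."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Columns: name, length, offset, line bases, line width.
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths


# Invented example lines in .fai layout (not from the real index).
example_fai = [
    "TRINITY_DN1_c0_g1_i1\t609\t60\t60\t61\n",
    "TRINITY_DN2_c0_g1_i1\t319\t740\t60\t61\n",
]
print(read_fai_lengths(example_fai))
```

With the real file you would pass an open file handle, e.g. `read_fai_lengths(open(path))`, since iterating a handle yields lines.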

Trinity stats file:

https://gannet.fish.washington.edu/Atumefaciens/20230616-lsta-trinity-RNAseq/lsta-de_novo-transcriptome_v1.0.fasta_assembly_stats.txt

Transdecoder directory:

https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/

Transdecoder GFF3:

https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.gff3 (99MB)

Transdecoder BED:

https://gannet.fish.washington.edu/Atumefaciens/20230617-lsta-transdecoder-transcriptome_v1.0/lsta-de_novo-transcriptome_v1.0.fasta.transdecoder.bed (29MB)

mgavery commented 1 year ago

@kubu4, do you typically see a big drop off in mapping rates when using Transdecoder? It's not a direct comparison, but mapping rate went from 50-92% (depending on tissue sample) with Giles' Trinity transcriptome to 21-34% with the Transdecoder filtered transcriptome. Of course I would expect mapping to drop when removing contigs, but this seems like a lot, so I just wonder if you have any comparison?

mgavery commented 1 year ago

Oh, here is a spreadsheet of mapping rates with various assemblies. The comparison above corresponds to column J (Trinity transcriptome) and column K (Transdecoder filtered transcriptome). You probably have to request permission from Giles to view https://docs.google.com/spreadsheets/d/15AVtCiVotFQotXtILx2pZ3gQswLyTew_S3eBaPM6yKc/edit#gid=0

kubu4 commented 1 year ago

I've never tried mapping reads back to the Transdecoder results before. However, that difference in mapping rates isn't surprising when you look at the number of "genes" identified by Trinity (502826) and the number of complete/partial ORFs identified by Transdecoder (74533) - that's a reduction of ~85%...

So, the mapping rates you're seeing when comparing the Trinity transcriptome and the Transdecoder results seem like what would be expected.
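For reference, the arithmetic behind the ~85% figure, as a quick sketch (the function name is just for illustration). It compares feature counts only and does not by itself predict a mapping rate.

```python
def fraction_removed(before, after):
    """Fraction of features dropped when going from `before` to `after`."""
    return 1 - after / before


trinity_genes = 502826      # Trinity 'genes' from the stats above
transdecoder_orfs = 74533   # complete + partial ORFs from Transdecoder

reduction = fraction_removed(trinity_genes, transdecoder_orfs)
print(f"retained: {1 - reduction:.1%}, reduction: {reduction:.1%}")
# prints: retained: 14.8%, reduction: 85.2%
```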

mgavery commented 1 year ago

Ok, thanks. Right now I feel like it's a tradeoff between a strong filter like Transdecoder and mapping rate. I imagine that deeper sequencing might improve that? But what if it's other stuff (lncRNAs etc) that Transdecoder isn't grabbing? Have you done a search for those in assembly data before? Unrelated-ish, I'm going to include a snapshot of the MEGAN6 output here, just for an idea of what's in there. It's coming up mostly as clam, which is a good thing.

[Image: snapshot of MEGAN6 output]

kubu4 commented 1 year ago

> but what if it's other stuff (lncRNAs etc) that Transdecoder isn't grabbing.

lncRNAs, miRNAs, spurious transcripts, degraded RNA, etc. Transdecoder is only concerned with identifying possible ORFs. That's all. So, there's going to be a lot of "junk" that Transdecoder is going to ignore/rule out. As you noted, increasing sequencing depth will almost always improve assemblies (and subsequent annotations).

> Have you done a search for those in assembly data before?

Nope.

RobertsLab / resources

De novo assemble clam PE short reads #1655