Closed martsper closed 3 years ago
Tagging @andrewprzh
Dear Martin
Thank you for your question and thanks for pointing out inconsistencies in the manuals! Indeed, --pacbio option can handle CCS (as well as HiFi) reads. Moreover, I don't know whether anybody uses anything beside CCS reads in transcriptomic world, so this option in rnaSPAdes is meant primarily for CCS reads. We will fix explanations in the nearest release.
Here is a brief description of hybrid workflow in SPAdes (and rnaSPAdes):
Regarding the polyA and adapters. rnaSPAdes does remove polyA k-mers during graph construction, but does not trim long reads in any way. Since long reads are mapped to the graph, my guess is that polyA tails and adapters will simply remain unmapped in most cases. However, if trimming can be easily done, I'd suggest to feed trimmed reads to ensure these artifacts do not affect the assembly.
rnaSPAdes does not detect polyAs or adapters, so if reads are fed using --pacbio they are assumed to be PacBio reads with possible truncation at the ends. When reads are provided via --fl-rna, rnaSPAdes assumes that they capture the entire transcript and outputs the entire corresponding path in the graph (without duplicates of course). And yes, this option was designed specifically to support FLNCs.
So in your case I'd probably extract FLNC reads (not HQ isoforms) and feed them with --fl-rna. An alternative option is to feed both FLNC reads via --fl-rna and initial (but trimmed) CCS reads simply via --pacbio.
Don't hesitate to ask more question if something remains unclear. Best Andrey
Dear Andrey,
Thanks a lot for the information, this was very enlightening. I have indeed one or two more questions.
First, I would like to note that PacBio FLNC reads are not necessarily true full-length reads. There are IsoSeq library preparation protocols available that can be used to select for full-length mRNA transcripts, based on an integer polyA tail AND a 5' cap. But, if such protocol is NOT used, than IsoSeq also captures mRNA transcripts that are partially 5'-degraded (I think the official PacBio IsoSeq protocol does not include 5' cap selection). In such cases, FLNC reads have integer, flanking cDNA primers and a poly(A) tails, but also include partially 5'-degraded mRNA. Not sure if this will effect the outcome of the --fl-rna option.
Why I started this discussion: I'm having a project in which we used hybrid rnaSPAdes on stranded Illumina paired-end reads (2 x 75 bp; circa 300 Mio reads) and PacBio HQ isoforms (circa 40,000 transcripts; no 5'-cap selection), using the rnaSPAdes --fl-rna option. However, based on rnaQUAST transcriptome assessment (see below), the integration of the PacBio HQ isoforms had no apparent advantage. Actually, the opposite was the case: hybrid rnaSPAdes joined transcripts that derived from neighboring genes encoded on opposite strands (resulting in more misassembled transcripts). I attached an alignment that visualizes the problem.
Comparison of Hybrid assembly vs. Illumina-only assembly, using rnaQUAST:
SHORT SUMMARY REPORT
METRICS/TRANSCRIPTS Hybrid assembly Illumina-only assembly
== DATABASE METRICS ==
Genes 38544 38544
Avg. number of exons per isoform 3.585 3.585
== BASIC TRANSCRIPTS METRICS ==
Transcripts 64858 64777
Transcripts > 500 bp 38505 37780
Transcripts > 1000 bp 26723 24751
== ALIGNMENT METRICS ==
Aligned 56096 55929
Uniquely aligned 46042 46103
Multiply aligned 6160 6254
Unaligned 8762 8848
== ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS ==
Avg. aligned fraction 0.926 0.928
Avg. alignment length 808.186 753.516
Avg. mismatches per transcript 5.845 5.604
== ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS ==
Misassemblies 2424 2093
== ASSEMBLY COMPLETENESS (SENSITIVITY) ==
Database coverage 0.393 0.401
Duplication ratio 1.568 1.461
50%-assembled genes 12859 13189
95%-assembled genes 7121 7134
50%-covered genes 13663 14055
95%-covered genes 7938 8086
50%-assembled isoforms 12860 13190
95%-assembled isoforms 7121 7134
50%-covered isoforms 13664 14056
95%-covered isoforms 7938 8086
Mean isoform coverage 0.764 0.765
Mean isoform assembly 0.724 0.722
== ASSEMBLY SPECIFICITY ==
50%-matched 21232 21743
95%-matched 3815 4026
Unannotated 19423 19753
Mean fraction of transcript matched 0.295 0.299
Alignment of Illumina-only assembly and hybrid assembly to reference genome (non-model organism, circa 150 Mbp): 1) Gene models in reference genome 2) Raw Illumina reads 3) PacBio HQ isoforms (the two colors indicate the +/- strand) 4) Illumina-only assembly (v3.14.0; --pe --ss-rf) 5) Hybrid assembly (v3.14.0; --pe --ss-rf --fl-rna)
It appears that the --fl-rna option is not aware about strand specific information, and I am curious if you encountered similar problems. We will now re-run hybrid rnaSPAdes with --pacbio and CCS reads (not using --fl-rna). I hope this makes a difference.
Best, Martin
Hi,
We compared now the --pacbio and --fl-rna option of hybrid rnaSPAdes. While --fl-rna appears to lose strand information, and joins transcripts from neighboring genes on opposing strands, the --pacbio options behaves as expected. See below an alignment of raw sequencing data and transcriptome assemblies to a reference genome:
Based on this, it appears to me that the current best practice should be to use PacBio CCS reads and the --pacbio option for hybrid rnaSPAdes. Did you observed similar problems with --fl-rna?
Best, Martin
Dear Martin
I'm very sorry for a delayed response.
Thanks for pointing out details of IsoSeq protocols! As far as I see in the code, there shouldn't be a big problem. When there are multiple reads from the same isoform, rnaSPAdes selects the longest path in the graph, and all subpaths (originated from 5' degraded molecules) are removed. Thus, I expect the most complete isoform to be reported.
Thank you for sharing you results. That is, indeed discouraging and encouraging at same time depending on which IGV panel I look :) In fact, I cannot say I have experienced something like this before, but I see a clear flaw here. Would it be possible to share some data so I could dig into the problem? E.g. subset of reads that corresponds to the shown genome region? We can discuss it by email (in my GitHub profile) if you'd like to.
Out of curiosity, why de novo assembly was used when the reference genome and annotation are available? I ask this question because now I work a lot on reference-based methods for isoform detection from long reads and we have a new tool released just recently https://github.com/ablab/IsoQuant
All the best Andrey
Hi Andrey,
Thanks a lot for the reply. Yes, I would be happy to share some data. I will send it to you in private.
For your question: The reference genome is from a non-model organism with high genomic variability; the assembly is fragmented, with unclassified repeats and duplications and possible contaminations. Also the annotations are rather uncertain. I was afraid of introducing artifacts when using the reference genome for the transcriptome assembly (but I never tried). Good luck with IsoQuant, this is surely a worthwhile effort! (especially when PacBio sequencing depth will eventually become competitive).
Best, Martin
Dear SPAdes team,
I am a bit puzzled about the best practice for the hybrid rnaSPAdes algorithm, when used with PacBio IsoSeq long reads and Illumina short reads.
Form the different spades manuals (and publications), I understood that I can choose for PacBio long read integration between the options --pacbio, -s and --fl-rna.
The main spades manual (https://github.com/ablab/spades) explains to use the --pacbio option ONLY with PacBio CLR reads (which are not the right reads for hybrid rnaSPAdes, since the PacBio IsoSeq long read sequencing protocol works in CCS mode, and not CLR). Therefore, it appears that the -s option is the right choice for PacBio CCS read integration.
The main spades manual further says that "PacBio CCS/Reads of Insert reads [...] can be simply provided as single reads to SPAdes." (with -s). I just want to make sure: PacBio CCS reads, as produced by the SMRT Link software, are Q20 filtered, include non-full length reads, and contain poly(A) tails and cDNA primer artifacts. Is the spades assembler taking care of the poly(A) tails and cDNA primers? If not, the SMRT Link software has the option to filter the CCS reads for those that are having both cDNA primers and a poly(A) tail. The SMRT link software can further trim those sequences, generating full-length reads, called FLNC reads. Means, in case that the spades assembler does not account for poly(A) tails and cDNA primers in CCS reads, would you say that using the PacBio FLNC reads is the best practice for hybrid rnaSPAdes? Should the FLNC reads be added as FASTQ?
Now it gets confusing. Within the rnaSPAdes manual, the option of using the -s flag is not mentioned. From the rnaSPAdes manual, it seems that the --pacbio option is the right choice for hybrid rnaSPAdes (though we learned in the main manual that the --pacbio option is for CLR reads, and not for CCS reads). Does it mean that the --pacbio option can actually handle CCS reads (and as said above, should the CCS reads be trimmed from poly(A) tails and cDNA primers?)?
Finally, the rnaSPAdes manual says that besides using the --pacbio option, full length PacBio reads can be additionally added with the --fl-rna option. With this option, I could now either add the above mentioned PacBio full length FLNC reads, or PacBio full length HQ isoforms, both produced by the SMRT Link software (the latter are FLNC reads that underwent rigorous clustering, consensus calling, and filtering, and have therefore a much higher quality than the FLNC reads, but miss some lowly expressed transcripts). I wonder now what is more important for the --fl-rna option: having rather a mass of PacBio FLNC full length reads, or preferentially only those with high quality? (it is probably the responsibility of the user to test the right options; but I would be curious about your opinion). Considering all the above mentioned, the best practice appears to be using hybrid rnaSPAes with --pacbio FLNC.fastq AND --fl-rna FLNC.fasta (?)
Please don't understand me wrong, I love rnaSPAdes and the idea of the hybrid assembly is awesome. I am just curious about the right choice (I am using rnaQUAST to evaluate assemblies, but I just wonder what makes conceptually more sense).
Best, Martin