ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/
Other
763 stars 139 forks source link

[QUESTION] Best practice for input of PacBio IsoSeq data into hybrid rnaSPAdes #713

Closed martsper closed 3 years ago

martsper commented 3 years ago

Dear SPAdes team,

I am a bit puzzled about the best practice for the hybrid rnaSPAdes algorithm, when used with PacBio IsoSeq long reads and Illumina short reads.

Form the different spades manuals (and publications), I understood that I can choose for PacBio long read integration between the options --pacbio, -s and --fl-rna.

The main spades manual (https://github.com/ablab/spades) explains to use the --pacbio option ONLY with PacBio CLR reads (which are not the right reads for hybrid rnaSPAdes, since the PacBio IsoSeq long read sequencing protocol works in CCS mode, and not CLR). Therefore, it appears that the -s option is the right choice for PacBio CCS read integration.

The main spades manual further says that "PacBio CCS/Reads of Insert reads [...] can be simply provided as single reads to SPAdes." (with -s). I just want to make sure: PacBio CCS reads, as produced by the SMRT Link software, are Q20 filtered, include non-full length reads, and contain poly(A) tails and cDNA primer artifacts. Is the spades assembler taking care of the poly(A) tails and cDNA primers? If not, the SMRT Link software has the option to filter the CCS reads for those that are having both cDNA primers and a poly(A) tail. The SMRT link software can further trim those sequences, generating full-length reads, called FLNC reads. Means, in case that the spades assembler does not account for poly(A) tails and cDNA primers in CCS reads, would you say that using the PacBio FLNC reads is the best practice for hybrid rnaSPAdes? Should the FLNC reads be added as FASTQ?

Now it gets confusing. Within the rnaSPAdes manual, the option of using the -s flag is not mentioned. From the rnaSPAdes manual, it seems that the --pacbio option is the right choice for hybrid rnaSPAdes (though we learned in the main manual that the --pacbio option is for CLR reads, and not for CCS reads). Does it mean that the --pacbio option can actually handle CCS reads (and as said above, should the CCS reads be trimmed from poly(A) tails and cDNA primers?)?

Finally, the rnaSPAdes manual says that besides using the --pacbio option, full length PacBio reads can be additionally added with the --fl-rna option. With this option, I could now either add the above mentioned PacBio full length FLNC reads, or PacBio full length HQ isoforms, both produced by the SMRT Link software (the latter are FLNC reads that underwent rigorous clustering, consensus calling, and filtering, and have therefore a much higher quality than the FLNC reads, but miss some lowly expressed transcripts). I wonder now what is more important for the --fl-rna option: having rather a mass of PacBio FLNC full length reads, or preferentially only those with high quality? (it is probably the responsibility of the user to test the right options; but I would be curious about your opinion). Considering all the above mentioned, the best practice appears to be using hybrid rnaSPAes with --pacbio FLNC.fastq AND --fl-rna FLNC.fasta (?)

Please don't understand me wrong, I love rnaSPAdes and the idea of the hybrid assembly is awesome. I am just curious about the right choice (I am using rnaQUAST to evaluate assemblies, but I just wonder what makes conceptually more sense).

Best, Martin

asl commented 3 years ago

Tagging @andrewprzh

andrewprzh commented 3 years ago

Dear Martin

Thank you for your question and thanks for pointing out inconsistencies in the manuals! Indeed, --pacbio option can handle CCS (as well as HiFi) reads. Moreover, I don't know whether anybody uses anything beside CCS reads in transcriptomic world, so this option in rnaSPAdes is meant primarily for CCS reads. We will fix explanations in the nearest release.

Here is a brief description of hybrid workflow in SPAdes (and rnaSPAdes):

Regarding the polyA and adapters. rnaSPAdes does remove polyA k-mers during graph construction, but does not trim long reads in any way. Since long reads are mapped to the graph, my guess is that polyA tails and adapters will simply remain unmapped in most cases. However, if trimming can be easily done, I'd suggest to feed trimmed reads to ensure these artifacts do not affect the assembly.

rnaSPAdes does not detect polyAs or adapters, so if reads are fed using --pacbio they are assumed to be PacBio reads with possible truncation at the ends. When reads are provided via --fl-rna, rnaSPAdes assumes that they capture the entire transcript and outputs the entire corresponding path in the graph (without duplicates of course). And yes, this option was designed specifically to support FLNCs.

So in your case I'd probably extract FLNC reads (not HQ isoforms) and feed them with --fl-rna. An alternative option is to feed both FLNC reads via --fl-rna and initial (but trimmed) CCS reads simply via --pacbio.

Don't hesitate to ask more question if something remains unclear. Best Andrey

martsper commented 3 years ago

Dear Andrey,

Thanks a lot for the information, this was very enlightening. I have indeed one or two more questions.

First, I would like to note that PacBio FLNC reads are not necessarily true full-length reads. There are IsoSeq library preparation protocols available that can be used to select for full-length mRNA transcripts, based on an integer polyA tail AND a 5' cap. But, if such protocol is NOT used, than IsoSeq also captures mRNA transcripts that are partially 5'-degraded (I think the official PacBio IsoSeq protocol does not include 5' cap selection). In such cases, FLNC reads have integer, flanking cDNA primers and a poly(A) tails, but also include partially 5'-degraded mRNA. Not sure if this will effect the outcome of the --fl-rna option.

Why I started this discussion: I'm having a project in which we used hybrid rnaSPAdes on stranded Illumina paired-end reads (2 x 75 bp; circa 300 Mio reads) and PacBio HQ isoforms (circa 40,000 transcripts; no 5'-cap selection), using the rnaSPAdes --fl-rna option. However, based on rnaQUAST transcriptome assessment (see below), the integration of the PacBio HQ isoforms had no apparent advantage. Actually, the opposite was the case: hybrid rnaSPAdes joined transcripts that derived from neighboring genes encoded on opposite strands (resulting in more misassembled transcripts). I attached an alignment that visualizes the problem.

Comparison of Hybrid assembly vs. Illumina-only assembly, using rnaQUAST:

SHORT SUMMARY REPORT 

METRICS/TRANSCRIPTS                                    Hybrid assembly          Illumina-only assembly     

 == DATABASE METRICS == 
Genes                                                  38544                    38544                    
Avg. number of exons per isoform                       3.585                    3.585

 == BASIC TRANSCRIPTS METRICS == 
Transcripts                                            64858                    64777                    
Transcripts > 500 bp                                   38505                    37780                    
Transcripts > 1000 bp                                  26723                    24751

 == ALIGNMENT METRICS == 
Aligned                                                56096                    55929                    
Uniquely aligned                                       46042                    46103                    
Multiply aligned                                       6160                     6254                     
Unaligned                                              8762                     8848

 == ALIGNMENT METRICS FOR NON-MISASSEMBLED TRANSCRIPTS == 
Avg. aligned fraction                                  0.926                    0.928                    
Avg. alignment length                                  808.186                  753.516                  
Avg. mismatches per transcript                         5.845                    5.604

 == ALIGNMENT METRICS FOR MISASSEMBLED (CHIMERIC) TRANSCRIPTS == 
Misassemblies                                          2424                     2093

 == ASSEMBLY COMPLETENESS (SENSITIVITY) == 
Database coverage                                      0.393                    0.401                    
Duplication ratio                                      1.568                    1.461                    
50%-assembled genes                                    12859                    13189                    
95%-assembled genes                                    7121                     7134                     
50%-covered genes                                      13663                    14055                    
95%-covered genes                                      7938                     8086                     
50%-assembled isoforms                                 12860                    13190                    
95%-assembled isoforms                                 7121                     7134                     
50%-covered isoforms                                   13664                    14056                    
95%-covered isoforms                                   7938                     8086                     
Mean isoform coverage                                  0.764                    0.765                    
Mean isoform assembly                                  0.724                    0.722

 == ASSEMBLY SPECIFICITY == 
50%-matched                                            21232                    21743                    
95%-matched                                            3815                     4026                     
Unannotated                                            19423                    19753                    
Mean fraction of transcript matched                    0.295                    0.299

Alignment of Illumina-only assembly and hybrid assembly to reference genome (non-model organism, circa 150 Mbp): 1) Gene models in reference genome 2) Raw Illumina reads 3) PacBio HQ isoforms (the two colors indicate the +/- strand) 4) Illumina-only assembly (v3.14.0; --pe --ss-rf) 5) Hybrid assembly (v3.14.0; --pe --ss-rf --fl-rna)

Capture

It appears that the --fl-rna option is not aware about strand specific information, and I am curious if you encountered similar problems. We will now re-run hybrid rnaSPAdes with --pacbio and CCS reads (not using --fl-rna). I hope this makes a difference.

Best, Martin

martsper commented 3 years ago

Hi,

We compared now the --pacbio and --fl-rna option of hybrid rnaSPAdes. While --fl-rna appears to lose strand information, and joins transcripts from neighboring genes on opposing strands, the --pacbio options behaves as expected. See below an alignment of raw sequencing data and transcriptome assemblies to a reference genome:

  1. Gene models in reference genome
  2. Raw Illumina reads
  3. PacBio HQ isoforms (the two colors indicate the +/- strand)
  4. Illumina-only assembly (v3.14.0; --pe --ss-rf)
  5. Hybrid assembly (v3.14.0; --pe --ss-rf --fl-rna)
  6. Hybrid assembly (v3.14.0; --pe --ss-rf --pacbio)

Picture2

Based on this, it appears to me that the current best practice should be to use PacBio CCS reads and the --pacbio option for hybrid rnaSPAdes. Did you observed similar problems with --fl-rna?

Best, Martin

andrewprzh commented 3 years ago

Dear Martin

I'm very sorry for a delayed response.

Thanks for pointing out details of IsoSeq protocols! As far as I see in the code, there shouldn't be a big problem. When there are multiple reads from the same isoform, rnaSPAdes selects the longest path in the graph, and all subpaths (originated from 5' degraded molecules) are removed. Thus, I expect the most complete isoform to be reported.

Thank you for sharing you results. That is, indeed discouraging and encouraging at same time depending on which IGV panel I look :) In fact, I cannot say I have experienced something like this before, but I see a clear flaw here. Would it be possible to share some data so I could dig into the problem? E.g. subset of reads that corresponds to the shown genome region? We can discuss it by email (in my GitHub profile) if you'd like to.

Out of curiosity, why de novo assembly was used when the reference genome and annotation are available? I ask this question because now I work a lot on reference-based methods for isoform detection from long reads and we have a new tool released just recently https://github.com/ablab/IsoQuant

All the best Andrey

martsper commented 3 years ago

Hi Andrey,

Thanks a lot for the reply. Yes, I would be happy to share some data. I will send it to you in private.

For your question: The reference genome is from a non-model organism with high genomic variability; the assembly is fragmented, with unclassified repeats and duplications and possible contaminations. Also the annotations are rather uncertain. I was afraid of introducing artifacts when using the reference genome for the transcriptome assembly (but I never tried). Good luck with IsoQuant, this is surely a worthwhile effort! (especially when PacBio sequencing depth will eventually become competitive).

Best, Martin