ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/
Other
745 stars 135 forks source link

Strand-specificity and poly-N runs with rnaSPAdes #141

Closed jrodnz closed 6 years ago

jrodnz commented 6 years ago

Note: The info here has been updated on 3 Oct 2018 based on errors pointed out by @andrewprzh . This changes some of the discussion about strand-specificity.

First off, thanks for developing and supporting rnaSPAdes! I am testing v3.11.1 for strand-specific transcriptome assembly of hexaploid blueberry, and have encountered two issues:

  1. Incorrect insert size with --pe-1-rf Data consists of 3 libraries of 100 bp paired end reads (94 bp after trimming). From running picard tools on mapping to a related diploid genome, I determine that insert size is ~150 bp (median and mean within 15 bp). Libraries were generated with the Lexogen Sense mRNA kit, so R1 is antisense to the transcript and R2 is sense. Based on the rnaSPAdes manual, I first used --ss-rf and --pe1-rf since "The data set is strand-specific and first read in pair corresponds to reverse gene strand" and I assumed that reverse gene strand is the same as reverse orientation with respect to the transcript. However, I received the below warning in the log with --pe1-rf: 0:10:19.497 2G / 8G INFO General (pair_info_count.cpp : 351) Insert size = 62.0174, deviation = 28.041, left quantile = 21, right quantile = 97, read length = 94 0:10:19.497 2G / 8G WARN General (pair_info_count.cpp : 355) Estimated mean insert size 62.0174 is very small compared to read length 94 When run with --ss-fr or --ss-rf or --pe-fr, I do not get any warnings and estimated insert size is accurate. With --ss-fr: 0:14:01.029 2G / 10G INFO General (pair_info_count.cpp : 351) Insert size = 144.063, deviation = 57.2587, left quantile = 83, right quantile = 218, read length = 94 With --ss-rf: 0:10:56.513 2G / 8G INFO General (pair_info_count.cpp : 351) Insert size = 144.065, deviation = 57.2611, left quantile = 83, right quantile = 218, read length = 94 With --pe-fr (small difference is due to not normalizing reads as done for other runs): 1:49:26.236 6G / 13G INFO General (pair_info_count.cpp : 351) Insert size = 141.018, deviation = 52.2852, left quantile = 86, right quantile = 208, read length = 94

Furthermore, when I map outputted transcripts to the related diploid genome and compare with genome annotations, --ss-fr produces more complete transcripts than --pe1-rf. Also see averaeg transcript length in below table. What is the issue with --pe1-rf?

spades-fr-vs-rf_scaffold1_1194000-1225000

  1. Poly-N runs in output transcripts When I compare hard_filtered_transcripts.fasta sequences between the --ss-rf and --ss-fr runs, both show poly-N runs of length 10. Example: grep -v ">" rnaSPAdes-ss-rf/hard_filtered_transcripts.fasta | grep "N" GCGAATTTGCCGGNNNNNNNNNNTGGCCGGAGTTGGTGACATCATAGTCGTCGGCGAATT GGTTTGAGTCCACCATCTCCTGGCATNNNNNNNNNNCATTCCCACCAGGACCAGGTTTGA GCATGNNNNNNNNNNACTCAAACCTGGTCCTGGTGGGAATGGGTTTGGTGGAAGCATGCC ACGAAAAACCAACGCCAGCTGAAACAGAAGTCAAGAAGAAGGNNNNNNNNNNGCTGCAGA This problem was worse for the --ss-fr, --ss-rf and --pe-fr runs as shown in the table below. Why does the algorithm insert these poly-N runs in the transcripts?
Parameter --pe1-rf --pe1-fr* --ss-rf --ss-fr
Total number of transcripts in in hard_filtered_transcripts.fasta 80763 94945 91889 91885
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's 464501 696677 633260 633265
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's and including N's 4 6567 8042 8087
Average transcript length in hard_filtered_transcripts.fasta 316 411 384 384
Largest transcript length in hard_filtered_transcripts.fasta 5429 6958 5550 5550

*Note that this is unlike the other runs in that the dataset was not normalized before the run

Thanks in advance for your time!

andrewprzh commented 6 years ago

Thank you for the feedback and sorry for a delay!

  1. Seems like bug regarding the read-pair orientation in strand specific mode. Could you send spades.log files from both runs? Indeed, --ss-fr option outputs longer contigs since the read-pairs are treated properly, but contigs itself are most likely outputted with the opposite strand. We will check the SS pipeline and try to inspect the bug.

  2. Poly Ns appear due to the gap closing, especially in low-coverage regions. --ss-rf has the small number of those for the same reason -- read pairs in this mode are treated incorrectly and almost no gaps are closed. Could you try rnaSPAdes 3.12 -- it has somewhat different policy regarding Ns?

Best Andrey

jrodnz commented 6 years ago

@andrewprzh Thank you very much! I will try again with rnaSPAdes 3.12 and let you know how it goes.

jrodnz commented 6 years ago

Note: Below info updated on Oct 3 2018 to correct errors pointed out by @andrewprzh

@andrewprzh It seems that rnaSPAdes 3.12 introduced even more poly-N gaps than 3.11.1. Is this kind of gap closing due to paired end reads being split across contigs but no real sequence overlaps between contigs? Do you have any suggestions for ways to deal with these poly-N runs when defining ORFs within the transcripts? Thanks again.

Parameter v3.11.1 --pe1-rf v3.11.1 --pe1-fr* v3.12 --pe1-rf v3.12 --pe1-fr
Total number of transcripts in in hard_filtered_transcripts.fasta 80763 94945 89962 114476
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's 464501 696677 469010 732803
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's and including N's 4 6567 29 33319
Average transcript length in hard_filtered_transcripts.fasta 316 411 284 353
Largest transcript length in hard_filtered_transcripts.fasta 5429 6958 5429 5429
andrewprzh commented 6 years ago

Yes, you are right about the origin of the N's. There are also other possible reasons --- complex tandem repeats that could not be resolved by the assembler, but those are pretty rare for the transcriptomic data. Did you check how long are these N stretches?

My guess is that since N stretches have uncertain length, they may slightly spoil ORF detection. Did you try to analyze the resulting contigs with the tools you usually use for ORF finding?

jrodnz commented 6 years ago

Thanks for the thoughts, Andrey! Looking at v3.12 --pe-1-rf hard_filtered_transcripts.fasta, the 29 lines with N's correspond to 24 N stretches are mostly 10 nucleotides long (like with version 3.11.1) though there were two out of the 24 that were 100 nt long. I use evigene for ORF finding and it looks like these can introduce premature stop codons.

However, the weirder thing I found is that whether I run pe1-fr or pe1-rf, longestORFs are in antisense direction for some transcripts and sense for others. This leads to some more questions:

  1. What does the pe<#>- parameter do? I thought it was to inform strand-specificity of the library but now am unsure.
  2. Should I be specifying --ss-rf or --ss-fr instead or in addition to pe1-fr/rf? Will this be different for a single set of files versus multiple sets that need to be combined for the de novo assembly?
jrodnz commented 6 years ago

Update: Using --ss-fr along with --pe1-fr, --pe2-fr, ... will produce strand-specific output, so my work around is to do that and then reverse complement. Due to time constraints, I have not tried using --ss-rf along with --pe1-rf, but perhaps someone can update that here if they try it.

andrewprzh commented 6 years ago

Hi Jessica!

--pe-rf and --pe-fr options set relative orientation of Illumina reads. Paired-end reads typically (I'd say almost always) are oriented inwards (forward-reverse, ---> <---), mate-pairs are typically oriented outwards (reverse-forward, <--- --->). This actually has no relationship to RNA-Seq strand-specificity. As far as I know RNA-Seq paired reads are always paired-end and thus have FR orientation (--pe-fr which is set by default).

And --ss-rf --ss-fr options set strand-specificity (which does not depend on read-pairs relative orientation). So in your case using --ss-rf would give correct results, which as you pointed out, should be quite the same as using --ss-fr and then reverse-complementing the output sequences.

Best Andrey

jrodnz commented 6 years ago

Thanks, Andrey! It makes sense now. I appreciate your time and help!

lmuenter commented 4 years ago

Hello,

unfortunately, the problem still seems to persist (spades version 3.14). Roughly 50,000 of a total 200,000 contigs display poly-N stretches of ca. 3-15, sometimes even 30 nt! This problem only occurs in strand-specific mode. I assembled strand-specific RNAseq-data (TruSeq) from a polyploid plant with the command:

rnaspades.py -o "$outdir" --ss-rf --dataset "$samplesFile" -m "$max_memory" -t "$threads"

Interestingly, the transrate score was high at 0.39 - 0.45 (raw - optimised) and the assemblies were essentially BUSCO-complete (only 2% of empryhophyte BUSCOs were missing). Using Salmon, a mapping rate of 96% was found, and the library type was guessed correctly as "ISR" (via --libtype A).

However, as previously predicted in this thread, the relative ORF-lengths were substantially decreased compared to a simple paired-end assembly with the same data (ORF-lengths were 52% vs 65% of total transcript lengths).

Could this be a bug with the dataset file? Might there be biological reasons for this behaviour (high transcriptomic complexity...)?

However, thank you so much for this program, it has otherwise worked like a charm!

PS: The dataset-file I used (In yaml format, but with a .txt extension, since github seemingly does not allow upload of .yaml-files): samples_list.txt

asl commented 4 years ago

@lmuenter Please open a separate issue, do not hijack the closed ones