Closed jrodnz closed 6 years ago
Thank you for the feedback and sorry for a delay!
Seems like bug regarding the read-pair orientation in strand specific mode. Could you send spades.log files from both runs? Indeed, --ss-fr option outputs longer contigs since the read-pairs are treated properly, but contigs itself are most likely outputted with the opposite strand. We will check the SS pipeline and try to inspect the bug.
Poly Ns appear due to the gap closing, especially in low-coverage regions. --ss-rf has the small number of those for the same reason -- read pairs in this mode are treated incorrectly and almost no gaps are closed. Could you try rnaSPAdes 3.12 -- it has somewhat different policy regarding Ns?
Best Andrey
@andrewprzh Thank you very much! I will try again with rnaSPAdes 3.12 and let you know how it goes.
Note: Below info updated on Oct 3 2018 to correct errors pointed out by @andrewprzh
@andrewprzh It seems that rnaSPAdes 3.12 introduced even more poly-N gaps than 3.11.1. Is this kind of gap closing due to paired end reads being split across contigs but no real sequence overlaps between contigs? Do you have any suggestions for ways to deal with these poly-N runs when defining ORFs within the transcripts? Thanks again.
Parameter | v3.11.1 --pe1-rf | v3.11.1 --pe1-fr* | v3.12 --pe1-rf | v3.12 --pe1-fr |
---|---|---|---|---|
Total number of transcripts in in hard_filtered_transcripts.fasta | 80763 | 94945 | 89962 | 114476 |
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's | 464501 | 696677 | 469010 | 732803 |
No. of lines of hard_filtered_transcripts.fasta excluding headers and ID's and including N's | 4 | 6567 | 29 | 33319 |
Average transcript length in hard_filtered_transcripts.fasta | 316 | 411 | 284 | 353 |
Largest transcript length in hard_filtered_transcripts.fasta | 5429 | 6958 | 5429 | 5429 |
Yes, you are right about the origin of the N's. There are also other possible reasons --- complex tandem repeats that could not be resolved by the assembler, but those are pretty rare for the transcriptomic data. Did you check how long are these N stretches?
My guess is that since N stretches have uncertain length, they may slightly spoil ORF detection. Did you try to analyze the resulting contigs with the tools you usually use for ORF finding?
Thanks for the thoughts, Andrey! Looking at v3.12 --pe-1-rf hard_filtered_transcripts.fasta, the 29 lines with N's correspond to 24 N stretches are mostly 10 nucleotides long (like with version 3.11.1) though there were two out of the 24 that were 100 nt long. I use evigene for ORF finding and it looks like these can introduce premature stop codons.
However, the weirder thing I found is that whether I run pe1-fr or pe1-rf, longestORFs are in antisense direction for some transcripts and sense for others. This leads to some more questions:
Update: Using --ss-fr along with --pe1-fr, --pe2-fr, ... will produce strand-specific output, so my work around is to do that and then reverse complement. Due to time constraints, I have not tried using --ss-rf along with --pe1-rf, but perhaps someone can update that here if they try it.
Hi Jessica!
--pe-rf and --pe-fr options set relative orientation of Illumina reads. Paired-end reads typically (I'd say almost always) are oriented inwards (forward-reverse, ---> <---), mate-pairs are typically oriented outwards (reverse-forward, <--- --->). This actually has no relationship to RNA-Seq strand-specificity. As far as I know RNA-Seq paired reads are always paired-end and thus have FR orientation (--pe-fr which is set by default).
And --ss-rf --ss-fr options set strand-specificity (which does not depend on read-pairs relative orientation). So in your case using --ss-rf would give correct results, which as you pointed out, should be quite the same as using --ss-fr and then reverse-complementing the output sequences.
Best Andrey
Thanks, Andrey! It makes sense now. I appreciate your time and help!
Hello,
unfortunately, the problem still seems to persist (spades version 3.14). Roughly 50,000 of a total 200,000 contigs display poly-N stretches of ca. 3-15, sometimes even 30 nt! This problem only occurs in strand-specific mode. I assembled strand-specific RNAseq-data (TruSeq) from a polyploid plant with the command:
rnaspades.py -o "$outdir" --ss-rf --dataset "$samplesFile" -m "$max_memory" -t "$threads"
Interestingly, the transrate score was high at 0.39 - 0.45 (raw - optimised) and the assemblies were essentially BUSCO-complete (only 2% of empryhophyte BUSCOs were missing). Using Salmon, a mapping rate of 96% was found, and the library type was guessed correctly as "ISR" (via --libtype A).
However, as previously predicted in this thread, the relative ORF-lengths were substantially decreased compared to a simple paired-end assembly with the same data (ORF-lengths were 52% vs 65% of total transcript lengths).
Could this be a bug with the dataset file? Might there be biological reasons for this behaviour (high transcriptomic complexity...)?
However, thank you so much for this program, it has otherwise worked like a charm!
PS: The dataset-file I used (In yaml format, but with a .txt extension, since github seemingly does not allow upload of .yaml-files): samples_list.txt
@lmuenter Please open a separate issue, do not hijack the closed ones
Note: The info here has been updated on 3 Oct 2018 based on errors pointed out by @andrewprzh . This changes some of the discussion about strand-specificity.
First off, thanks for developing and supporting rnaSPAdes! I am testing v3.11.1 for strand-specific transcriptome assembly of hexaploid blueberry, and have encountered two issues:
--ss-rf
and--pe1-rf
since "The data set is strand-specific and first read in pair corresponds to reverse gene strand" and I assumed that reverse gene strand is the same as reverse orientation with respect to the transcript. However, I received the below warning in the log with--pe1-rf
:0:10:19.497 2G / 8G INFO General (pair_info_count.cpp : 351) Insert size = 62.0174, deviation = 28.041, left quantile = 21, right quantile = 97, read length = 94
0:10:19.497 2G / 8G WARN General (pair_info_count.cpp : 355) Estimated mean insert size 62.0174 is very small compared to read length 94
When run with--ss-fr
or--ss-rf
or--pe-fr
, I do not get any warnings and estimated insert size is accurate. With--ss-fr
:0:14:01.029 2G / 10G INFO General (pair_info_count.cpp : 351) Insert size = 144.063, deviation = 57.2587, left quantile = 83, right quantile = 218, read length = 94
With--ss-rf
:0:10:56.513 2G / 8G INFO General (pair_info_count.cpp : 351) Insert size = 144.065, deviation = 57.2611, left quantile = 83, right quantile = 218, read length = 94
With--pe-fr
(small difference is due to not normalizing reads as done for other runs):1:49:26.236 6G / 13G INFO General (pair_info_count.cpp : 351) Insert size = 141.018, deviation = 52.2852, left quantile = 86, right quantile = 208, read length = 94
Furthermore, when I map outputted transcripts to the related diploid genome and compare with genome annotations,
--ss-fr
produces more complete transcripts than--pe1-rf
. Also see averaeg transcript length in below table. What is the issue with--pe1-rf
?--ss-rf
and--ss-fr
runs, both show poly-N runs of length 10. Example:grep -v ">" rnaSPAdes-ss-rf/hard_filtered_transcripts.fasta | grep "N"
GCGAATTTGCCGGNNNNNNNNNNTGGCCGGAGTTGGTGACATCATAGTCGTCGGCGAATT
GGTTTGAGTCCACCATCTCCTGGCATNNNNNNNNNNCATTCCCACCAGGACCAGGTTTGA
GCATGNNNNNNNNNNACTCAAACCTGGTCCTGGTGGGAATGGGTTTGGTGGAAGCATGCC
ACGAAAAACCAACGCCAGCTGAAACAGAAGTCAAGAAGAAGGNNNNNNNNNNGCTGCAGA
This problem was worse for the--ss-fr
,--ss-rf
and--pe-fr
runs as shown in the table below. Why does the algorithm insert these poly-N runs in the transcripts?--pe1-rf
--pe1-fr
*--ss-rf
--ss-fr
*Note that this is unlike the other runs in that the dataset was not normalized before the run
Thanks in advance for your time!