Closed sum732 closed 4 years ago
Just to add. I had submitted 4 different jobs with different files (Ilumina PE and PacBio )and it seems 1 failed error shown is above. Two finished and both gave the following message
java.lang.StringIndexOutOfBoundsException: begin 0, end 206, length 200
at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
at java.base/java.lang.String.substring(String.java:1874)
at rnabloom.olc.Layout.layoutBackbones(Layout.java:745)
at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:451)
at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:196)
at rnabloom.olc.OverlapLayoutConcensus.overlapWithMinimapAndLayout(OverlapLayoutConcensus.java:167)
at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:281)
at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:2775)
at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:4570)
at rnabloom.RNABloom.main(RNABloom.java:6327)
However, In both the above case, I can see following files:
0 LONGREADS.ASSEMBLED
0 LONGREADS.CORRECTED
514K rnabloom_k17.hist
75M rnabloom.longreads.assembly_backbones.fa
757 rnabloom.longreads.assembly_backbones.fa.log
81M rnabloom.longreads.corrected.long.fa
7.8K rnabloom.longreads.corrected.repeats.fa
0 rnabloom.longreads.corrected.short.fa
153 rnabloom.ntcard.readslist.txt
36 rnabloom.longreads.corrected.long.fa
225 STARTED
I guess from all the files, file titled "rnabloom.longreads.corrected.long.fa" contains the assembled contigs? and this is the main file to be looked at. I don't know the importance ( or lack) other files that are created but empty.
What is the Java error and should assembly be deemed as susccess?
Finally, one last one is still processing at the time of writing.
Can you post the content of this file?
rnabloom.longreads.assembly_backbones.fa.log
rnabloom.longreads.corrected.long.fa
is the error-corrected reads
rnabloom.longreads.assembly_backbones.fa
is the assembled sequence but not polished
If the long-read assembly were successful, then you should see rnabloom.transcripts.fa
, which is the polished assembly.
Can you post the content of this file?
rnabloom.longreads.assembly_backbones.fa.log
Sure here it is
[M::mm_idx_gen::3.344*1.22] collected minimizers
[M::mm_idx_gen::4.027*2.19] sorted minimizers
[M::main::4.027*2.19] loaded/built the index for 22659 target sequence(s)
[M::mm_mapopt_update::4.331*2.11] mid_occ = 173
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 22659
[M::mm_idx_stat::4.517*2.06] distinct minimizers: 8341171 (41.34% are singletons); average occurrences: 3.073; average spacing: 2.946
[M::worker_pipeline::15.240*6.71] mapped 22659 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -x ava-ont -r 150 -t 10 Hybrid_Control_RNABloom/rnabloom.longreads.corrected.long.fa Hybrid_Control_RNABloom/rnabloom.longreads.corrected.long.fa
[M::main] Real time: 15.326 sec; CPU: 102.338 sec; Peak RSS: 8.730 GB
the java error (posted above) can be safely ignored ?
Thanks! That confirms the minimap overlapping executed successfuly.
For the sample without the Java error, can you re-assemble with this option -lrrd 1
?
For samples with the Java error, can you please check whether you have any duplicated read names in the long reads?
So I should ignore the error message, as minimap completed ?
ERROR: Error assembling long reads!
Or You are saying to re-assemble with lrrd1
parameter.
Every successful (with no error shown) assembly is showing the java error.
I will check the pacbio read names to see duplication
Yes, please reassemble the sample with ERROR: Error assembling long reads!
using the option -lrrd 1
.
Would you be able share your long reads for one of the samples? If you are ok, you can send it to kmnip@bcgsc.ca
Thanks!
I was able to test on one sample set and it seems that read names were an issues as In the second round, I did not get the Java error.
Below showing comparisons that may interest you. I am still trying to make sense (Note that this is only preliminary comparison, I do intent to run BUSCO and runQUAST as standard comparison).
This is a hybrid assembly with Illumina reads PE and stranded( 2 replicates were merged to make one file, 216,670,190 as total reads, 2*150bp) and PacBio HiFi ( Reads information etc is shown below). When I provided the "Polished reads from IsoSeq"
Metrics | Polished_Reads_RNASPADES | Polished_Reads_RNABloom |
---|---|---|
Total bases | 271,334,737 | 83,541,526 |
Total reads | 325,774 | 26,436 |
max_len | 28,330 | 10,907 |
mean_len | 832.89 | 3160.14 |
median_len | 265 | 2977 |
min_len | 55 | 200 |
stddev_len | 1495.21 | 1304.62 |
RNASpades assembled 325K reads with max len of 28K but mean len is 832K. At the same time, RNABloom assembled 26K reads with max len of 10K with mean len of 3KB. Surprising that min len from spades is 55, compared to RNAbloom of 220bp (I expect latter with 150bp of lllumina reads).
At this point of time in analysis hard to say much, but you think RNABloom is doing a better job? I am just but puzzled about 26K reads assembled with 10K as max len. Would have expected a bit more. (Note in the above input set, I got the Java error from RNABloom)
Next, I tried to input reads that have been collapsed and further corrected by cupcake ( from SQANTI2) This is for PacBio Only. Below is shown comparison between polish and post cupcake
Metrics | Polished | Cupcake |
---|---|---|
Total bases | 83,728,011 | 70,380,897 |
Total reads | 26,602 | 22,407 |
max_len | 10,907 | 10,907 |
mean_len | 3147.43 | 3141.02 |
median_len | 2971 | 2966 |
min_len | 51 | 81 |
stddev_len | 1316.72 | 1317.45 |
As you can see from above that there is not much difference between polished reads and collapsed reads. Collapsing/processing reduced total input reads from 26K to 22K.
Next used the cupcake collapsed reads for assembly and below is a comparision
Metrics | Cupcake_RNASPADES | Cupcake_RNABloom |
---|---|---|
Total bases | 270,978,644 | 12,158,931 |
Total reads | 325,717 | 3,560 |
max_len | 28,330 | 9,955 |
mean_len | 831.95 | 3415.43 |
median_len | 265 | 3233.5 |
min_len | 55 | 356 |
stddev_len | 1493.59 | 1286.82 |
No Java error was shown by RNAbloom this time.
There is an impact with change of input data on RNABloom then on RNASpades assembly. From 26K reads to 3K reads. There is a significant increase in mean read len, as slight impact ( decrease) on max contig len
Like I said, I am still very early in stage and will do a BUSCO and runQUAST as standard comparison. However, based on this limited set, you think RNABloom worked well with more processed data set ( cupcake)?
I myself cant say much unless I map back reads to reference. But over all, I did hoped to get larger contigs from assemblers ( like RNASpades, but at the same time, 325,774 reads vs 26,436 by RNABloom also is very contrasting). I will post my further findings here soon.
Would welcome any thoughts you have.
Hi @sum732 ,
Thanks for reporting these results!
The real difference here is that RNA-Bloom (in its currently experimental hybrid mode) only uses the short reads to polish the long-read assembly; there were no assembly of the short reads. That is why you see only 26,436 assembled transcripts. Of course, if there were more long reads, you would definitely see much more assembled transcripts.
Thanks Ka Ming for clarification. Its true that we don't have much DOC in PacBio reads at this time. Do you plan to release any newer version anytime soon?
From above, is it better to provide more filtering reads ( from cupcake_SQANTI) to RNABloom?
I try to make new releases regularly. I do plan to make a new release (hopefully later this week or early next week) that includes the PacBio option I mentioned before along with a few bugfixes.
Given the low read depth, providing more reads would certainly help the assembly. You may consider using multiple samples of long reads in a single assembly.
Great, I will keep my eyes on the next version. I am providing a combined reads from 4 ZMW cells. We do plan to repeat with Sequel2
Hi @sum732 ,
As promised, I have made a new release with the PacBio option. Please see: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.3.1
The -lrpb
option simply uses the map-pb
and ava-pb
modes in minimap2.
Ka Ming
Great, Thanks! I will try it and let you know the results.
I tried to do conda update, but v1.3.1 is not been seen
https://anaconda.org/search?q=rnabloom 1.3.0 is onlt shown on all (anaconda/conda).
I got the Jar file and got the assembly going.
Got following message:
ERROR: Error assembling long reads!
Following in the :
rnabloom.transcripts.fa.log
[racon::Polisher::initialize] loaded target sequences 0.541750 s
[racon::Polisher::initialize] loaded sequences 0.547950 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence transcript!
Command used:
java -jar ~/Research/Programs/RNABloom/RNA-Bloom_v1.3.1/RNA-Bloom.jar -left ../Sample_363_364_R2.fastq -right ../Sample_363_364_R1.fastq -long ../../../ZMW12345_Case.polished.hq.fasta -lrpb -ntcard -t 12 -outdir Hybrid_RNABloomv1.3.1
It looks like that's the same error you reported earlier in the first post.
Please check for duplicated read names in the long reads.
If you don't have a a lot of input reads, then you need use the option -lrrd 1
. See if that helps.
Automatic update on bioconda usually requires someone to review the integration of new software versions. Having said that, bioconda should have the new version of RNA-Bloom now.
Yea sorry about that, realised later. I changed the input and here is the output
stats_info bases 12272180
stats_info reads 3572
stats_len max 9955
stats_len mean 3435.66
stats_len min 356
So only 3K transcripts were found. Much less than rnaSpades. Table earlier shows the assembled rnaSpades.
Yes, it's because there is no assembly of short reads, as mentioned before.
Dear @kmnip
I met the same error message but a new versoin: v2.0.1 with pacbio long reads.
[racon::Polisher::initialize] loaded target sequences 0.951644 s
[racon::Polisher::initialize] loaded sequences 12.405277 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence ERR1346012.32985!
It seems that the length in paf
file is not equal to the length in reads file.
How could I solve it ?
Best Johnson
@Johnsonzcode
Can you list the output files in your directory? (e.g. via ls -lh
)
Can you also please check whether your reads have any duplicated IDs?
Of cource.
(rnabloom) [user01@cu01 OUTDIR]$ l
total 681M
-rw-rw-r-- 1 user01 user01 235 Apr 7 16:21 rnabloom.longreads.assembly4.pol.fa.log
-rw-rw-r-- 1 user01 user01 0 Apr 7 16:21 rnabloom.longreads.assembly4.pol.fa
-rw-rw-r-- 1 user01 user01 13M Apr 7 16:21 rnabloom.longreads.assembly3.map.paf.gz
-rw-rw-r-- 1 user01 user01 941 Apr 7 16:21 rnabloom.longreads.assembly3.map.paf.gz.log
-rw-rw-r-- 1 user01 user01 25M Apr 7 16:20 rnabloom.longreads.assembly2.uni.fa.gz
-rw-rw-r-- 1 user01 user01 26K Apr 7 16:20 rnabloom.longreads.assembly1.nr.fa.gz.dot.gz
-rw-rw-r-- 1 user01 user01 759 Apr 7 16:20 rnabloom.longreads.assembly2.uni.fa.gz.log
-rw-rw-r-- 1 user01 user01 27M Apr 7 16:20 rnabloom.longreads.assembly1.nr.fa.gz
-rw-rw-r-- 1 user01 user01 964 Apr 7 16:20 rnabloom.longreads.assembly1.nr.fa.gz.log
-rw-rw-r-- 1 user01 user01 75 Apr 7 16:13 STARTED
-rw-rw-r-- 1 user01 user01 43 Apr 7 16:13 rnabloom.ntcard.readslist.txt
-rw-rw-r-- 1 user01 user01 0 Apr 7 14:27 LONGREADS.CORRECTED
-rw-rw-r-- 1 user01 user01 300M Apr 7 14:27 rnabloom.longreads.corrected.long.seed.fa.gz
-rw-rw-r-- 1 user01 user01 73K Apr 7 14:19 rnabloom.longreads.corrected.polya.txt.gz
-rw-rw-r-- 1 user01 user01 4.4K Apr 7 14:19 rnabloom.longreads.corrected.repeats.fa.gz
-rw-rw-r-- 1 user01 user01 1.6M Apr 7 14:19 rnabloom.longreads.corrected.short.fa.gz
-rw-rw-r-- 1 user01 user01 317M Apr 7 14:19 rnabloom.longreads.corrected.long.fa.gz
-rw-rw-r-- 1 user01 user01 48K Apr 7 14:19 rnabloom.longreads.corrected.long.lengths.txt
-rw-rw-r-- 1 user01 user01 22 Apr 7 14:17 rnabloom_k35.hist.log
-rw-rw-r-- 1 user01 user01 502K Apr 7 14:17 rnabloom_k35.hist
Yes, you are right. There is duplicated IDs.
(samtools) [user01@cu01 female]$ wc -l name.txt
1300846 name.txt
(samtools) [user01@cu01 female]$ sort -u name.txt |wc -l
560689
I will check my data.
Thank you so much!
Hi Ka Ming,
I tried to do hybrid assembly with Illumina PE reads and PacBio long reads. Got "Error assembling long reads". Here is a log that may be of help
I am not sure what the actual error is from here.
Searching more logs and came across: "ranbloom.transcripts.fa.log"
Not sure what does this means. PacBio reads will be of different lengths, Illumina PE reads are of same length 2"150bp. Let me what else I can provide to help you. May be I would have to wait for you to release PacBio optimized version.
Thanks!