bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
85 stars 7 forks source link

Error assembling long reads #8

Closed sum732 closed 4 years ago

sum732 commented 4 years ago

Hi Ka Ming,

I tried to do hybrid assembly with Illumina PE reads and PacBio long reads. Got "Error assembling long reads". Here is a log that may be of help

RNA-Bloom v1.3.0
args: [-left, ../Sample_363_364_R2.fastq, -right, ../Sample_363_364_R1.fastq, -long, ../../../ZMW12345_Control.polished.hq.fasta, -ntcard, -t, 10, -outdir, Hybrid_Control_RNABloom]

name:   rnabloom
outdir: Hybrid_Control_RNABloom
WARNING: Output directory does not exist!
Created output directory at `Hybrid_Control_RNABloom`

K-mer counting with ntCard...
Running command: `ntcard -t 10 -k 17 -c 65535 -p Hybrid_Control_RNABloom/rnabloom @Hybrid_Control_RNABloom/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Hybrid_Control_RNABloom/rnabloom_k17.hist`...
Unique k-mers (k=17): 713,045,380
Min k-mer coverage threshold: 2
K-mer counting completed in 9m 27s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       1.5757214
k-mer counting:        5.7748213
====================================
Total:                 7.3505425

> Stage 1: Construct graph from reads (k=17)
[1] Parsing `../Sample_363_364_R2.fastq`...
[3] Parsing `../../../ZMW12345_Control.polished.hq.fasta`...
[2] Parsing `../Sample_363_364_R1.fastq`...
[3] Parsed 22,862 sequences.
[2] Parsed 217,746,749 sequences.
[1] Parsed 217,746,749 sequences.
Parsed 435,516,360 reads in total.
DBG Bloom filter FPR:                 0.9953146 %
Counting Bloom filter FPR:            1.0024519 %
> Stage 1 completed in 2h 40m 34s

> Stage 2: Correct long reads for "rnabloom"
Parsing `../../../ZMW12345_Control.polished.hq.fasta`...
Corrected Read Lengths Sampling Distribution (n=1000)
        min     q1      med     q3      max
        5540    6298    6778    7486    11727
Parsed 22,862 sequences.
        Kept:      22,659(99.11206%)
        Discarded: 203(0.8879363%)
        Artifacts: 43(0.18808503%)
> Stage 2 completed in 3m 2s

> Stage 3: Assemble long reads for "rnabloom"
Overlapped sequences: 1
         - artifacts: 1
         - unique:    1
before: 22,659  after: 22,659
ERROR: Error assembling long reads!

I am not sure what the actual error is from here.

Searching more logs and came across: "ranbloom.transcripts.fa.log"

[racon::Polisher::initialize] loaded target sequences 0.433007 s
[racon::Polisher::initialize] loaded sequences 0.469123 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence transcript!

Not sure what does this means. PacBio reads will be of different lengths, Illumina PE reads are of same length 2"150bp. Let me what else I can provide to help you. May be I would have to wait for you to release PacBio optimized version.

Thanks!

sum732 commented 4 years ago

Just to add. I had submitted 4 different jobs with different files (Ilumina PE and PacBio )and it seems 1 failed error shown is above. Two finished and both gave the following message

java.lang.StringIndexOutOfBoundsException: begin 0, end 206, length 200
        at java.base/java.lang.String.checkBoundsBeginEnd(String.java:3319)
        at java.base/java.lang.String.substring(String.java:1874)
        at rnabloom.olc.Layout.layoutBackbones(Layout.java:745)
        at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:451)
        at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:196)
        at rnabloom.olc.OverlapLayoutConcensus.overlapWithMinimapAndLayout(OverlapLayoutConcensus.java:167)
        at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:281)
        at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:2775)
        at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:4570)
        at rnabloom.RNABloom.main(RNABloom.java:6327)

However, In both the above case, I can see following files:

0       LONGREADS.ASSEMBLED
0       LONGREADS.CORRECTED
514K    rnabloom_k17.hist
75M     rnabloom.longreads.assembly_backbones.fa
757     rnabloom.longreads.assembly_backbones.fa.log
81M     rnabloom.longreads.corrected.long.fa
7.8K    rnabloom.longreads.corrected.repeats.fa
0       rnabloom.longreads.corrected.short.fa
153     rnabloom.ntcard.readslist.txt
36      rnabloom.longreads.corrected.long.fa
225     STARTED

I guess from all the files, file titled "rnabloom.longreads.corrected.long.fa" contains the assembled contigs? and this is the main file to be looked at. I don't know the importance ( or lack) other files that are created but empty.

What is the Java error and should assembly be deemed as susccess?

Finally, one last one is still processing at the time of writing.

kmnip commented 4 years ago

Can you post the content of this file?

rnabloom.longreads.assembly_backbones.fa.log
kmnip commented 4 years ago

rnabloom.longreads.corrected.long.fa is the error-corrected reads rnabloom.longreads.assembly_backbones.fa is the assembled sequence but not polished

If the long-read assembly were successful, then you should see rnabloom.transcripts.fa, which is the polished assembly.

sum732 commented 4 years ago

Can you post the content of this file?

rnabloom.longreads.assembly_backbones.fa.log

Sure here it is

[M::mm_idx_gen::3.344*1.22] collected minimizers
[M::mm_idx_gen::4.027*2.19] sorted minimizers
[M::main::4.027*2.19] loaded/built the index for 22659 target sequence(s)
[M::mm_mapopt_update::4.331*2.11] mid_occ = 173
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 22659
[M::mm_idx_stat::4.517*2.06] distinct minimizers: 8341171 (41.34% are singletons); average occurrences: 3.073; average spacing: 2.946
[M::worker_pipeline::15.240*6.71] mapped 22659 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: minimap2 -x ava-ont -r 150 -t 10 Hybrid_Control_RNABloom/rnabloom.longreads.corrected.long.fa Hybrid_Control_RNABloom/rnabloom.longreads.corrected.long.fa
[M::main] Real time: 15.326 sec; CPU: 102.338 sec; Peak RSS: 8.730 GB
sum732 commented 4 years ago

the java error (posted above) can be safely ignored ?

kmnip commented 4 years ago

Thanks! That confirms the minimap overlapping executed successfuly.

For the sample without the Java error, can you re-assemble with this option -lrrd 1?

For samples with the Java error, can you please check whether you have any duplicated read names in the long reads?

sum732 commented 4 years ago

So I should ignore the error message, as minimap completed ? ERROR: Error assembling long reads! Or You are saying to re-assemble with lrrd1 parameter.

Every successful (with no error shown) assembly is showing the java error.

I will check the pacbio read names to see duplication

kmnip commented 4 years ago

Yes, please reassemble the sample with ERROR: Error assembling long reads! using the option -lrrd 1.

Would you be able share your long reads for one of the samples? If you are ok, you can send it to kmnip@bcgsc.ca

Thanks!

sum732 commented 4 years ago

I was able to test on one sample set and it seems that read names were an issues as In the second round, I did not get the Java error.

Below showing comparisons that may interest you. I am still trying to make sense (Note that this is only preliminary comparison, I do intent to run BUSCO and runQUAST as standard comparison).

This is a hybrid assembly with Illumina reads PE and stranded( 2 replicates were merged to make one file, 216,670,190 as total reads, 2*150bp) and PacBio HiFi ( Reads information etc is shown below). When I provided the "Polished reads from IsoSeq"

Metrics Polished_Reads_RNASPADES Polished_Reads_RNABloom
Total bases 271,334,737 83,541,526
Total reads 325,774 26,436
max_len 28,330 10,907
mean_len 832.89 3160.14
median_len 265 2977
min_len 55 200
stddev_len 1495.21 1304.62

RNASpades assembled 325K reads with max len of 28K but mean len is 832K. At the same time, RNABloom assembled 26K reads with max len of 10K with mean len of 3KB. Surprising that min len from spades is 55, compared to RNAbloom of 220bp (I expect latter with 150bp of lllumina reads).

At this point of time in analysis hard to say much, but you think RNABloom is doing a better job? I am just but puzzled about 26K reads assembled with 10K as max len. Would have expected a bit more. (Note in the above input set, I got the Java error from RNABloom)

Next, I tried to input reads that have been collapsed and further corrected by cupcake ( from SQANTI2) This is for PacBio Only. Below is shown comparison between polish and post cupcake

Metrics Polished Cupcake
Total bases 83,728,011 70,380,897
Total reads 26,602 22,407
max_len 10,907 10,907
mean_len 3147.43 3141.02
median_len 2971 2966
min_len 51 81
stddev_len 1316.72 1317.45

As you can see from above that there is not much difference between polished reads and collapsed reads. Collapsing/processing reduced total input reads from 26K to 22K.

Next used the cupcake collapsed reads for assembly and below is a comparision

Metrics Cupcake_RNASPADES Cupcake_RNABloom
Total bases 270,978,644 12,158,931
Total reads 325,717 3,560
max_len 28,330 9,955
mean_len 831.95 3415.43
median_len 265 3233.5
min_len 55 356
stddev_len 1493.59 1286.82

No Java error was shown by RNAbloom this time.

There is an impact with change of input data on RNABloom then on RNASpades assembly. From 26K reads to 3K reads. There is a significant increase in mean read len, as slight impact ( decrease) on max contig len

Like I said, I am still very early in stage and will do a BUSCO and runQUAST as standard comparison. However, based on this limited set, you think RNABloom worked well with more processed data set ( cupcake)?

I myself cant say much unless I map back reads to reference. But over all, I did hoped to get larger contigs from assemblers ( like RNASpades, but at the same time, 325,774 reads vs 26,436 by RNABloom also is very contrasting). I will post my further findings here soon.

Would welcome any thoughts you have.

kmnip commented 4 years ago

Hi @sum732 ,

Thanks for reporting these results!

The real difference here is that RNA-Bloom (in its currently experimental hybrid mode) only uses the short reads to polish the long-read assembly; there were no assembly of the short reads. That is why you see only 26,436 assembled transcripts. Of course, if there were more long reads, you would definitely see much more assembled transcripts.

sum732 commented 4 years ago

Thanks Ka Ming for clarification. Its true that we don't have much DOC in PacBio reads at this time. Do you plan to release any newer version anytime soon?

From above, is it better to provide more filtering reads ( from cupcake_SQANTI) to RNABloom?

kmnip commented 4 years ago

I try to make new releases regularly. I do plan to make a new release (hopefully later this week or early next week) that includes the PacBio option I mentioned before along with a few bugfixes.

Given the low read depth, providing more reads would certainly help the assembly. You may consider using multiple samples of long reads in a single assembly.

sum732 commented 4 years ago

Great, I will keep my eyes on the next version. I am providing a combined reads from 4 ZMW cells. We do plan to repeat with Sequel2

kmnip commented 4 years ago

Hi @sum732 ,

As promised, I have made a new release with the PacBio option. Please see: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.3.1

The -lrpb option simply uses the map-pb and ava-pb modes in minimap2.

Ka Ming

sum732 commented 4 years ago

Great, Thanks! I will try it and let you know the results.

sum732 commented 4 years ago

I tried to do conda update, but v1.3.1 is not been seen

https://anaconda.org/search?q=rnabloom 1.3.0 is onlt shown on all (anaconda/conda).

sum732 commented 4 years ago

I got the Jar file and got the assembly going. Got following message: ERROR: Error assembling long reads!

Following in the : rnabloom.transcripts.fa.log

[racon::Polisher::initialize] loaded target sequences 0.541750 s
[racon::Polisher::initialize] loaded sequences 0.547950 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence transcript!

Command used: java -jar ~/Research/Programs/RNABloom/RNA-Bloom_v1.3.1/RNA-Bloom.jar -left ../Sample_363_364_R2.fastq -right ../Sample_363_364_R1.fastq -long ../../../ZMW12345_Case.polished.hq.fasta -lrpb -ntcard -t 12 -outdir Hybrid_RNABloomv1.3.1

kmnip commented 4 years ago

It looks like that's the same error you reported earlier in the first post. Please check for duplicated read names in the long reads. If you don't have a a lot of input reads, then you need use the option -lrrd 1. See if that helps.

kmnip commented 4 years ago

Automatic update on bioconda usually requires someone to review the integration of new software versions. Having said that, bioconda should have the new version of RNA-Bloom now.

sum732 commented 4 years ago

Yea sorry about that, realised later. I changed the input and here is the output

stats_info      bases   12272180
stats_info      reads   3572
stats_len       max     9955
stats_len       mean    3435.66
stats_len       min     356

So only 3K transcripts were found. Much less than rnaSpades. Table earlier shows the assembled rnaSpades.

kmnip commented 4 years ago

Yes, it's because there is no assembly of short reads, as mentioned before.

Johnsonzcode commented 1 year ago

Dear @kmnip

I met the same error message but a new versoin: v2.0.1 with pacbio long reads.

[racon::Polisher::initialize] loaded target sequences 0.951644 s
[racon::Polisher::initialize] loaded sequences 12.405277 s
[racon::Overlap::transmute] error: unequal lengths in sequence and overlap file for sequence ERR1346012.32985!

It seems that the length in paf file is not equal to the length in reads file. How could I solve it ?

Best Johnson

kmnip commented 1 year ago

@Johnsonzcode Can you list the output files in your directory? (e.g. via ls -lh) Can you also please check whether your reads have any duplicated IDs?

Johnsonzcode commented 1 year ago

Of cource.

(rnabloom) [user01@cu01 OUTDIR]$ l
total 681M
-rw-rw-r-- 1 user01 user01  235 Apr  7 16:21 rnabloom.longreads.assembly4.pol.fa.log
-rw-rw-r-- 1 user01 user01    0 Apr  7 16:21 rnabloom.longreads.assembly4.pol.fa
-rw-rw-r-- 1 user01 user01  13M Apr  7 16:21 rnabloom.longreads.assembly3.map.paf.gz
-rw-rw-r-- 1 user01 user01  941 Apr  7 16:21 rnabloom.longreads.assembly3.map.paf.gz.log
-rw-rw-r-- 1 user01 user01  25M Apr  7 16:20 rnabloom.longreads.assembly2.uni.fa.gz
-rw-rw-r-- 1 user01 user01  26K Apr  7 16:20 rnabloom.longreads.assembly1.nr.fa.gz.dot.gz
-rw-rw-r-- 1 user01 user01  759 Apr  7 16:20 rnabloom.longreads.assembly2.uni.fa.gz.log
-rw-rw-r-- 1 user01 user01  27M Apr  7 16:20 rnabloom.longreads.assembly1.nr.fa.gz
-rw-rw-r-- 1 user01 user01  964 Apr  7 16:20 rnabloom.longreads.assembly1.nr.fa.gz.log
-rw-rw-r-- 1 user01 user01   75 Apr  7 16:13 STARTED
-rw-rw-r-- 1 user01 user01   43 Apr  7 16:13 rnabloom.ntcard.readslist.txt
-rw-rw-r-- 1 user01 user01    0 Apr  7 14:27 LONGREADS.CORRECTED
-rw-rw-r-- 1 user01 user01 300M Apr  7 14:27 rnabloom.longreads.corrected.long.seed.fa.gz
-rw-rw-r-- 1 user01 user01  73K Apr  7 14:19 rnabloom.longreads.corrected.polya.txt.gz
-rw-rw-r-- 1 user01 user01 4.4K Apr  7 14:19 rnabloom.longreads.corrected.repeats.fa.gz
-rw-rw-r-- 1 user01 user01 1.6M Apr  7 14:19 rnabloom.longreads.corrected.short.fa.gz
-rw-rw-r-- 1 user01 user01 317M Apr  7 14:19 rnabloom.longreads.corrected.long.fa.gz
-rw-rw-r-- 1 user01 user01  48K Apr  7 14:19 rnabloom.longreads.corrected.long.lengths.txt
-rw-rw-r-- 1 user01 user01   22 Apr  7 14:17 rnabloom_k35.hist.log
-rw-rw-r-- 1 user01 user01 502K Apr  7 14:17 rnabloom_k35.hist

Yes, you are right. There is duplicated IDs.

(samtools) [user01@cu01 female]$ wc -l name.txt
1300846 name.txt
(samtools) [user01@cu01 female]$ sort -u name.txt |wc -l
560689

I will check my data.

Thank you so much!