Closed sid5427 closed 4 years ago
Hi,
Thanks for using RNA-Bloom!
I have several questions for you that would help me trouble-shoot the problem:
The final stage of long-read assembly in RNA-Bloom can use quite a bit of memory for large samples (~6 million reads in my human direct RNA dataset). I do have a fix for this to be included in the next version, which is coming later this week.
When you encounter a Java heap space error, you would have to increase the JVM heap space (-Xmx) and increase the memory allocation for your Slurm job.
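As a rough illustration of the advice above, the sketch below sizes -Xmx slightly under the Slurm allocation and assembles the command line. The 0.9 headroom factor and the helper names are assumptions made for illustration; the RNA-Bloom options are only the ones that already appear in this thread.

```python
import shlex


def jvm_heap_gb(mem_per_cpu_gb: int, cpus: int, headroom: float = 0.9) -> int:
    """Heap size in GB, kept below the total Slurm allocation.

    The 0.9 headroom factor is an assumption: the JVM also needs memory
    outside the heap, so -Xmx should not equal the whole job allocation.
    """
    return int(mem_per_cpu_gb * cpus * headroom)


def build_command(mem_per_cpu_gb: int, cpus: int, reads: str, outdir: str) -> list:
    """Build an srun + java argument list with a matching -Xmx value."""
    heap = jvm_heap_gb(mem_per_cpu_gb, cpus)
    return [
        "srun", f"--mem-per-cpu={mem_per_cpu_gb}G", "-c", str(cpus),
        "java", f"-Xmx{heap}g", "-jar", "RNA-Bloom.jar",
        "-long", reads, "-ntcard", "-t", str(cpus), "-outdir", outdir,
    ]


if __name__ == "__main__":
    # 16G per CPU on 8 CPUs, as in the srun command posted later in this thread.
    print(shlex.join(build_command(16, 8, "nanopore_file.fastq", "rnabloom_assembly")))
```

The point is only that the two numbers move together: raising -Xmx without raising the Slurm request gets the job killed by the scheduler, and raising the Slurm request without -Xmx leaves the JVM at its default heap limit.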
You don't need to provide short RNA-seq reads for long read assembly, but the short reads certainly do help with the initial error correction stage.
To use both short reads and long reads:
java -jar RNA-Bloom.jar -left short_reads_1.fastq -right short_reads_2.fq -rcr -long nanopore_reads.fq ...
Thanks! Ka Ming
Hi Ka Ming, Thanks for replying -
I have 131585 PacBio CCS reads & 8656693 nanopore reads (including smaller reads). At this time, I have only tried running RNA-Bloom with the nanopore reads. Unfortunately, after checking my logs, it seems I ran it with all reads, including the smaller ones that I should have filtered out - I should have used only reads larger than 200bp (or maybe 1000bp+).
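For anyone wanting to apply the same length cutoff, here is a minimal sketch of a FASTQ length filter. It assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name and the in-memory sample are made up for illustration, and dedicated tools can do the same job.

```python
import io


def filter_fastq_by_length(handle, min_len=200):
    """Yield 4-line FASTQ records whose sequence has at least min_len bases."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        if len(seq.strip()) >= min_len:
            yield header + seq + plus + qual


if __name__ == "__main__":
    # Two toy records: one of 5 bases, one of 250 bases.
    sample = io.StringIO(
        "@short\nACGTA\n+\nIIIII\n"
        "@long\n" + "A" * 250 + "\n+\n" + "I" * 250 + "\n"
    )
    kept = list(filter_fastq_by_length(sample, min_len=200))
    print(len(kept), "record(s) kept")
```

For a real file, the same generator can be fed a handle from `open(...)` or `gzip.open(..., "rt")` and the yielded records written to a new FASTQ.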
Command used to run RNAbloom
(srun --mem-per-cpu=16G --job-name=RNA_bloom -c 8 -p <partition> -A <labname> -t 2-00:00 \
java -jar ../RNA-Bloom.jar -long nanopore_file.fastq \
-ntcard -t 8 -outdir rnabloom_assembly) >& rnabloom_assembly/rnabloom_run_1.txt &
Last 20 lines of log file
Assembling cluster `21533`...
Assembling cluster `21534`...
Assembling cluster `21535`...
Assembling cluster `21536`...
Assembling cluster `21537`...
Overlapped sequences: 2,188,104
- unique: 2,087,630
- dovetail: 603,728
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.BitSet.initWords(BitSet.java:167)
at java.base/java.util.BitSet.<init>(BitSet.java:162)
at org.jgrapht.alg.TransitiveReduction.reduce(TransitiveReduction.java:145)
at rnabloom.olc.Layout.layoutBackbones(Layout.java:480)
at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:338)
at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:169)
at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:296)
at rnabloom.RNABloom.assembleLongReads(RNABloom.java:2447)
at rnabloom.RNABloom.assembleLongReads(RNABloom.java:3773)
at rnabloom.RNABloom.main(RNABloom.java:5088)
srun: error: lewis4-r630-htc4-node285: task 0: Exited with exit code 1
I did try to increase the heapsize, but apparently the heapsize is already set at max for our cluster or probably not accessible to me, which is strange as it should be user accessible - have done it before on our lab servers, and my laptop.
Any help with this issue would be appreciated! ---Sidharth Sen
Thanks for reporting these details.
The main issue is there is one cluster (ie. "21537") that has a large number of reads (ie. 2.19 million reads), causing the memory usage to blow up. Adjusting the JVM heap size won't help.
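A back-of-envelope sketch of why one big cluster can exhaust any realistic heap. It assumes the transitive reduction step allocates one BitSet of n bits per vertex, which the BitSet allocations in the stack trace suggest but which is not confirmed in this thread. The vertex count used below is also an assumption: in the later v1.2.2 log, |V|=90,458 is exactly twice the dovetail count of 45,229, so the demo doubles the dovetail count from the log above.

```python
def bitset_matrix_bytes(n_vertices: int) -> int:
    """Bytes for n bit sets of n bits each, rounded up to 64-bit words."""
    words_per_row = (n_vertices + 63) // 64
    return n_vertices * words_per_row * 8


if __name__ == "__main__":
    dovetail = 603_728       # from the failing log above
    n = 2 * dovetail         # assumed vertex count, see the note above
    gib = bitset_matrix_bytes(n) / 2**30
    print(f"n={n:,} vertices -> about {gib:.0f} GiB for the bit matrix")
```

Under these assumptions the matrix alone comes to well over 100 GiB, and it grows with the square of the cluster size, so doubling the heap buys very little.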
I will make a new release of RNA-Bloom by the end of this week. You can possibly try the new version then. I will keep you posted. Thank you!
Hi @sid5427,
I have made a new release: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.2.2 Let me know if that helps with the memory issue.
Ka Ming
Hi Ka Ming, Awesome - I was anyway going to try a run after removing sequences less than 200bp. Now I'll update RNA-bloom and report back the results. Thanks for the quick turn around on this.
Hi Ka Ming, So I finally got back to running RNA-bloom on my dataset. I used the updated version and also updated all the other dependencies just to be sure. Unfortunately, it seems to be having another issue, different from last time I think.
I used the same command, settings and nanopore file (cDNA) as before. Please do note, the nanopore file is for a maize transcriptome, so quite large.
last 20 lines of the log file (the last line repeats about 200 times.)-->
> Stage 4: Assemble long reads for "rnabloom"
Total of 810306 clusters to be assembled
Assembling cluster `0`...
Overlapped sequences: 4,588,506
- artifacts: 6
- unique: 4,404,862
- dovetail: 45,229
G: |V|=90,458 |E|=2,069,428
Exception in thread "main" java.lang.StackOverflowError
at java.base/java.util.HashMap.putVal(HashMap.java:643)
at java.base/java.util.HashMap.put(HashMap.java:607)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:304)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
full log file to download --> https://1drv.ms/t/s!Alr23pNlJf37isFR_8SobzqMiuA9sQ?e=HrHYcl
Any help with this is appreciated! --Sidharth Sen
Hi Sidharth,
Thanks for reporting the error! I also encountered the same error for my plant dataset. I have a fix for it, please use the latest release: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.2.3
Thanks, Ka Ming
Hi Ka Ming, Awesome - I'll give it a try asap. I checked bioconda's repo site - It's not yet updated to the latest version. Could you please push it to there as well? https://anaconda.org/bioconda/rnabloom
Thanks & Regards ----Sidharth Sen
The automatic pull-request usually takes a while after a fresh release, but I confirm that it has been updated.
Hi Ka Ming, I can confirm it has updated as well, and I have already started a fresh run with the updated tool. Will report back how it goes. Thanks for your help --Sidharth Sen
Hi Ka Ming and Sidharth,
I've been following this thread as I was also experiencing the same memory issue that Sidharth reported at the beginning of the thread when using RNA-Bloom version 1.2.0. I've tried increasing my heap size up to 1 TB and I've filtered out all of the small reads, but I am still encountering this issue. Normally the software runs for about 8 days before dying with the following error (end of log file). I've checked the cluster usage and it is running out of memory:
Assembling cluster `15241`...
Overlapped sequences: 2,823,951
- unique: 2,341,740
- dovetail: 3,410,262
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.BitSet.initWords(BitSet.java:167)
at java.base/java.util.BitSet.<init>(BitSet.java:162)
at org.jgrapht.alg.TransitiveReduction.reduce(TransitiveReduction.java:145)
at rnabloom.olc.Layout.layoutBackbones(Layout.java:480)
at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:338)
at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:169)
at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:296)
at rnabloom.RNABloom.assembleLongReads(RNABloom.java:2447)
at rnabloom.RNABloom.assembleLongReads(RNABloom.java:3773)
at rnabloom.RNABloom.main(RNABloom.java:5088)
I've also now tried to run version 1.2.3 as suggested, but this is also failing. However, this fails after only 30 minutes and it's not running out of memory (only 6G used out of a possible 1T). As a note I'm using JDK version 12.0.1. Here is the log file:
RNA-Bloom v1.2.3
args: [-long, Basecalled_Final/Filter.fq.gz, -outdir, RNA_Bloom_v1.2.3/, -threads, 8]
name: rnabloom
outdir: RNA_Bloom_v1.2.3/
Min k-mer coverage threshold: 3
Bloom filters Memory (GB)
====================================
de Bruijn graph: 0.8594979
k-mer counting: 3.4379916
====================================
Total: 4.2974896
> Stage 1: Construct graph from reads (k=17)
[1] Parsing `Basecalled_Final/Filter.fq.gz`...
[1] Parsed 9,086,352 sequences.
Parsed 9,086,352 reads in total.
DBG Bloom filter FPR: 8.421373 %
Counting Bloom filter FPR: 4.02473 %
WARNING: Bloom filter FPR is higher than the maximum allowed FPR (1.0%)!
Adjusting Bloom filter sizes...
ERROR: null
java.lang.NullPointerException
at rnabloom.RNABloom.getOptimalBloomFilterSizes(RNABloom.java:786)
at rnabloom.RNABloom.main(RNABloom.java:5589)
Any ideas!?
All the Best,
Sophie
Hi Sophie,
The OutOfMemoryError and StackOverflowError both came from one module of the graph library I was using. I gave up on that module and I implemented my own solution in v1.2.3 that is faster and uses a lot less memory. So, you shouldn't have issues in that part of the assembly.
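The StackOverflowError trace earlier in the thread shows a depth-first search calling itself over and over; on a graph with very long paths, that recursion depth can exceed the thread stack regardless of heap size. The sketch below shows the general technique of replacing recursion with an explicit stack. It is a generic illustration only, not the replacement code in v1.2.3, which is not shown in this thread.

```python
def reachable_iterative(adj, start):
    """Depth-first traversal using an explicit stack instead of recursion."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


if __name__ == "__main__":
    # A long chain 0 -> 1 -> 2 -> ...: a recursive DFS would need one
    # nested call per vertex, while the explicit stack lives on the heap.
    n = 200_000
    chain = {i: [i + 1] for i in range(n - 1)}
    print(len(reachable_iterative(chain, 0)), "vertices visited")
```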
The NullPointerException is a bug. Thanks for reporting it, I will fix it in the next release.
To bypass the error, you can either set the appropriate Bloom filter size by using the -ntcard option, ie.
-ntcard -long Basecalled_Final/Filter.fq.gz -outdir RNA_Bloom_v1.2.3/ -threads 8
or if you don't have ntCard installed, simply allocate more memory for the Bloom filters, ie.
-mem 16 -long Basecalled_Final/Filter.fq.gz -outdir RNA_Bloom_v1.2.3/ -threads 8
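To see why giving the Bloom filters more memory lowers the false positive rate (FPR) reported in the log, here is the textbook sizing formula for a plain Bloom filter with an optimal number of hash functions. This is not RNA-Bloom's internal calculation, a counting Bloom filter needs more than one bit per cell, and the element count in the demo is a made-up placeholder rather than a number from this dataset.

```python
import math


def bloom_bits_per_element(target_fpr: float) -> float:
    """Textbook bits per inserted element for an optimally tuned Bloom filter."""
    return -math.log(target_fpr) / (math.log(2) ** 2)


def bloom_size_gb(n_elements: int, target_fpr: float) -> float:
    """Filter size in GB (10**9 bytes) for n_elements at target_fpr."""
    return n_elements * bloom_bits_per_element(target_fpr) / 8 / 1e9


if __name__ == "__main__":
    n_kmers = 1_000_000_000  # hypothetical number of distinct k-mers
    for fpr in (0.08, 0.01, 0.005):
        print(f"FPR {fpr:.3f}: {bloom_size_gb(n_kmers, fpr):.2f} GB")
```

The required size grows with the number of distinct k-mers and with the logarithm of 1/FPR, which is why an estimate of the k-mer count (what ntCard provides) or a larger memory budget both address the warning.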
Let me know if that works for you.
Thanks! Ka Ming
Hi Ka Ming.
I have good news, the assembly finished and it terminated successfully. I can see different fasta output files as well.
Assembling cluster `810283`...
Assembling cluster `810284`...
Assembling cluster `810285`...
Assembling cluster `810286`...
Assembling cluster `810287`...
Assembling cluster `810288`...
Assembling cluster `810289`...
Assembling cluster `810290`...
Assembling cluster `810291`...
Assembling cluster `810292`...
Assembling cluster `810293`...
Assembling cluster `810294`...
Assembling cluster `810295`...
Assembling cluster `810296`...
Assembling cluster `810297`...
Assembling cluster `810298`...
Assembling cluster `810299`...
Assembling cluster `810300`...
Assembling cluster `810301`...
Assembling cluster `810302`...
Assembling cluster `810303`...
Assembling cluster `810304`...
Assembling cluster `810305`...
Inter-cluster assembly...
Overlapped sequences: 4,813,477
- artifacts: 22
- unique: 3,444,773
- dovetail: 96,924
G: |V|=193,848 |E|=309,726
G: |V|=193,848 |E|=66,088
before: 8,252,969 after: 6,851,221
> Stage 4 completed in 29h 57m 49s
Total runtime: 29h 57m 49s
Sophie - I had allocated a large amount of RAM to my job - about 480 GB - maxing out the available memory on one node of our cluster - that seemed to have helped, along with the fixes Ka Ming had pushed out earlier.
Ka Ming - since this is my first time working with rnabloom - what output files should I be using for downstream analysis - something similar to what we would do for RNA-seq assembly and then evaluation with BUSCO, rnaQUAST or hisat2, etc?
These are the files that I have as output - I am a bit worried that the LONGREADS.CORRECTED file is empty and there are no files inside the rnabloom.longreads.clusters directory.
-rw-rw-r--. 1 ssen _ 0 Feb 26 15:43 LONGREADS.CORRECTED
-rw-rw-r--. 1 ssen _ 510K Feb 26 10:59 rnabloom_k17.hist
drwxrwsr-x. 2 ssen _ 2 Feb 26 15:43 rnabloom.longreads.clusters
-rw-rw-r--. 1 ssen _ 51M Feb 26 15:43 rnabloom.longreads.corrected.e0.med_q3.fa
-rw-rw-r--. 1 ssen _ 9.0M Feb 26 15:43 rnabloom.longreads.corrected.e0.min_q1.fa
-rw-rw-r--. 1 ssen _ 23M Feb 26 15:43 rnabloom.longreads.corrected.e0.q1_med.fa
-rw-rw-r--. 1 ssen _ 161M Feb 26 15:43 rnabloom.longreads.corrected.e0.q3_max.fa
-rw-rw-r--. 1 ssen _ 8.7M Feb 26 15:43 rnabloom.longreads.corrected.e1.med_q3.fa
-rw-rw-r--. 1 ssen _ 3.1M Feb 26 15:43 rnabloom.longreads.corrected.e1.min_q1.fa
-rw-rw-r--. 1 ssen _ 5.5M Feb 26 15:43 rnabloom.longreads.corrected.e1.q1_med.fa
-rw-rw-r--. 1 ssen _ 35M Feb 26 15:43 rnabloom.longreads.corrected.e1.q3_max.fa
-rw-rw-r--. 1 ssen _ 12M Feb 26 15:43 rnabloom.longreads.corrected.e2.med_q3.fa
-rw-rw-r--. 1 ssen _ 5.0M Feb 26 15:43 rnabloom.longreads.corrected.e2.min_q1.fa
-rw-rw-r--. 1 ssen _ 6.8M Feb 26 15:43 rnabloom.longreads.corrected.e2.q1_med.fa
-rw-rw-r--. 1 ssen _ 65M Feb 26 15:43 rnabloom.longreads.corrected.e2.q3_max.fa
-rw-rw-r--. 1 ssen _ 79M Feb 26 15:43 rnabloom.longreads.corrected.e3.med_q3.fa
-rw-rw-r--. 1 ssen _ 37M Feb 26 15:43 rnabloom.longreads.corrected.e3.min_q1.fa
-rw-rw-r--. 1 ssen _ 51M Feb 26 15:43 rnabloom.longreads.corrected.e3.q1_med.fa
-rw-rw-r--. 1 ssen _ 589M Feb 26 15:43 rnabloom.longreads.corrected.e3.q3_max.fa
-rw-rw-r--. 1 ssen _ 465M Feb 26 15:43 rnabloom.longreads.corrected.e4.med_q3.fa
-rw-rw-r--. 1 ssen _ 134M Feb 26 15:43 rnabloom.longreads.corrected.e4.min_q1.fa
-rw-rw-r--. 1 ssen _ 255M Feb 26 15:43 rnabloom.longreads.corrected.e4.q1_med.fa
-rw-rw-r--. 1 ssen _ 2.1G Feb 26 15:43 rnabloom.longreads.corrected.e4.q3_max.fa
-rw-rw-r--. 1 ssen _ 914M Feb 26 15:43 rnabloom.longreads.corrected.e5.med_q3.fa
-rw-rw-r--. 1 ssen _ 193M Feb 26 15:43 rnabloom.longreads.corrected.e5.min_q1.fa
-rw-rw-r--. 1 ssen _ 461M Feb 26 15:43 rnabloom.longreads.corrected.e5.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.9G Feb 26 15:43 rnabloom.longreads.corrected.e5.q3_max.fa
-rw-rw-r--. 1 ssen _ 34 Feb 26 10:56 rnabloom.ntcard.readslist.txt
-rw-rw-r--. 1 ssen _ 83 Feb 26 10:59 STARTED
(I can open another ticket to keep this thread clean if that's how you want to manage the topics...)
It's fine, no need to open another issue. :)
The final output file for your assembly should be rnabloom.transcripts.fa.
Those empty files with names in capital letters (ie. DBG.DONE, LONGREADS.ASSEMBLED, LONGREADS.CLUSTERED, LONGREADS.CORRECTED) are completion stamps for the different stages of the assembly. For example, LONGREADS.ASSEMBLED would indicate that everything has completed successfully.
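A small sketch for checking which of these stamps exist in an output directory. The stamp names are the ones listed above; the helper name is made up, and not every stamp necessarily appears in every run mode, so the function only reports what it finds.

```python
import tempfile
from pathlib import Path

# Stamp file names as listed in the reply above.
STAMPS = ["DBG.DONE", "LONGREADS.CORRECTED", "LONGREADS.CLUSTERED", "LONGREADS.ASSEMBLED"]


def stage_status(outdir):
    """Map each completion stamp name to whether it exists in outdir."""
    root = Path(outdir)
    return {name: (root / name).is_file() for name in STAMPS}


if __name__ == "__main__":
    # Demo on a throwaway directory holding a single stamp.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "LONGREADS.CORRECTED").touch()
        for name, done in stage_status(tmp).items():
            print(f"{name}: {'present' if done else 'missing'}")
```

Pointing `stage_status` at the `-outdir` path of a real run shows at a glance how far the assembly got.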
Would you be able to find out the peak rss (memory usage) for your cluster job?
Hi Ka Ming,
job name -
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20170408 RNA_bloom+ BioCompute xulab 32 COMPLETED 0:0
So according to slurm, the job was completed..
I was able to retrieve these stats using the "seff" command in slurm -
Job ID: 20170408
Cluster: lewis4
User/Group: ssen/ssen
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 23-03:15:26
CPU Efficiency: 57.88% of 39-23:15:12 core-walltime
Memory Utilized: 275.45 GB
Memory Efficiency: 57.39% of 480.00 GB
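The seff figures above can be cross-checked with a few lines of arithmetic. The sketch below parses Slurm's D-HH:MM:SS durations and recomputes both efficiency percentages; the helper names are made up for illustration.

```python
def slurm_duration_seconds(text: str) -> int:
    """Convert a Slurm [D-]HH:MM:SS duration to seconds."""
    days, _, clock = text.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def percent(used: float, available: float) -> float:
    """Utilisation as a percentage, rounded to two decimals."""
    return round(100 * used / available, 2)


if __name__ == "__main__":
    cpu_used = slurm_duration_seconds("23-03:15:26")
    core_walltime = slurm_duration_seconds("39-23:15:12")
    print("CPU efficiency:", percent(cpu_used, core_walltime), "%")
    print("Memory efficiency:", percent(275.45, 480.00), "%")
    # Core-walltime divided by the 32 cores gives the elapsed wall time.
    print("Wall time (h):", round(core_walltime / 32 / 3600, 2))
```

Both percentages come out as reported (57.88% and 57.39%), and the implied wall time of roughly 30 hours is in line with the "Stage 4 completed" runtime in the earlier log. The 275.45 GB "Memory Utilized" value is the figure relevant to the peak memory question.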
Unfortunately, I do not see any of the file names with the capital letters except for LONGREADS.CORRECTED. No rnabloom.transcripts.fa either - assuming this would be in the same output folder as specified.
Any clues as to what might have happened?
Thanks & Regards -- Sidharth Sen
Thanks!
That's weird. Can you please check your output directory at the top of your cluster log? For example:
RNA-Bloom v1.2.3
args: [-t, 12, -fpr, 0.005, -outdir, /path/to/my/outdir, -long, /path/to/my/reads.fa.gz]
name: rnabloom
outdir: /path/to/my/outdir
...
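A minimal sketch for pulling that `outdir:` line out of a saved log, assuming the header layout shown in the example above. The function name is made up for illustration.

```python
import re


def outdir_from_log(log_text: str):
    """Return the value of the first 'outdir:' header line, or None if absent."""
    match = re.search(r"^\s*outdir:\s*(\S.*?)\s*$", log_text, flags=re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    header = (
        "RNA-Bloom v1.2.3\n"
        "args: [-t, 12, -fpr, 0.005, -outdir, /path/to/my/outdir, -long, /path/to/my/reads.fa.gz]\n"
        "name: rnabloom\n"
        "outdir: /path/to/my/outdir\n"
    )
    print(outdir_from_log(header))
```

Comparing the returned path with the directory being inspected is a quick way to rule out looking at the output of a different run.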
Oh Darn it! Wrong output folder...good catch! sorry about that..
These are the correct output files -
-rw-rw-r--. 1 ssen _ 0 Feb 28 04:36 LONGREADS.ASSEMBLED
-rw-rw-r--. 1 ssen _ 0 Feb 25 05:08 LONGREADS.CLUSTERED
-rw-rw-r--. 1 ssen _ 0 Feb 22 16:54 LONGREADS.CORRECTED
-rw-rw-r--. 1 ssen _ 510K Feb 22 12:45 rnabloom_k17.hist
drwxrwsr-x. 2 ssen _ 1.6M Feb 27 14:47 rnabloom.longreads.assembly
drwxrwsr-x. 2 ssen _ 792K Feb 25 05:08 rnabloom.longreads.clusters
-rw-rw-r--. 1 ssen _ 64M Feb 22 16:54 rnabloom.longreads.corrected.e0.med_q3.fa
-rw-rw-r--. 1 ssen _ 9.2M Feb 22 16:54 rnabloom.longreads.corrected.e0.min_q1.fa
-rw-rw-r--. 1 ssen _ 26M Feb 22 16:54 rnabloom.longreads.corrected.e0.q1_med.fa
-rw-rw-r--. 1 ssen _ 268M Feb 22 16:54 rnabloom.longreads.corrected.e0.q3_max.fa
-rw-rw-r--. 1 ssen _ 116M Feb 22 16:54 rnabloom.longreads.corrected.e1.med_q3.fa
-rw-rw-r--. 1 ssen _ 11M Feb 22 16:54 rnabloom.longreads.corrected.e1.min_q1.fa
-rw-rw-r--. 1 ssen _ 39M Feb 22 16:54 rnabloom.longreads.corrected.e1.q1_med.fa
-rw-rw-r--. 1 ssen _ 622M Feb 22 16:54 rnabloom.longreads.corrected.e1.q3_max.fa
-rw-rw-r--. 1 ssen _ 345M Feb 22 16:54 rnabloom.longreads.corrected.e2.med_q3.fa
-rw-rw-r--. 1 ssen _ 45M Feb 22 16:54 rnabloom.longreads.corrected.e2.min_q1.fa
-rw-rw-r--. 1 ssen _ 137M Feb 22 16:54 rnabloom.longreads.corrected.e2.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.5G Feb 22 16:54 rnabloom.longreads.corrected.e2.q3_max.fa
-rw-rw-r--. 1 ssen _ 525M Feb 22 16:54 rnabloom.longreads.corrected.e3.med_q3.fa
-rw-rw-r--. 1 ssen _ 100M Feb 22 16:54 rnabloom.longreads.corrected.e3.min_q1.fa
-rw-rw-r--. 1 ssen _ 244M Feb 22 16:54 rnabloom.longreads.corrected.e3.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.7G Feb 22 16:54 rnabloom.longreads.corrected.e3.q3_max.fa
-rw-rw-r--. 1 ssen _ 317M Feb 22 16:54 rnabloom.longreads.corrected.e4.med_q3.fa
-rw-rw-r--. 1 ssen _ 111M Feb 22 16:54 rnabloom.longreads.corrected.e4.min_q1.fa
-rw-rw-r--. 1 ssen _ 194M Feb 22 16:54 rnabloom.longreads.corrected.e4.q1_med.fa
-rw-rw-r--. 1 ssen _ 604M Feb 22 16:54 rnabloom.longreads.corrected.e4.q3_max.fa
-rw-rw-r--. 1 ssen _ 163M Feb 22 16:54 rnabloom.longreads.corrected.e5.med_q3.fa
-rw-rw-r--. 1 ssen _ 109M Feb 22 16:54 rnabloom.longreads.corrected.e5.min_q1.fa
-rw-rw-r--. 1 ssen _ 152M Feb 22 16:54 rnabloom.longreads.corrected.e5.q1_med.fa
-rw-rw-r--. 1 ssen _ 130M Feb 22 16:54 rnabloom.longreads.corrected.e5.q3_max.fa
-rw-rw-r--. 1 ssen _ 34 Feb 26 22:38 rnabloom.ntcard.readslist.txt
-rw-rw-r--. 1 ssen _ 5.7G Feb 28 04:36 rnabloom.transcripts.fa
-rw-rw-r--. 1 ssen _ 84 Feb 26 22:38 STARTED
Yes, this looks better. The output assembly is at rnabloom.transcripts.fa
Hello, I have a question about the use case of RNA-Bloom. I have some PacBio CCS reads and nanopore long reads for a certain maize genotype, which I am using for a de novo transcriptome assembly. I tried to do so but hit an "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space" error. I did try to increase/decrease the heap size and play around with memory settings on our cluster (it uses a Slurm job manager).
I wonder if I am doing something wrong - in fact, I am not sure whether I can use only long reads, or whether I need RNA-seq as well?