bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

Question: De-novo assembly of long reads ONLY - Java memory error #2

Closed sid5427 closed 4 years ago

sid5427 commented 4 years ago

Hello, I have a question about the use case of RNA-Bloom. I have some PacBio CCS reads and nanopore long reads for a certain maize genotype, which I am using for a de novo transcriptome assembly. I tried to run the assembly but was met with an "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space" error. I did try to increase/decrease the heap size and play around with the memory settings on our cluster (it uses the Slurm job manager).

I wonder if I am doing something wrong. In fact, I am not sure whether I can use long reads only, or do I need RNA-seq reads as well?

kmnip commented 4 years ago

Hi,

Thanks for using RNA-Bloom!

I have several questions for you that would help me troubleshoot the problem:

  1. How many CCS and nanopore reads are there in your sample?
  2. Please report your exact command for RNA-Bloom.
  3. Please report the last ~10 lines of log messages before the memory error.

The final stage of long-read assembly in RNA-Bloom can use quite a bit of memory for large samples (~6 million reads in my human direct RNA dataset). I do have a fix for this to be included in the next version, which is coming later this week.

When you encounter a Java heap space error, you would have to increase the JVM heap space (-Xmx) and increase the memory allocation for your slurm job.
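As a rough sketch of that advice, the Slurm allocation and the JVM heap can be raised together. The 128 GB allocation, the 90% heap fraction, and the file names below are illustrative assumptions, not values recommended by RNA-Bloom. The heap is kept somewhat below the allocation because the JVM also uses memory outside the heap.

```shell
# Illustrative values only; adjust to what a node on your cluster offers.
alloc_gb=128                       # memory requested from Slurm (assumed)
heap_gb=$(( alloc_gb * 9 / 10 ))   # leave ~10% headroom for non-heap JVM memory (assumed ratio)

# Assemble the command as a string so it can be inspected before launching.
cmd="srun --mem=${alloc_gb}G -c 8 java -Xmx${heap_gb}g -jar RNA-Bloom.jar -long nanopore_reads.fq -t 8 -outdir rnabloom_assembly"

# Print the command; paste it into a terminal or job script to actually run it.
echo "$cmd"
```

Without an explicit -Xmx, the JVM picks a default maximum heap that can be well below the Slurm allocation, so raising only the job memory may not be enough.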

You don't need to provide short RNA-seq reads for long read assembly, but the short reads certainly do help with the initial error correction stage.

To use both short reads and long reads:

java -jar RNA-Bloom.jar -left short_reads_1.fastq -right short_reads_2.fq -rcr -long nanopore_reads.fq ...

Thanks! Ka Ming

sid5427 commented 4 years ago

Hi Ka Ming, Thanks for replying -

I have 131,585 PacBio CCS reads and 8,656,693 nanopore reads (including smaller reads). So far, I have only tried running RNA-Bloom with the nanopore reads. Unfortunately, after checking my logs, it seems I ran it with all reads, including the smaller ones that I should have filtered out; I should have used only reads longer than 200bp (or maybe 1000bp+).
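One way to do the length filtering mentioned above is a small awk wrapper. This is a minimal sketch, assuming plain 4-line FASTQ records; the function name and the output file name are hypothetical.

```shell
# Keep only reads of at least $1 bases.
# Assumes plain 4-line FASTQ records (header, sequence, "+", qualities);
# multi-line FASTQ is not handled.
filter_fastq_min_len() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}

# Hypothetical usage:
# filter_fastq_min_len 200 < nanopore_file.fastq > nanopore_file.min200.fastq
```

The function reads from standard input and writes to standard output, so it can also sit in a pipe after a decompression step.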

Command used to run RNA-Bloom:

(srun --mem-per-cpu=16G --job-name=RNA_bloom -c 8 -p <partition> -A <labname> -t 2-00:00 \
java -jar ../RNA-Bloom.jar -long nanopore_file.fastq \
-ntcard -t 8 -outdir rnabloom_assembly) >& rnabloom_assembly/rnabloom_run_1.txt &

Last 20 lines of log file

Assembling cluster `21533`...
Assembling cluster `21534`...
Assembling cluster `21535`...
Assembling cluster `21536`...
Assembling cluster `21537`...
Overlapped sequences: 2,188,104
          - unique:   2,087,630
          - dovetail: 603,728
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.BitSet.initWords(BitSet.java:167)
        at java.base/java.util.BitSet.<init>(BitSet.java:162)
        at org.jgrapht.alg.TransitiveReduction.reduce(TransitiveReduction.java:145)
        at rnabloom.olc.Layout.layoutBackbones(Layout.java:480)
        at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:338)
        at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:169)
        at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:296)
        at rnabloom.RNABloom.assembleLongReads(RNABloom.java:2447)
        at rnabloom.RNABloom.assembleLongReads(RNABloom.java:3773)
        at rnabloom.RNABloom.main(RNABloom.java:5088)
srun: error: lewis4-r630-htc4-node285: task 0: Exited with exit code 1

I did try to increase the heap size, but apparently it is already set at the maximum for our cluster, or is probably not accessible to me. That is strange, as it should be user accessible; I have done it before on our lab servers and on my laptop.

Any help with this issue would be appreciated! ---Sidharth Sen

kmnip commented 4 years ago

Thanks for reporting these details.

The main issue is that one cluster (i.e. "21537") has a very large number of reads (about 2.19 million), which causes the memory usage to blow up. Adjusting the JVM heap size won't help.
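The stack trace fails while allocating a java.util.BitSet inside a transitive reduction step. As a rough illustration only, and assuming (this is not confirmed in the thread) that the step keeps one bit per ordered pair of sequences in the cluster, memory would grow with the square of the cluster size. The real graph may have fewer or more vertices than the read count used here.

```shell
# Back-of-the-envelope estimate, not a measurement.
# Assumption: a dense n x n bit matrix, i.e. one n-bit row per sequence.
# Needs 64-bit shell arithmetic, which is usual on Linux.
dense_bitmatrix_gib() {
    # n*n bits -> bytes -> GiB, using integer division
    echo $(( $1 * $1 / 8 / 1024 / 1024 / 1024 ))
}

# Overlapped sequences reported for cluster 21537 in the log above
dense_bitmatrix_gib 2188104
```

Under that assumption a single cluster of this size would need hundreds of GiB for the matrix alone, which is why the remedy has to come from how the cluster is assembled rather than from heap tuning.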

I will make a new release of RNA-Bloom by the end of this week. You can possibly try the new version then. I will keep you posted. Thank you!

kmnip commented 4 years ago

Hi @sid5427,

I have made a new release: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.2.2 Let me know if that helps with the memory issue.

Ka Ming

sid5427 commented 4 years ago

Hi Ka Ming, Awesome - I was going to try a run anyway after removing sequences shorter than 200bp. Now I'll update RNA-Bloom and report back the results. Thanks for the quick turnaround on this.

sid5427 commented 4 years ago

Hi Ka Ming, So I finally got back to running RNA-Bloom on my dataset. I used the updated version and also updated all the other dependencies just to be sure. Unfortunately, it seems to be hitting another issue, different from last time, I think.

I used the same command, settings and nanopore file (cDNA) as before. Please do note, the nanopore file is for a maize transcriptome, so quite large.

Last 20 lines of the log file (the last line repeats about 200 times):

> Stage 4: Assemble long reads for "rnabloom"
Total of 810306 clusters to be assembled
Assembling cluster `0`...
Overlapped sequences: 4,588,506
         - artifacts: 6
         - unique:    4,404,862
         - dovetail:  45,229
G: |V|=90,458 |E|=2,069,428
Exception in thread "main" java.lang.StackOverflowError
    at java.base/java.util.HashMap.putVal(HashMap.java:643)
    at java.base/java.util.HashMap.put(HashMap.java:607)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:304)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)
    at org.jgrapht.alg.connectivity.BiconnectivityInspector.dfs(BiconnectivityInspector.java:315)

Full log file to download: https://1drv.ms/t/s!Alr23pNlJf37isFR_8SobzqMiuA9sQ?e=HrHYcl

Any help with this is appreciated! --Sidharth Sen

kmnip commented 4 years ago

Hi Sidharth,

Thanks for reporting the error! I also encountered the same error with my plant dataset. I have a fix for it; please use the latest release: https://github.com/bcgsc/RNA-Bloom/releases/tag/v1.2.3

Thanks, Ka Ming

sid5427 commented 4 years ago

Hi Ka Ming, Awesome - I'll give it a try asap. I checked bioconda's repo site and it's not yet updated to the latest version. Could you please push it there as well? https://anaconda.org/bioconda/rnabloom

Thanks & Regards ----Sidharth Sen

kmnip commented 4 years ago

The automatic pull-request usually takes a while after a fresh release, but I confirm that it has been updated.

sid5427 commented 4 years ago

Hi Ka Ming, I can confirm it has updated as well, and I have already started a fresh run with the updated tool. Will report back how it goes. Thanks for your help --Sidharth Sen

SophieS9 commented 4 years ago

Hi Ka Ming and Sidharth,

I've been following this thread as I was also experiencing the same memory issue that Sidharth reported at the beginning of the thread when using RNA-Bloom version 1.2.0. I've tried increasing my heap size up to 1 TB and I've filtered out all of the small reads, but I'm still encountering this issue. Normally the software runs for about 8 days before dying with the following error (end of log file). I've checked the cluster usage and it is running out of memory:

 Assembling cluster `15241`...
Overlapped sequences: 2,823,951
          - unique:   2,341,740
          - dovetail: 3,410,262
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.BitSet.initWords(BitSet.java:167)
    at java.base/java.util.BitSet.<init>(BitSet.java:162)
    at org.jgrapht.alg.TransitiveReduction.reduce(TransitiveReduction.java:145)
    at rnabloom.olc.Layout.layoutBackbones(Layout.java:480)
    at rnabloom.olc.Layout.writeBackboneSequences(Layout.java:338)
    at rnabloom.olc.OverlapLayoutConcensus.layout(OverlapLayoutConcensus.java:169)
    at rnabloom.olc.OverlapLayoutConcensus.overlapLayoutConcensus(OverlapLayoutConcensus.java:296)
    at rnabloom.RNABloom.assembleLongReads(RNABloom.java:2447)
    at rnabloom.RNABloom.assembleLongReads(RNABloom.java:3773)
    at rnabloom.RNABloom.main(RNABloom.java:5088)

I've also now tried to run version 1.2.3 as suggested, but this is also failing. However, this fails after only 30 minutes and it's not running out of memory (only 6G used out of a possible 1T). As a note I'm using JDK version 12.0.1. Here is the log file:

RNA-Bloom v1.2.3
args: [-long, Basecalled_Final/Filter.fq.gz, -outdir, RNA_Bloom_v1.2.3/, -threads, 8]

name:   rnabloom
outdir: RNA_Bloom_v1.2.3/
Min k-mer coverage threshold: 3

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.8594979
k-mer counting:        3.4379916
====================================
Total:                 4.2974896

> Stage 1: Construct graph from reads (k=17)
[1] Parsing `Basecalled_Final/Filter.fq.gz`...
[1] Parsed 9,086,352 sequences.
Parsed 9,086,352 reads in total.
DBG Bloom filter FPR:                 8.421373 %
Counting Bloom filter FPR:            4.02473 %
WARNING: Bloom filter FPR is higher than the maximum allowed FPR (1.0%)!
Adjusting Bloom filter sizes...
ERROR: null
java.lang.NullPointerException
    at rnabloom.RNABloom.getOptimalBloomFilterSizes(RNABloom.java:786)
    at rnabloom.RNABloom.main(RNABloom.java:5589)

Any ideas!?

All the Best,

Sophie

kmnip commented 4 years ago

Hi Sophie,

The OutOfMemoryError and StackOverflowError both came from one module of the graph library I was using. I gave up on that module and I implemented my own solution in v1.2.3 that is faster and uses a lot less memory. So, you shouldn't have issues in that part of the assembly.

The NullPointerException is a bug. Thanks for reporting it, I will fix it in the next release.

To bypass the error, you can either set the appropriate Bloom filter size by using the -ntcard option, i.e.

-ntcard -long Basecalled_Final/Filter.fq.gz -outdir RNA_Bloom_v1.2.3/ -threads 8

or, if you don't have ntCard installed, simply allocate more memory for the Bloom filters, i.e.

-mem 16 -long Basecalled_Final/Filter.fq.gz -outdir RNA_Bloom_v1.2.3/ -threads 8
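The two workarounds can be combined into one small wrapper. This is only a sketch: it assumes the ntCard executable is called ntcard and is found on PATH, and it reuses the java -jar RNA-Bloom.jar invocation shown earlier in the thread.

```shell
# Choose the Bloom filter sizing option depending on whether ntCard is available.
# The executable name "ntcard" on PATH is an assumption about the local install.
if command -v ntcard >/dev/null 2>&1; then
    bloom_opt="-ntcard"     # let ntCard estimate k-mer counts to size the filters
else
    bloom_opt="-mem 16"     # fall back to a fixed 16 GB for the Bloom filters
fi

cmd="java -jar RNA-Bloom.jar ${bloom_opt} -long Basecalled_Final/Filter.fq.gz -outdir RNA_Bloom_v1.2.3/ -threads 8"

# Print the command for inspection before running it.
echo "$cmd"
```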

Let me know if that works for you.

Thanks! Ka Ming

sid5427 commented 4 years ago

Hi Ka Ming.

I have good news, the assembly finished and it terminated successfully. I can see different fasta output files as well.

Assembling cluster `810283`...
Assembling cluster `810284`...
Assembling cluster `810285`...
Assembling cluster `810286`...
Assembling cluster `810287`...
Assembling cluster `810288`...
Assembling cluster `810289`...
Assembling cluster `810290`...
Assembling cluster `810291`...
Assembling cluster `810292`...
Assembling cluster `810293`...
Assembling cluster `810294`...
Assembling cluster `810295`...
Assembling cluster `810296`...
Assembling cluster `810297`...
Assembling cluster `810298`...
Assembling cluster `810299`...
Assembling cluster `810300`...
Assembling cluster `810301`...
Assembling cluster `810302`...
Assembling cluster `810303`...
Assembling cluster `810304`...
Assembling cluster `810305`...
Inter-cluster assembly...
Overlapped sequences: 4,813,477
         - artifacts: 22
         - unique:    3,444,773
         - dovetail:  96,924
G: |V|=193,848 |E|=309,726
G: |V|=193,848 |E|=66,088
before: 8,252,969       after: 6,851,221
> Stage 4 completed in 29h 57m 49s
Total runtime: 29h 57m 49s

Sophie - I had allocated a large amount of RAM to my job (about 480 GB), maxing out the available memory on one node of our cluster. That seems to have helped, along with the fixes Ka Ming had pushed out earlier.

Ka Ming - since this is my first time working with RNA-Bloom, which output files should I be using for downstream analysis? Would it be something similar to what we would do for an RNA-seq assembly, followed by evaluation with BUSCO, rnaQUAST, hisat2, etc.?

These are the files that I have as output. I am a bit worried that the LONGREADS.CORRECTED file is empty and that there are no files inside the rnabloom.longreads.clusters directory.

-rw-rw-r--. 1 ssen _    0 Feb 26 15:43 LONGREADS.CORRECTED
-rw-rw-r--. 1 ssen _ 510K Feb 26 10:59 rnabloom_k17.hist
drwxrwsr-x. 2 ssen _    2 Feb 26 15:43 rnabloom.longreads.clusters
-rw-rw-r--. 1 ssen _  51M Feb 26 15:43 rnabloom.longreads.corrected.e0.med_q3.fa
-rw-rw-r--. 1 ssen _ 9.0M Feb 26 15:43 rnabloom.longreads.corrected.e0.min_q1.fa
-rw-rw-r--. 1 ssen _  23M Feb 26 15:43 rnabloom.longreads.corrected.e0.q1_med.fa
-rw-rw-r--. 1 ssen _ 161M Feb 26 15:43 rnabloom.longreads.corrected.e0.q3_max.fa
-rw-rw-r--. 1 ssen _ 8.7M Feb 26 15:43 rnabloom.longreads.corrected.e1.med_q3.fa
-rw-rw-r--. 1 ssen _ 3.1M Feb 26 15:43 rnabloom.longreads.corrected.e1.min_q1.fa
-rw-rw-r--. 1 ssen _ 5.5M Feb 26 15:43 rnabloom.longreads.corrected.e1.q1_med.fa
-rw-rw-r--. 1 ssen _  35M Feb 26 15:43 rnabloom.longreads.corrected.e1.q3_max.fa
-rw-rw-r--. 1 ssen _  12M Feb 26 15:43 rnabloom.longreads.corrected.e2.med_q3.fa
-rw-rw-r--. 1 ssen _ 5.0M Feb 26 15:43 rnabloom.longreads.corrected.e2.min_q1.fa
-rw-rw-r--. 1 ssen _ 6.8M Feb 26 15:43 rnabloom.longreads.corrected.e2.q1_med.fa
-rw-rw-r--. 1 ssen _  65M Feb 26 15:43 rnabloom.longreads.corrected.e2.q3_max.fa
-rw-rw-r--. 1 ssen _  79M Feb 26 15:43 rnabloom.longreads.corrected.e3.med_q3.fa
-rw-rw-r--. 1 ssen _  37M Feb 26 15:43 rnabloom.longreads.corrected.e3.min_q1.fa
-rw-rw-r--. 1 ssen _  51M Feb 26 15:43 rnabloom.longreads.corrected.e3.q1_med.fa
-rw-rw-r--. 1 ssen _ 589M Feb 26 15:43 rnabloom.longreads.corrected.e3.q3_max.fa
-rw-rw-r--. 1 ssen _ 465M Feb 26 15:43 rnabloom.longreads.corrected.e4.med_q3.fa
-rw-rw-r--. 1 ssen _ 134M Feb 26 15:43 rnabloom.longreads.corrected.e4.min_q1.fa
-rw-rw-r--. 1 ssen _ 255M Feb 26 15:43 rnabloom.longreads.corrected.e4.q1_med.fa
-rw-rw-r--. 1 ssen _ 2.1G Feb 26 15:43 rnabloom.longreads.corrected.e4.q3_max.fa
-rw-rw-r--. 1 ssen _ 914M Feb 26 15:43 rnabloom.longreads.corrected.e5.med_q3.fa
-rw-rw-r--. 1 ssen _ 193M Feb 26 15:43 rnabloom.longreads.corrected.e5.min_q1.fa
-rw-rw-r--. 1 ssen _ 461M Feb 26 15:43 rnabloom.longreads.corrected.e5.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.9G Feb 26 15:43 rnabloom.longreads.corrected.e5.q3_max.fa
-rw-rw-r--. 1 ssen _   34 Feb 26 10:56 rnabloom.ntcard.readslist.txt
-rw-rw-r--. 1 ssen _   83 Feb 26 10:59 STARTED

(I can open another ticket to keep this thread clean if that's how you want to manage the topics...)

kmnip commented 4 years ago

It's fine, no need to open another issue. :)

The final output file for your assembly should be rnabloom.transcripts.fa.

Those empty files with names in capital letters (i.e. DBG.DONE, LONGREADS.ASSEMBLED, LONGREADS.CLUSTERED, LONGREADS.CORRECTED) are completion stamps for the different stages of the assembly. For example, LONGREADS.ASSEMBLED would indicate that everything has completed successfully.
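A quick way to check a finished long-read run against those stamps is sketched below. The function name is hypothetical, and the transcript file name assumes the default run name "rnabloom" used in this thread.

```shell
# Report whether a long-read assembly in the given output directory has finished.
# Looks for the final completion stamp and a non-empty transcripts file.
check_rnabloom_done() {
    outdir=$1
    if [ -e "$outdir/LONGREADS.ASSEMBLED" ] && [ -s "$outdir/rnabloom.transcripts.fa" ]; then
        echo "complete"
    else
        echo "incomplete"
        return 1
    fi
}

# Hypothetical usage:
# check_rnabloom_done rnabloom_assembly
```

The stamps themselves are expected to be empty, so only the transcripts file is tested for a non-zero size.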

Would you be able to find out the peak RSS (memory usage) for your cluster job?

sid5427 commented 4 years ago

Hi Ka Ming,

job name - 
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
20170408     RNA_bloom+ BioCompute      xulab         32  COMPLETED      0:0

So according to Slurm, the job completed.

I was able to retrieve these stats using the "seff" command in slurm -

Job ID: 20170408
Cluster: lewis4
User/Group: ssen/ssen
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 32
CPU Utilized: 23-03:15:26
CPU Efficiency: 57.88% of 39-23:15:12 core-walltime
Memory Utilized: 275.45 GB
Memory Efficiency: 57.39% of 480.00 GB
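The two efficiency figures follow directly from the other numbers in this report, as the sketch below reproduces. The helper name is hypothetical, and it only handles durations written as D-HH:MM:SS, which is the form shown above.

```shell
# Convert a Slurm duration of the form D-HH:MM:SS to seconds.
# Durations without a day part (HH:MM:SS) are not handled by this sketch.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{ print (($1 * 24 + $2) * 60 + $3) * 60 + $4 }'
}

cpu_used=$(to_seconds 23-03:15:26)    # "CPU Utilized"
core_wall=$(to_seconds 39-23:15:12)   # core-walltime: 32 cores x elapsed time

# CPU efficiency = CPU time used / core-walltime
awk -v u="$cpu_used" -v w="$core_wall" 'BEGIN { printf "CPU efficiency: %.2f%%\n", 100 * u / w }'

# Memory efficiency = memory utilized / memory requested
awk 'BEGIN { printf "Memory efficiency: %.2f%%\n", 100 * 275.45 / 480.00 }'
```

The "Memory Utilized" figure (275.45 GB) is the number that answers the peak memory question for this job.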

Unfortunately, I do not see any of the file names in capital letters except for LONGREADS.CORRECTED. There is no rnabloom.transcripts.fa either, assuming it would be in the same output folder as specified.

Any clues as to what might have happened?

Thanks & Regards -- Sidharth Sen

kmnip commented 4 years ago

Thanks!

That's weird. Can you please check your output directory at the top of your cluster log? For example:

RNA-Bloom v1.2.3
args: [-t, 12, -fpr, 0.005, -outdir, /path/to/my/outdir, -long, /path/to/my/reads.fa.gz]

name:   rnabloom
outdir: /path/to/my/outdir
...
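A minimal sketch of that check, assuming the log contains an "outdir:" line as in the excerpt above. The function name is hypothetical, and paths containing spaces are not handled.

```shell
# Print the output directory recorded near the top of an RNA-Bloom log.
# Matches the first line whose first field is exactly "outdir:".
rnabloom_outdir_from_log() {
    awk '$1 == "outdir:" { print $2; exit }' "$1"
}

# Hypothetical usage with the log file written by the earlier srun command:
# ls -lh "$(rnabloom_outdir_from_log rnabloom_assembly/rnabloom_run_1.txt)"
```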

sid5427 commented 4 years ago

Oh, darn it! Wrong output folder... good catch! Sorry about that.

These are the correct output files -

-rw-rw-r--. 1 ssen _    0 Feb 28 04:36 LONGREADS.ASSEMBLED
-rw-rw-r--. 1 ssen _    0 Feb 25 05:08 LONGREADS.CLUSTERED
-rw-rw-r--. 1 ssen _    0 Feb 22 16:54 LONGREADS.CORRECTED
-rw-rw-r--. 1 ssen _ 510K Feb 22 12:45 rnabloom_k17.hist
drwxrwsr-x. 2 ssen _ 1.6M Feb 27 14:47 rnabloom.longreads.assembly
drwxrwsr-x. 2 ssen _ 792K Feb 25 05:08 rnabloom.longreads.clusters
-rw-rw-r--. 1 ssen _  64M Feb 22 16:54 rnabloom.longreads.corrected.e0.med_q3.fa
-rw-rw-r--. 1 ssen _ 9.2M Feb 22 16:54 rnabloom.longreads.corrected.e0.min_q1.fa
-rw-rw-r--. 1 ssen _  26M Feb 22 16:54 rnabloom.longreads.corrected.e0.q1_med.fa
-rw-rw-r--. 1 ssen _ 268M Feb 22 16:54 rnabloom.longreads.corrected.e0.q3_max.fa
-rw-rw-r--. 1 ssen _ 116M Feb 22 16:54 rnabloom.longreads.corrected.e1.med_q3.fa
-rw-rw-r--. 1 ssen _  11M Feb 22 16:54 rnabloom.longreads.corrected.e1.min_q1.fa
-rw-rw-r--. 1 ssen _  39M Feb 22 16:54 rnabloom.longreads.corrected.e1.q1_med.fa
-rw-rw-r--. 1 ssen _ 622M Feb 22 16:54 rnabloom.longreads.corrected.e1.q3_max.fa
-rw-rw-r--. 1 ssen _ 345M Feb 22 16:54 rnabloom.longreads.corrected.e2.med_q3.fa
-rw-rw-r--. 1 ssen _  45M Feb 22 16:54 rnabloom.longreads.corrected.e2.min_q1.fa
-rw-rw-r--. 1 ssen _ 137M Feb 22 16:54 rnabloom.longreads.corrected.e2.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.5G Feb 22 16:54 rnabloom.longreads.corrected.e2.q3_max.fa
-rw-rw-r--. 1 ssen _ 525M Feb 22 16:54 rnabloom.longreads.corrected.e3.med_q3.fa
-rw-rw-r--. 1 ssen _ 100M Feb 22 16:54 rnabloom.longreads.corrected.e3.min_q1.fa
-rw-rw-r--. 1 ssen _ 244M Feb 22 16:54 rnabloom.longreads.corrected.e3.q1_med.fa
-rw-rw-r--. 1 ssen _ 1.7G Feb 22 16:54 rnabloom.longreads.corrected.e3.q3_max.fa
-rw-rw-r--. 1 ssen _ 317M Feb 22 16:54 rnabloom.longreads.corrected.e4.med_q3.fa
-rw-rw-r--. 1 ssen _ 111M Feb 22 16:54 rnabloom.longreads.corrected.e4.min_q1.fa
-rw-rw-r--. 1 ssen _ 194M Feb 22 16:54 rnabloom.longreads.corrected.e4.q1_med.fa
-rw-rw-r--. 1 ssen _ 604M Feb 22 16:54 rnabloom.longreads.corrected.e4.q3_max.fa
-rw-rw-r--. 1 ssen _ 163M Feb 22 16:54 rnabloom.longreads.corrected.e5.med_q3.fa
-rw-rw-r--. 1 ssen _ 109M Feb 22 16:54 rnabloom.longreads.corrected.e5.min_q1.fa
-rw-rw-r--. 1 ssen _ 152M Feb 22 16:54 rnabloom.longreads.corrected.e5.q1_med.fa
-rw-rw-r--. 1 ssen _ 130M Feb 22 16:54 rnabloom.longreads.corrected.e5.q3_max.fa
-rw-rw-r--. 1 ssen _   34 Feb 26 22:38 rnabloom.ntcard.readslist.txt
-rw-rw-r--. 1 ssen _ 5.7G Feb 28 04:36 rnabloom.transcripts.fa
-rw-rw-r--. 1 ssen _   84 Feb 26 22:38 STARTED

kmnip commented 4 years ago

Yes, this looks better. The output assembly is at rnabloom.transcripts.fa