A-pcd commented 1 month ago

I have an RNA seq that I need to assemble from PacBio and when I send the script the program do not pass the stage 2. I work with PACbio long read Hifi sequence that end up being all around 12 millions reads in Java openjdk version "22.0.1-internal" 2024-04-16. Here is the script I’m sending :

rnabloom -lrpb -stranded -long Denovo.fa -t 5 -stranded -outdir work/DEC1S -k 15 RNA-Bloom v2.0.1 args: [-lrpb, -long, Denovo.fa, -stranded, -t, 5, -outdir, /work /DEC1S, -k, 15]

name: rnabloom outdir: /work/DEC1S

Turning on option -ntcard to count k-mers K-mer counting with ntCard... Parsing histogram file /work/ DEC1S/rnabloom_k15.hist... Unique k-mers (k=15): 533,165,901 Unique k-mers (k=15,c>1): 529,810,636 K-mer counting completed in 0.156s

Bloom filters Memory (GB)

de Bruijn graph: 1.1782151 k-mer counting: 9.366405

Total: 10.54462

Stage 1: Construct graph from reads (k=15) Parsing Denovo.fa... Parsed 12,120,462 sequences in 1h 21m 4s DBG Bloom filter FPR: 3.55 % Counting Bloom filter FPR: 3.38 % Stage 1 completed in 1h 21m 16s

Stage 2: Correct long reads for "rnabloom" Parsing Denovo.fa... Corrected Read Lengths Sampling Distribution (n=10000) min q1 med q3 max 47 1273 1780 2386 13187 Parsed 12,120,462 sequences. Kept: 12,119,193 (100.0 %) Discarded: 1,269 (0.0105 %) Artifacts: 42,136 (0.34764352%) Corrected reads in 11h 7m 33s Extracting seed sequences... strobemers: n=3, k=11, wMin=12, wMax=61, depth=3 Bloom filter FPR: 84.6 % before: 12,542,091 after: 5,893,016 (47.0 %) too short: 0 Extraction completed in 1h 25m 13s Stage 2 completed in 12h 32m 47s

Stage 3: Assemble long reads for "rnabloom" Overlapping sequences... Parsed 0 overlap records in 4h 7m 4s total reads: 5,893,016

unique: 0 (0.0 %)

multi-seg: 0 Unique reads extracted in 1m 36s ERROR: Error extracting unique reads! ERROR: Error assembling long reads!

When I look at the 13 files that comes out:

LONGREADS.CORRECTED rnabloom.longreads.assembly1.nr.fa.gz.log rnabloom.longreads.corrected.polya.txt.gz STARTED rnabloom_k15.hist rnabloom.longreads.corrected.long.fa.gz rnabloom.longreads.corrected.repeats.fa.gz rnabloom_k15.hist.log rnabloom.longreads.corrected.long.lengths.txt rnabloom.longreads.corrected.short.fa.gz rnabloom.longreads.assembly1.nr.fa.gz rnabloom.longreads.corrected.long.seed.fa.gz rnabloom.ntcard.readslist.txt

The rnabloom.longreads.assembly1.nr.fa.gz is empty and the .log show me /

[M::mm_idx_gen::142.3652.34] collected minimizers [M::mm_idx_gen::183.8832.68] sorted minimizers [M::main::183.8832.68] loaded/built the index for 2647016 target sequence(s) [M::mm_mapopt_update::191.3482.61] mid_occ = 1793 [M::mm_idx_stat] kmer size: 19; skip: 5; is_hpc: 1; #seq: 2647016 [M::mm_idx_stat::194.689*2.59] distinct minimizers: 292680490 (35.10% are singletons); average occurrences: 6.355; average spacing: 4.301; total length: 8000271109

I still have more than 5 millions reads I for sure should be able to have some overlaps. Do you have any idea what could go wrong ?

kmnip commented 1 month ago

I think your machine did not have enough memory to overlap the reads with minimap2 in stage 3.

Also, I don't think PacBio Hifi reads are strand-specific. So, you shouldn't use the -strand option, and you actually specified the option twice. And, I would not change the k-mer size either because the Bloom filter false-positive rates are very high in stages 1 and 2.

So, please revise your command for RNA-Bloom to:

rnabloom -lrpb -long Denovo.fa -t 5 -outdir work/DEC1S

Please remove the output files and try the assembly again.

A-pcd commented 1 month ago

It do not even pass stage 1 when the k-mers size is not specified, it actually do not work with any k-mers above 15

RNA-Bloom v2.0.1 args: [-lrpb, -long, Denovo.fa, -t, 5, -outdir, /work/ /DEC1]

name: rnabloom outdir: /work/DEC1 WARNING: Output directory does not exist! Created output directory at /work/ DEC1

Turning on option -ntcard to count k-mers

K-mer counting with ntCard... Running command: ntcard -t 5 -k 35 -c 65535 -p /work/ DEC1/rnabloom @/work/DEC1/rnabloom.ntcard.readslist.txt... Parsing histogram file /work/DEC1/rnabloom_k35.hist... Unique k-mers (k=35): 27,673,140,848 Unique k-mers (k=35,c>1): 1,061,710,819 K-mer counting completed in 6m 43s

Bloom filters Memory (GB)

de Bruijn graph: 61.153416 k-mer counting: 18.76975

Total: 79.923164

Stage 1: Construct graph from reads (k=35) Exception in thread "main" java.lang.OutOfMemoryError: Unable to allocate 20153865520 bytes at java.base/jdk.internal.misc.Unsafe.allocateMemory(Unsafe.java:632) at jdk.unsupported/sun.misc.Unsafe.allocateMemory(Unsafe.java:462) at rnabloom.bloom.buffer.UnsafeByteBuffer.(UnsafeByteBuffer.java:44) at rnabloom.bloom.CountingBloomFilter.(CountingBloomFilter.java:52) at rnabloom.graph.BloomFilterDeBruijnGraph.(BloomFilterDeBruijnGraph.java:98) at rnabloom.RNABloom.initializeGraph(RNABloom.java:994) at rnabloom.RNABloom.main(RNABloom.java:7126)

kmnip commented 1 month ago

Exception in thread "main" java.lang.OutOfMemoryError: Unable to allocate 20153865520 bytes

Your machine does not have enough memory to do the assembly.

The previous assembly ran further with a smaller k-mer size (i.e. 15) because the Bloom filters required less memory. However, k=15 was not the appropriate k-mer size to use. Please use the parameters I recommended on a machine with more RAM. For your dataset, it would probably need ~100GB RAM.

bcgsc / RNA-Bloom

Error assembling long read PacBio, no overlaps #78

Bloom filters Memory (GB)

de Bruijn graph: 1.1782151 k-mer counting: 9.366405

Bloom filters Memory (GB)

de Bruijn graph: 61.153416 k-mer counting: 18.76975