bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
96 stars 7 forks source link

Q. Error in extracting unique reads at Stage 3 (Assembling long reads for "rnabloom") #56

Closed alexander-ahn closed 1 year ago

alexander-ahn commented 1 year ago

Hello, I can't seem to go past stage 2 whenever I try to use RNA-Bloom2 with ONT long-read RNA-Seq data.

RNA-Bloom2 Input CMD:

#n.b.1, `.` directory contains combined reads of ONT long-read RNA-Seq FASTQ files and output directory "rnabloom_assembly"
#n.b.2, `ntCard` and `racon` were both present in /my_path/ environment to run RNA-Bloom2
$ java -jar /my_path/RNA-Bloom_v2.0.1/RNA-Bloom.jar -version
RNA-Bloom v2.0.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018-present
$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
$ java -jar /my_path/RNA-Bloom_v2.0.1/RNA-Bloom.jar -long ./*.fastq -t 36 -outdir ./rnabloom_assembly/

CMD log & error prompts:

name:   rnabloom
outdir: ./rnabloom_assembly/

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 36 -k 25 -c 65535 -p ./rnabloom_assembly//rnabloom @./rnabloom_assembly//rnabloom.ntcard.readslist.txt`...
Parsing histogram file `./rnabloom_assembly//rnabloom_k25.hist`...
Unique k-mers (k=25):     693,428,654
Unique k-mers (k=25,c>1): 98,918,078
K-mer counting completed in 6.905s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       1.5323714
k-mer counting:        1.7487507
====================================
Total:                 3.2811222

> Stage 1: Construct graph from reads (k=25)
Parsing `./pass_barcode01_0.fastq`...
Parsed 4,000 sequences in 0.467s
Parsing `./pass_barcode01_1.fastq`...
Parsed 4,000 sequences in 0.451s
Parsing `./pass_barcode01_2.fastq`...
Parsed 4,000 sequences in 0.147s
... etc ...
Parsed 6,183,653 sequences from 1552 files.
DBG Bloom filter FPR:                 0.721 %
Counting Bloom filter FPR:            0.8 %
> Stage 1 completed in 4m 22s

> Stage 2: Correct long reads for "rnabloom"
Parsing `./pass_barcode01_0.fastq`...
Parsing `./pass_barcode01_1.fastq`...
Parsing `./pass_barcode01_2.fastq`...
Corrected Read Lengths Sampling Distribution (n=10000)
    min q1  med q3  max
    46  215 298 462 2916
Parsing `./pass_barcode01_3.fastq`...
Parsing `./pass_barcode01_4.fastq`...
Parsing `./pass_barcode01_5.fastq`...
... etc ...
Parsed 6,183,653 sequences.
    Kept:      6,181,644    (100.0 %)
    Discarded: 2,009    (0.0325 %)
    Artifacts: 23,076   (0.37317747%)
Corrected reads in 3m 51s
Extracting seed sequences...
strobemers: n=3, k=11, wMin=12, wMax=61, depth=3
Bloom filter FPR:   5.51 %
before: 6,073,495   after: 1,485,544 (24.5 %)
too short: 0
Extraction completed in 11m 14s
> Stage 2 completed in 15m 6s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 0 overlap records in 0.003s
total reads:    1,485,544
 - unique:      0   (0.0 %)
   - multi-seg: 0
Unique reads extracted in 5.006s
ERROR: Error extracting unique reads!
ERROR: Error assembling long reads!

output directory:

$ ll rnabloom_assembly/
total 1193474
-rw-rw-r-- 1 ~ ~~         0 Aug 25 09:16 LONGREADS.CORRECTED
-rw-rw-r-- 1 ~ ~~    517516 Aug 25 08:55 rnabloom_k25.hist
-rw-rw-r-- 1 ~ ~~        21 Aug 25 08:55 rnabloom_k25.hist.log
-rw-rw-r-- 1 ~ ~~        20 Aug 25 09:19 rnabloom.longreads.assembly1.nr.fa.gz
-rw-rw-r-- 1 ~ ~~        31 Aug 25 09:19 rnabloom.longreads.assembly1.nr.fa.gz.log
-rw-rw-r-- 1 ~ ~~ 956597733 Aug 25 09:04 rnabloom.longreads.corrected.long.fa.gz
-rw-rw-r-- 1 ~ ~~     40421 Aug 25 09:04 rnabloom.longreads.corrected.long.lengths.txt
-rw-rw-r-- 1 ~ ~~ 253420242 Aug 25 09:16 rnabloom.longreads.corrected.long.seed.fa.gz
-rw-rw-r-- 1 ~ ~~   4358775 Aug 25 09:04 rnabloom.longreads.corrected.polya.txt.gz
-rw-rw-r-- 1 ~ ~~    172811 Aug 25 09:04 rnabloom.longreads.corrected.repeats.fa.gz
-rw-rw-r-- 1 ~ ~~   6566759 Aug 25 09:04 rnabloom.longreads.corrected.short.fa.gz
-rw-rw-r-- 1 ~ ~~     82849 Aug 25 09:19 rnabloom.ntcard.readslist.txt
-rw-rw-r-- 1 ~ ~~     82889 Aug 25 09:19 STARTED

Thanks.

kmnip commented 1 year ago
> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
Parsed 0 overlap records in 0.003s

That overlapping step in stage 3 shouldn't run this quickly. Can you check the content of the file rnabloom.longreads.assembly1.nr.fa.gz.log?

Also, what kind of transcriptome are you assembling? The reads seem very short.

alexander-ahn commented 1 year ago

The file simply says:

[ERROR] unknown option in "-e"

It's supposedly untargeted poly(A)-containing mRNA transcriptome from non-human primate samples.

kmnip commented 1 year ago

I think you could be using an old version of minimap2 that does not support the -e option. If that is the case, please update to the latest version.

alexander-ahn commented 1 year ago

Yes, that was indeed the problem (accidentally used minimap2.14 instead of >=2.22). Thanks once again!