bcgsc / RNA-Bloom

Reference-free transcriptome assembly for short and long reads

Problem parsing fasta file #49

Closed JC-therea closed 1 year ago

JC-therea commented 1 year ago

Dear author of RNA-Bloom

I am using your software to assemble direct RNA reads for several species; however, I am getting errors for some of them.

Input file and command:

rnabloom -long $READS -stranded -t 8 -outdir $OUTDIR

The output that I get is the following:

RNA-Bloom v2.0.0
args: [-long, Input.fa, -stranded, -t, 8, -outdir, Output]

name:   rnabloom
outdir: Output

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Running command: `ntcard -t 8 -k 25 -c 65535 -p Output/rnabloom @Output/rnabloom.ntcard.readslist.txt`...
Parsing histogram file `Output/rnabloom_k25.hist`...
Unique k-mers (k=25):     57,234,431
Unique k-mers (k=25,c>1): 11,105,972
K-mer counting completed in 21.059s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       0.12647936
k-mer counting:        0.19634001
====================================
Total:                 0.32281935

> Stage 1: Construct graph from reads (k=25)
Parsing `Input.fa`...
Parsed 477,740 sequences in 1m 0s
DBG Bloom filter FPR:                 1.06 %
Counting Bloom filter FPR:            1.17 %
> Stage 1 completed in 1m 1s

> Stage 2: Correct long reads for "rnabloom"
Parsing `Input.fa`...
Index -1 out of bounds for length 4
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
    at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
    at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
    at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
    at java.base/java.lang.Thread.run(Thread.java:834)
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
    at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:619)
    at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
    at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
    at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
null
java.lang.ArrayIndexOutOfBoundsException
Index -1 out of bounds for length 4
java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 4
    at rnabloom.util.SeqUtils.isLowComplexityLong(SeqUtils.java:605)
    at rnabloom.util.SeqUtils.trimLowComplexityRegions(SeqUtils.java:848)
    at rnabloom.RNABloom$LongReadCorrectionWorker.run(RNABloom.java:3791)
    at java.base/java.lang.Thread.run(Thread.java:834)
null
java.lang.ArrayIndexOutOfBoundsException
Corrected Read Lengths Sampling Distribution (n=4528)
    min q1  med q3  max
    239 776 1112    1635    5315
ERROR: null
java.lang.ArrayIndexOutOfBoundsException

Program version:

RNA-Bloom v2.0.0
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS)
OpenJDK 64-Bit Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Any help that you can provide would be appreciated.

kmnip commented 1 year ago

Hi @JC-therea,

Thanks for reporting this!

Your read file contains N characters in the sequences, and RNA-Bloom currently doesn't work with reads containing non-ACGT characters.
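
For anyone who wants to confirm this diagnosis on their own input before re-running, here is a minimal Python sketch that lists the reads containing any character other than A, C, G or T. It is not part of RNA-Bloom; the function names `read_fasta` and `find_non_acgt_reads` are made up for this illustration, and the input is assumed to be a plain (uncompressed) FASTA file.

```python
import io
import re

# Anything that is not A, C, G or T (either case) counts as unsupported here.
NON_ACGT = re.compile(r"[^ACGTacgt]")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_non_acgt_reads(handle):
    """Return the headers of reads whose sequence has a non-ACGT character."""
    return [h for h, s in read_fasta(handle) if NON_ACGT.search(s)]


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use open("Input.fa") instead.
    demo = io.StringIO(">read1\nACGTACGT\n>read2\nACGNNACG\n")
    print(find_non_acgt_reads(demo))  # ['read2']
```

An empty result would suggest the crash has a different cause than the one described here.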

A temporary solution is to simply cut your reads at Ns. You can do so easily with seqtk, e.g.

seqtk cutN -n 1 input.fa > input.noN.fa

RNA-Bloom should work fine when provided with these reads.
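
To make clear what "cutting reads at Ns" does, here is a rough Python sketch of the idea. It is only an illustration and not a reimplementation of seqtk: the function name `cut_at_n`, the `min_len` parameter, and the `header:start-end` naming of the pieces are all assumptions chosen for this example.

```python
import re

# A maximal run of characters that are not N (either case).
NON_N_RUN = re.compile(r"[^Nn]+")


def cut_at_n(header, seq, min_len=1):
    """Split seq at every run of N and yield (new_header, piece) pairs.

    Pieces shorter than min_len are dropped. Coordinates in the new header
    are 1-based and inclusive (a naming scheme invented for this sketch).
    """
    for match in NON_N_RUN.finditer(seq):
        piece = match.group()
        if len(piece) >= min_len:
            yield f"{header}:{match.start() + 1}-{match.end()}", piece


if __name__ == "__main__":
    for name, piece in cut_at_n("read2", "ACGNNACGTNA"):
        print(f">{name}\n{piece}")
```

Setting `min_len` to the k-mer size (k=25 in the log above) would drop pieces too short to contain a single k-mer; whether that filtering is desirable is a judgment call and not something suggested in this thread.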

I will add support for N-containing reads in a future release.

Hope that helps,
Ka Ming

kmnip commented 1 year ago

I am a bit curious about how these N characters arose. Did you pre-process your raw reads, e.g. by masking bases with low quality scores?

JC-therea commented 1 year ago

Thank you very much @kmnip! I did not expect to have Ns in my reads either... The direct RNA reads were merged with Illumina reads (with fmlrc), and after that I applied a second error-correction method with TranscriptClean. The Ns were probably introduced in that step, because some of the genomes that I am using are hard-masked...

Anyway, thank you very much for noticing that; you helped me a lot!