bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

RNA-Bloom Generates Empty FASTA Without Error #48

Closed schorlton closed 1 year ago

schorlton commented 1 year ago

As per title. Input file: test.fastq.gz

Command:

rnabloom -t 2 -outdir test_out -long test.fastq -ntcard

Should it again report that there is too little input data? Big thanks for all of your help!!



RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

kmnip commented 1 year ago

Thanks for reporting this! Yes, this happens when there are too few reads.

kmnip commented 1 year ago

I was able to replicate this, but this is not a bug. The assembled sequences are too short and they all end up in rnabloom.transcripts.short.fa (instead of rnabloom.transcripts.fa).

I have added a warning message for this scenario. The changes will be incorporated in the next release!
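Until that release is available, the same situation can be detected after a run with a small check. The following is a minimal sketch, not RNA-Bloom's own warning logic: the two file names come from this thread, while `count_fasta_records` and `check_output` are hypothetical helper names introduced only for illustration.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records (lines starting with '>'); return 0 if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return 0
    with p.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_output(outdir):
    """Return a warning string if only short transcripts were written, else None.

    Assumes the output directory uses the file names mentioned in this thread.
    """
    outdir = Path(outdir)
    n_long = count_fasta_records(outdir / "rnabloom.transcripts.fa")
    n_short = count_fasta_records(outdir / "rnabloom.transcripts.short.fa")
    if n_long == 0 and n_short > 0:
        return (
            f"rnabloom.transcripts.fa is empty, but {n_short} sequence(s) "
            "are in rnabloom.transcripts.short.fa"
        )
    return None
```

For the command in the original report, this would be called as `check_output("test_out")`.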

schorlton commented 1 year ago

What is the difference between these files besides above/below length threshold? Is there evidence that the longer transcripts are better supported/higher quality?

kmnip commented 1 year ago

Not at all. The length threshold is the only determining factor for assigning sequences to these two files.

schorlton commented 1 year ago

> Not at all. The length threshold is the only determining factor for assigning sequences to these two files.

Cool. If that's the case, why separate the files at all? Why not have a single assembly output file, with an optional param to filter contigs shorter than x length, with default x=0?

kmnip commented 1 year ago

There is already an option for that (i.e. -length) and its default value is 200, which is what separates the sequences in the two files. All RNA-seq assemblers I can think of have a similar length cutoff option and its default is 100~200 nt. It is not set to zero because very short sequences can potentially be noise.
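To illustrate what a length cutoff like this does, here is a minimal sketch that partitions sequences into "long" and "short" groups. It is not RNA-Bloom's implementation: `split_by_length` is a hypothetical name, and treating a sequence of exactly the cutoff length as "long" is an assumption, since the thread does not say whether the boundary is inclusive.

```python
def split_by_length(records, min_length=200):
    """Partition (name, sequence) pairs by sequence length.

    Sequences with len >= min_length go to the first list, the rest to the second.
    The inclusive boundary is an assumption, not confirmed behavior of RNA-Bloom.
    """
    long_records, short_records = [], []
    for name, seq in records:
        if len(seq) >= min_length:
            long_records.append((name, seq))
        else:
            short_records.append((name, seq))
    return long_records, short_records


# Example: three made-up sequences around the default cutoff of 200 nt.
records = [("a", "A" * 199), ("b", "A" * 200), ("c", "A" * 500)]
long_records, short_records = split_by_length(records)
print([name for name, _ in long_records])   # names of sequences at or above the cutoff
print([name for name, _ in short_records])  # names of sequences below the cutoff
```

Calling `split_by_length(records, min_length=0)` would place every sequence in the first list, which corresponds to the single-file behavior asked about above.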

schorlton commented 1 year ago

Thanks for explaining. Contrary to your earlier answer then, it does sound like there is evidence that the longer transcripts are likely higher quality. I guess a warning message will suffice if the non-short transcripts file is empty. Thanks again!

kmnip commented 1 year ago

Sorry, I thought you were asking whether RNA-Bloom uses any evidence to determine that threshold.