bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
85 stars 7 forks source link

c=null in output FASTA #18

Closed schorlton closed 1 year ago

schorlton commented 2 years ago

Thanks for the amazing software! Ran some cDNA long-read sequencing through assembly: rnabloom -long input.fastq -ntcard -t 8 -outdir assembled

It all succeeded; however, when looking at my output data, I see that c=null in some of the FASTA sequence headers. I understood that the c= tag was meant to include coverage, and most of theses tags look correct!

Example:

>3 l=228 c=null s=8
GCTGAAAGCATCTAAGTGTGAAACCCACCTCAAGATGAGATTTCCCATGATTTTATATCAGTAAGACTATCCTCAGTGGGAAATCTGTCTTGCCCTCCCTCCCGGGACCCCCCTAGGCCCGCCCCGGCATTTATCCCTTCCCCCCCGGCGGAACAACGAGTACGCCGGCGGTAAATCCACTCTGTCCTCTCGCGCAAAACGGATCGGCCTCCGCCCGCGACGGATAGA

Please report

test$ java -version
openjdk version "11.0.9.1-internal" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 11.0.9.1-internal+0-adhoc..src, mixed mode)

test$ rnabloom -v
RNA-Bloom v1.4.3
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018-present
kmnip commented 2 years ago

Hello @schorlton ,

Thanks for your interest in RNA-Bloom!

Yes, I am aware of this issue. Basically, these are singleton sequences that don't have any other reads assigned to it. In the next release, I will fix it so that it reports 0.0 instead of null. I am actually contemplating about removing these sequences altogether, depending on the threshold set for the -c option...

Thanks, Ka Ming

schorlton commented 2 years ago

Thanks for your quick response!

Would the coverage not be 1 if a single read led to this transcript? It would be odd to have coverage of 0...unless I misunderstand coverage in this context.

kmnip commented 2 years ago

Ideally speaking, you are correct; singleton sequences should have c=1. There is an issue with minimap2 where sequences with only high coverage minimizers in the reference are skipped. In this case, a sequence can have no reads mapped to it. There is an option in minimap2 that could be configured to potentially fix this issue. I am testing it now and see how it affects runtime and memory.

schorlton commented 2 years ago

I see. I don't know RNA-Bloom well enough, but yes - minimap2 is not designed (by default) for short sequences with high frequency minimizers. The -e flag may be what you're referring to, as it can recover more high frequency minimizers with fewer minimizer-free gaps. If the minimum transcript length is 200bp, you may need a really low -e to get 3 minimizers/transcript to form a chain...

Ebedthan commented 2 years ago

Hi, Same issue here.

kmnip commented 2 years ago

This will be fixed in the next release.

kmnip commented 1 year ago

This bug is fixed. Please see my new release of RNA-Bloom v2.0.0: https://github.com/bcgsc/RNA-Bloom/releases/tag/v2.0.0