bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

Transcript headers follow different formats #46

Closed schorlton closed 1 year ago

schorlton commented 2 years ago

Please report

I'm running RNA-Bloom indiscriminately on input files to see whether they assemble. I don't check the files beforehand because I want to leave it to RNA-Bloom to decide whether it can assemble anything. Interestingly, RNA-Bloom produces different FASTA header formats for different outputs.

Sometimes I get:

 >3 l=228 c=1.1 s=8

other times I get:

 >s1

Note that these are with different inputs. Is it possible to output the same header format each time? In the latter format, does coverage=1?

Thanks!!

RNA-Bloom v2.0.0

java --version
openjdk 17.0.3-internal 2022-04-19
OpenJDK Runtime Environment (build 17.0.3-internal+0-adhoc..src)
OpenJDK 64-Bit Server VM (build 17.0.3-internal+0-adhoc..src, mixed mode, sharing)

Command:

rnabloom -outdir rnabloom_out -t 8 -long input.fastq -ntcard

Sample input read to reproduce single-element header:

@read1
AATTTGGGTGTTTAACCAGTCATCGCCTACCGTGACTTCGGATTCATCGTGTTTCGTTTTCGTGCGCCGCTTCAACATGGGGCTAATCATTGCTTTCGTGCGCCATTCAACATGGAATAATCATTGCTTTTTCGTGCGCCGCTTCAACATGGGGGGCCACGCGCGCGTCCCCCGAAGGCGCGTAACGCTGTGGCGGCCTGCTT
+
%*'('((,./;:3,''%%&#$%(*$$&(*-30441004/*.1110)*.06{?;?<)57??@76341{9334?C9B@:999JA?;88<@::7610/--+224.,,'&&''-612105'&&,127<<820.-:::34475{;545-?8454;==??8877...F{{{{<//101/.*,/12{{1.'&&$$$$%$'('''$%&&&'

kmnip commented 2 years ago

Hi @schorlton,

Are you seeing different FASTA header formats in the final output (i.e. rnabloom.transcripts.fa) of different assemblies? Or do you mean that different output FASTA files from the same assembly have different FASTA header formats?

If it is the latter, then it is actually intentional.

Ka Ming

schorlton commented 2 years ago

Are you seeing different FASTA header formats in the final output (i.e. rnabloom.transcripts.fa) of different assemblies?

Yes, this. Different input reads lead to differently formatted FASTA headers. Sorry that wasn't clear. I like the

 >3 l=228 c=1.1 s=8

header format because I use the coverage and length information. However, not all transcripts have this information in the header; e.g., if you run RNA-Bloom on the example read above, you'll only get a FASTA header with a sequence identifier and no coverage or length information.
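
Until the headers are consistent, a downstream script can be written to accept both styles. The following is only a minimal sketch, not part of RNA-Bloom: the names `HeaderInfo` and `parse_header` are hypothetical, and it assumes that `l` is the length and `c` is the coverage, as the usage described above suggests. The meaning of `s` is not stated in this thread, so it is kept as an uninterpreted string. For an identifier-only header such as `>s1`, length and coverage are returned as `None` rather than guessed.

```python
from typing import Dict, NamedTuple, Optional


class HeaderInfo(NamedTuple):
    """Fields recovered from one FASTA header line (hypothetical helper type)."""
    seq_id: str
    length: Optional[int]      # from "l=...", None if absent
    coverage: Optional[float]  # from "c=...", None if absent
    extra: Dict[str, str]      # any other key=value tokens, left uninterpreted


def parse_header(line: str) -> HeaderInfo:
    """Parse '>3 l=228 c=1.1 s=8' as well as the bare '>s1' style."""
    fields = line.strip().lstrip(">").split()
    if not fields:
        raise ValueError("empty FASTA header")
    seq_id = fields[0]
    attrs: Dict[str, str] = {}
    for token in fields[1:]:
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            attrs[key] = value
    # Assumption: 'l' is length and 'c' is coverage; absent means unknown.
    length = int(attrs.pop("l")) if "l" in attrs else None
    coverage = float(attrs.pop("c")) if "c" in attrs else None
    return HeaderInfo(seq_id, length, coverage, attrs)


if __name__ == "__main__":
    print(parse_header(">3 l=228 c=1.1 s=8"))
    print(parse_header(">s1"))
```

When `length` comes back as `None`, it can still be computed from the sequence itself; the coverage cannot be recovered this way, so the sketch leaves it unset instead of assuming a value such as 1.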

kmnip commented 2 years ago

Ah, ok. You see this header style in some assemblies but not others because some assemblies may have ended at an earlier stage.

To resolve this issue, I will try to standardize the final output FASTA regardless of the assembly endpoint.