bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

Transcriptome assembly from ONT direct RNA sequencing seems to be incomplete #57

Open OliviaAgatha opened 10 months ago

OliviaAgatha commented 10 months ago

After running Oxford Nanopore direct RNA sequencing with total RNA as input, I preprocessed the data with fastp and then assembled the result with RNA-Bloom2 using the following command:

rnabloom -long /path/file.fastq -outdir /path

All the other parameters are set as default.

I ran BUSCO on the RNA-Bloom output to assess the completeness of the transcriptome, but the completeness is below 10%.

What is your advice on this matter?

Thank you.

kmnip commented 10 months ago

Several questions:

  1. Which version of RNA-Bloom did you use?
  2. What was your command to fastp? I think fastp is more oriented towards short Illumina reads.
  3. How many reads are there before and after running fastp? I wonder if too many reads are discarded or over-trimmed.
  4. What is the pore chemistry for your ONT flowcell? (e.g. R9.4, R10, etc.)

Typically, I do not see adaptors in ONT direct RNA reads. So, I would use the raw reads for assembly.

OliviaAgatha commented 10 months ago

Hi kmnip,

  1. I ran RNA-Bloom version 2.0.1 on the data both before and after fastp. Both BUSCO completeness scores are below 10%.
  2. I used fastp -i input1.fq.gz -o output1.fq.gz
  3. Before running fastp: 536000 reads. After fastp: 453509 reads.
  4. R9 version

In this case, what would you advise? Thank you.
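The read counts quoted above can be reproduced, and the retained fraction worked out, with a couple of small shell helpers. This is a minimal sketch, not part of RNA-Bloom or fastp: the function names `count_fastq_reads` and `percent_retained` are made up for illustration, and the counter assumes the usual four-lines-per-record FASTQ layout with no wrapped sequence lines.

```shell
# Count reads in a FASTQ file, assuming 4 lines per record.
# Handles plain and gzip-compressed (.gz) input.
count_fastq_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" | awk 'END { print int(NR / 4) }' ;;
    *)    awk 'END { print int(NR / 4) }' "$1" ;;
  esac
}

# Percentage of reads kept after filtering.
# Usage: percent_retained BEFORE AFTER
percent_retained() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * after / before }'
}

# With the counts reported in this thread:
#   percent_retained 536000 453509    # prints 84.6
```

With these numbers, fastp removed 82,491 reads, roughly 15% of the input.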

kmnip commented 10 months ago

With R9, your direct RNA-seq reads would have a high error rate. With just half a million reads, there isn't a lot of error correction or polishing that can be done using long reads alone.

A few more questions:

  1. What kind of species are you assembling?
  2. What was your command for running BUSCO?
  3. Can you please check the BUSCO completeness of your raw reads?

OliviaAgatha commented 10 months ago

In this case, since I also have short-read data, would you suggest using short-read polishing?

Would the command be as follows?

java -jar RNA-Bloom.jar -stranded -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR

As I have several FASTQ files for each of -sef and -ser, how would you suggest I proceed?

I will also check the BUSCO completeness of the raw reads.

Thank you!

kmnip commented 10 months ago

Yes, please include the short reads, which should improve the error correction step.

Your command looks correct. You can include multiple read file paths separated by whitespace, e.g.

java -jar RNA-Bloom.jar -stranded \
-long LONG_1.fastq LONG_2.fastq \
-sef SHORT_FORWARD_1.fastq SHORT_FORWARD_2.fastq SHORT_FORWARD_3.fastq \
-ser SHORT_REVERSE_1.fastq SHORT_REVERSE_2.fastq SHORT_REVERSE_3.fastq \
-t THREADS \
-outdir OUTDIR

Also, make sure your BUSCO is using the appropriate dataset for your species and the transcriptome mode is turned on. For example, this is my command for running BUSCO v5.3.2 on a spruce tree transcriptome assembly:

busco -i SEQUENCES.fasta \
-o OUTDIR \
-l embryophyta_odb10 \
-m transcriptome \
-c 12

OliviaAgatha commented 10 months ago

As I loaded total RNA rather than polyA-enriched RNA for the ONT run, could this also affect the assembly? I noticed that out of the 500k reads in the raw ONT data, only 6.8k sequences are assembled by RNA-Bloom2.

I combined all the forward FASTQ files into one forward.fastq and all the reverse FASTQ files into one reverse.fastq, then ran the command:

java -jar RNA-Bloom.jar -stranded \
-long long.fastq \
-sef forward.fastq \
-ser reverse.fastq \
-t THREADS \
-outdir OUTDIR

With this command, the resulting sequences assembled are around 6.4k.
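The concatenation step described above can be sketched in shell as follows. The helper names `merge_fastq` and `same_line_count` are hypothetical, as are the input file names in the comments. The sketch assumes mates are matched by position, which is the usual convention for paired FASTQ files, so forward and reverse inputs should be listed in the same order. The thread does not say this step is necessary, since multiple paths can also be passed to -sef and -ser directly.

```shell
# Concatenate several FASTQ files into one.
# Usage: merge_fastq OUTPUT INPUT_1 [INPUT_2 ...]
merge_fastq() {
  merged="$1"
  shift
  cat "$@" > "$merged"
}

# Succeeds only if both files have the same number of lines,
# a cheap check that forward and reverse reads still pair up.
same_line_count() {
  [ $(wc -l < "$1") -eq $(wc -l < "$2") ]
}

# Example (hypothetical file names, same order on both sides):
#   merge_fastq forward.fastq SHORT_FORWARD_1.fastq SHORT_FORWARD_2.fastq
#   merge_fastq reverse.fastq SHORT_REVERSE_1.fastq SHORT_REVERSE_2.fastq
#   same_line_count forward.fastq reverse.fastq || echo "read counts differ" >&2
```

A line-count match does not prove that read names agree, but a mismatch is a quick sign that a file was missed or listed twice.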

Could you advise on this matter?

Thank you!

kmnip commented 10 months ago

Was there any ribosomal RNA depletion before sequencing? It is possible that your read set is predominantly rRNA. You could check the assembled transcripts for rRNA.

The lower number of assembled transcripts can be due to more reads being collapsed as a result of improved error correction.

As mentioned previously, please also check your BUSCO command.

OliviaAgatha commented 10 months ago

No rRNA depletion was carried out. Do you have any advice on how I can check for the presence of rRNA in the assembly?

For BUSCO, I realised that another possible reason for the low completeness is that the RNA comes from only one organ.

So the main problem now is the absence of expected transcripts.

In this case, would you recommend repeating the ONT sequencing with polyA-enriched RNA? Or is there anything else that could be done to salvage the current assembly?

Thank you.

kmnip commented 10 months ago

> No rRNA depletion was carried out. Do you have any advice on how I can check for the presence of rRNA in the assembly?

You can try SILVA's web aligner for their rRNA database: https://www.arb-silva.de/aligner/ or RiboDetector: https://github.com/hzi-bifo/RiboDetector

> For BUSCO, I realised that another possible reason for the low completeness is that the RNA comes from only one organ.

If your dRNA reads originate from a single organ, then it is not too surprising to have a low BUSCO completeness. 10% completeness does seem low, but BUSCO only looks at core conserved genes.

> So the main problem now is the absence of expected transcripts.

What are you trying to get out of the sequencing reads (and the assembly)?

If your short reads also originate from the same organ, then (as a sanity check) you can compare the BUSCO results between a short-read assembly and a long-read assembly.

> In this case, would you recommend repeating the ONT sequencing with polyA-enriched RNA?

If you have the budget to do another sequencing experiment with polyA-enriched RNA, then I suggest you do it. You can use both total RNA reads and polyA-enriched RNA reads in one assembly. Therefore, your 536,000 total RNA reads are not entirely "wasted".

> Or is there anything else that could be done to salvage the current assembly?

One last test is to set -lrrd 1 in RNA-Bloom. This lowers the minimum read depth required in long-read assembly to 1. See if this increases the BUSCO completeness, but note that this assembly may have more noise/artifacts.
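To compare BUSCO completeness between two assemblies, for example a default run and a -lrrd 1 run, the "Complete" percentage can be pulled out of each BUSCO short summary file. This is a rough sketch: `busco_complete_pct` is a made-up helper name, and it assumes the one-line score string that BUSCO v5 writes to its short summary. The numbers in the comment are illustrative only, not results from this thread.

```shell
# Print the "Complete" percentage from a BUSCO short summary file.
# Assumes a score line of the form (values here are illustrative):
#   C:85.0%[S:80.0%,D:5.0%],F:5.0%,M:10.0%,n:1614
busco_complete_pct() {
  sed -n 's/.*C:\([0-9.]*\)%\[.*/\1/p' "$1" | head -n 1
}

# Example (hypothetical paths to the two summaries):
#   busco_complete_pct default_run/short_summary.txt
#   busco_complete_pct lrrd1_run/short_summary.txt
```

Running it on both summaries gives two numbers that can be compared directly. The fragmented and missing fractions in the same score line are also worth checking.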