bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
90 stars 7 forks source link

Best practices for pre-processing prior to assembly #77

Open patrickaoude opened 1 month ago

patrickaoude commented 1 month ago

Hi,

I am interested in using long and short reads to assemble a transcriptome with RNA-Bloom2.

I had a few questions regarding pre-processing of reads before using RNA-Bloom2.

I am loosely following this paper: A simple guide to de novo transcriptome assembly and annotation

The steps I plan to include based on the paper:

Prior to error correction via fastp, it is recommended to use k-mer correction with Rcorrector, however this tool (Rcorrector) is intended for short reads only. Do you have any alternative recommendation for this that's appropriate for long reads? I was thinking I could potentially use RATTLE and take the intermediate output after clustering and correction, but I am unsure if this approach is sound. I also posted on RATTLE's github to see the author's opinion on this approach.

Alternatively, is k-mer correction a necessary step given the way that RNA-Bloom2 works? If you have any suggestions for what I should do for pre-processing prior to assembly please let me know!

Thanks, Patrick

kmnip commented 1 month ago

Hi @patrickaoude ,

For short-read RNA-seq data, I had often used fastp for trimming adapters, low quality bases, etc. RNA-Bloom's error correction for short reads is very similar to Rcorrector. If you are interested in assembling long-read RNA-seq data, then fastp and Rcorrector are not suitable as I believe they are only intended for short reads.

I think majority of the tools mentioned in that paper are largely intended for short-read RNA-seq data, but the core ideas are still relevant for long reads in my opinion. RNA-Boom already does its own correction for long reads based on k-mers. As ONT and PacBio have improved the error rate of their sequencing data, you probably don't need an additional correction tool for that purpose.

If the RNA in your sample were poly-A selected, then you most likely do not need rRNA removal. Contamination detection is also very useful, especially if you are assembling reads for a species without a reference genome. As an FYI, Kraken2 did not perform very well in long reads according to this paper.

I haven't tried combining RNA-Bloom with RATTLE's clustering. However, I actually implemented a very crude clustering strategy for long reads in RNA-Bloom over 5 years ago, but it wasn't very efficient and the assembly ended up not very good either.

One thing that I strongly suggest looking into are adaptor trimming tools for long reads (e.g. Porechop, Pychopper, etc.). These tools can also potentially reduce chimeric reads, which are one of the major causes of misassemblies (in RNA-Bloom).

Hope that helps!

patrickaoude commented 1 month ago

Hi Ka Ming,

Thanks for your detailed and quick reply, it's very helpful. I had a few other clarifications to make if you get the chance:

Thanks again, Patrick

kmnip commented 1 month ago
  • fastp documentation states that it actually is compatible with long reads from pacbio/ONT, I thought this may be useful for your knowledge.

I have seen that message in their README for a number of years. You can use fastp for filtering long reads based on base quality, etc. I am uncertain in how well fastp performs on trimming adapters in ONT reads.

  • It seems based off the paper you linked that Minimap2 would be a good tool to map sequence contaminants. Would you also use it for rRNA removal since it appears SortMeRNA is intended for short reads only as are many tools in the paper I linked? (I do not believe there was poly-A selection but will need to confirm with core facility)

If the RNA were not poly-A selected, then it could likely be ribo-depleted. Regardless, you simply map your reads with minimap2 against a set of rRNA sequences (e.g. from SILVA).

  • Based off of this paper I am setting up FMLRC2 for long-read error correction. This tool is designed to take in the short reads and use them to correct the long reads. In such cases where short reads are used to correct the long reads, is there a point to use the short reads along with the long reads when I ultimately perform transcriptome assembly with tools such as RNABloom2? My intuition would be that it is unnecessary or perhaps even flawed to use those same short reads in transcriptome assembly when they have already been used for error correction (and in which case I would simply provide the corrected long reads to RNABloom2).

It depends on the sequencing chemistry. If it is R9.4, then short-reads do lower the error rate. If you would like to correct the long reads with short reads in your long-read assembly, you can supply your short reads to RNA-Bloom. Note that RNA-Bloom does not perform hybrid assembly of long and short reads.

  • If I am to use the short reads with cDNA long reads, would my command look something like: java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT_FORWARD.fastq -ser SHORT_REVERSE.fastq -t THREADS -outdir OUTDIR

If your reads are cDNAs (i.e. not direct RNA sequencing), then your command can be simplified to: java -jar RNA-Bloom.jar -long LONG.fastq -sef SHORT_FORWARD.fastq SHORT_REVERSE.fastq ... Due to the lack of strandedness in cDNAs, the options -sef and -ser will have the same effect.

patrickaoude commented 3 weeks ago

Hi Ka Ming,

Thanks again for your reply. I tried running my data with RNABloom2 and received an error in the third stage, I was hoping you might have some insight into what might have gone wrong:

RNA-Bloom v2.0.1
args: [-t, 40, -o, results/rnabloom_assembly, -long, results/starfish_long_reads_trimmed_unaligned_human_LSU_SSU.fastq, -sef, results/starfish_short_reads.cor.unfixrm.trimmed.unclassified.unaligned_rRNA.fq]

name:   rnabloom
outdir: results/rnabloom_assembly

Turning on option `-ntcard` to count k-mers

K-mer counting with ntCard...
Parsing histogram file `results/rnabloom_assembly/rnabloom_k25.hist`...
Unique k-mers (k=25):     10,063,508,110
Unique k-mers (k=25,c>1): 1,860,934,007
K-mer counting completed in 0.09s

Bloom filters          Memory (GB)
====================================
de Bruijn graph:       22.238815
k-mer counting:        32.89904
====================================
Total:                 55.137856

> Stage 1: Construct graph from reads (k=25)
Parsing `results/starfish_short_reads.cor.unfixrm.trimmed.unclassified.unaligned_rRNA.fq`...
Parsed 1,470,412,235 sequences in 5h 45m 24s
Parsing `results/starfish_long_reads_trimmed_unaligned_human_LSU_SSU.fastq`...
Parsed 40,419,128 sequences in 2h 13m 37s
Parsed 1,510,831,363 sequences from 2 files.
DBG Bloom filter FPR:                 0.814 %
Counting Bloom filter FPR:            0.968 %
> Stage 1 completed in 8h 0m 38s

> Stage 2: Correct long reads for "rnabloom"
Parsing `results/starfish_long_reads_trimmed_unaligned_human_LSU_SSU.fastq`...
Corrected Read Lengths Sampling Distribution (n=10000)
        min     q1      med     q3      max
        26      556     913     1403    4349
Parsed 40,419,185 sequences.
        Kept:      40,417,992   (100.0 %)
        Discarded: 1,193        (0.00295 %)
        Artifacts: 418,537      (1.035491%)
Corrected reads in 2h 37m 28s
Extracting seed sequences...
strobemers: n=3, k=11, wMin=12, wMax=61, depth=3
Bloom filter FPR:       1.87 %
before: 40,178,664      after: 6,366,219 (15.8 %)
too short: 0
Extraction completed in 1h 56m 2s
> Stage 2 completed in 4h 33m 35s

> Stage 3: Assemble long reads for "rnabloom"
Overlapping sequences...
ERROR: Index 8 out of bounds for length 8
java.lang.ArrayIndexOutOfBoundsException: Index 8 out of bounds for length 8
        at rnabloom.io.PafRecord.update(PafRecord.java:39)
        at rnabloom.io.ExtendedPafRecord.update(ExtendedPafRecord.java:32)
        at rnabloom.io.PafReader.next(PafReader.java:63)
        at rnabloom.olc.Layout.extractUniqueFromOverlaps(Layout.java:1660)
        at rnabloom.olc.OverlapLayoutConsensus.overlapWithMinimapAndExtractUnique(OverlapLayoutConsensus.java:156)
        at rnabloom.olc.OverlapLayoutConsensus.uniqueOLC(OverlapLayoutConsensus.java:1159)
        at rnabloom.RNABloom.assembleUnclusteredLongReads(RNABloom.java:3314)
        at rnabloom.RNABloom.main(RNABloom.java:7430)

I don't believe it's a memory issue (100 GB supplied), but if you have any suggestions for how I can troubleshoot this please let me know!

Thanks again! Patrick

kmnip commented 3 weeks ago

The first two stages look fine.

In stage 3, the overlapping step with minimap2 had some issues. The PAF output from minimap2 looks to be truncated. There should be 12 columns, but only 8 were found.

Can you show me the content of the file rnabloom.longreads.assembly1.nr.fa.gz.log?

patrickaoude commented 3 weeks ago

Hmm, maybe it was a memory issue after all, not sure why the job would have been killed otherwise. Here are the file contents:

[M::mm_idx_gen::159.636*2.22] collected minimizers
[M::mm_idx_gen::174.042*4.34] sorted minimizers
[M::main::174.042*4.34] loaded/built the index for 3780118 target sequence(s)
[M::mm_mapopt_update::177.484*4.28] mid_occ = 4876
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 3780118
[M::mm_idx_stat::179.490*4.24] distinct minimizers: 170372867 (35.55% are singletons); average occurrences: 15.988; average spacing: 2.937; total length: 8000200222
[M::worker_pipeline::7233.171*31.50] mapped 58529 sequences
[M::worker_pipeline::12006.773*31.78] mapped 67603 sequences
Killed