bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

Reg: Normalization of PE datasets #3

Closed harish0201 closed 4 years ago

harish0201 commented 4 years ago

Hi,

I have RNA-Seq data for around 20 samples, with ~25 Gb of data per sample. I generally normalize the dataset to a target k-mer depth of 100 at k=25 (using BBNorm, diginorm, etc.) beforehand, so that I can hand it off easily to various assemblers and save time.

So this raises the question: should I normalize the dataset before handing it off to RNA-Bloom, for parity with the other assemblers, or is that unnecessary given that ntCard will recompute the k-mer counts used to build the Bloom filters anyway?

Also, just curious: since ntCard is being used, have you tried a multi-k-mer assembly?

harish0201 commented 4 years ago

As a follow-up question:

I've already finished an assembly on the normalized data, and I'm curious: since I don't have any Nanopore data, why is redundancy removal done with the ava-ont preset in minimap2?

Would the asm5, asm10, or asm20 presets be better suited here?

kmnip commented 4 years ago

Hi @harish0201,

Thanks for your thoughtful questions!

Technically, you don't need to normalize the dataset if you have enough RAM; RNA-Bloom uses Bloom filters to keep memory usage low.
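For intuition, a Bloom filter trades a small false-positive rate for a fixed memory footprint, regardless of how many k-mers are streamed into it. A minimal sketch in Python (purely illustrative; this is not RNA-Bloom's actual implementation, and the sizes here are arbitrary):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed-size bit array probed by several
    hash functions. Queries may return false positives but never false
    negatives, which is why a fixed memory budget suffices."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("ACGTACGTACGT")
print("ACGTACGTACGT" in bf)  # True (no false negatives)
```

The key property for assembly is that inserting more k-mers never grows the structure; only the false-positive rate rises.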

RNA-Bloom doesn't do multi-k assembly internally, but our intention with using k-mer pairs (derived from reads and reconstructed fragments) is to achieve an effect similar to using larger k-mers. However, there is indeed one feature in RNA-Bloom that chooses the "best" k-mer size from ntCard histograms. For example, if you specify -k 25-60:5, RNA-Bloom will run ntCard for k-mer sizes between 25 and 60 with a step size of 5, i.e. [25,30,35,40,45,50,55,60]. From the ntCard histograms, RNA-Bloom chooses the k-mer size with the largest number of unique k-mers. In my experience, this feature gives a reasonable trade-off between sensitivity and misassemblies, but the default k-mer size of 25 did give higher sensitivity in my benchmarking experiments.
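To make the selection above concrete, here is a hypothetical sketch of parsing a -k range spec and picking the k with the most distinct k-mers from per-k histograms. The function names and histogram format are assumptions for illustration, not RNA-Bloom's code:

```python
def parse_k_range(spec):
    """Parse a range spec like '25-60:5' into [25, 30, ..., 60],
    mirroring the -k syntax described above (illustrative parser)."""
    rng, _, step = spec.partition(":")
    lo, _, hi = rng.partition("-")
    step = int(step) if step else 1
    hi = int(hi) if hi else int(lo)
    return list(range(int(lo), hi + 1, step))

def pick_best_k(histograms):
    """Given {k: histogram} where histogram[f] = number of distinct
    k-mers occurring f times, pick the k with the most distinct k-mers."""
    return max(histograms, key=lambda k: sum(histograms[k].values()))

print(parse_k_range("25-60:5"))  # [25, 30, 35, 40, 45, 50, 55, 60]
```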

RNA-Bloom extends fragments (reconstructed from read pairs) into long transcripts. Since redundant sequences with mismatches and small indels cannot always be collapsed with a k-mer-based approach, I treat the assembled transcripts as long reads and perform all-vs-all alignments to identify redundant sequences. But you have a good point; I still have to benchmark the other presets and see which one works best for short-read data.
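As a rough illustration of containment-based redundancy removal from all-vs-all overlaps, here is a toy example over PAF-like records. The field subset, the min_cov threshold, and the length rule are all assumptions for the sketch; RNA-Bloom's actual criteria may differ:

```python
from collections import namedtuple

# Minimal PAF-like overlap record (illustrative subset of fields).
Overlap = namedtuple("Overlap", "qname qlen qstart qend tname tlen tstart tend")

def contained_ids(overlaps, min_cov=0.99):
    """Flag query sequences whose alignment covers >= min_cov of their
    length against a target at least as long -- a simple stand-in for
    dropping transcripts contained in longer ones."""
    contained = set()
    for ov in overlaps:
        cov = (ov.qend - ov.qstart) / ov.qlen
        if cov >= min_cov and ov.tlen >= ov.qlen:
            contained.add(ov.qname)
    return contained

ovs = [
    Overlap("tx_short", 500, 2, 500, "tx_long", 1200, 300, 798),
    Overlap("tx_a", 900, 0, 450, "tx_b", 1000, 0, 450),  # partial overlap only
]
print(contained_ids(ovs))  # {'tx_short'}
```

Loosening min_cov (or tolerating larger indels at the alignment stage) removes more duplicates but risks collapsing genuine isoforms, which is the trade-off discussed below.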

EDIT: Is that Lucario in your avatar? I like the design!

Thanks, Ka Ming

harish0201 commented 4 years ago

Hey,

Thanks for the comments :)

I figured that normalization wouldn't be necessary, but I was curious. The assembly was pretty good as well, although there were a lot of duplicates; I got roughly 90% complete BUSCOs. Here are the BUSCO results for the assemblies:

RNA-Bloom: C:90.8%[S:15.1%,D:75.7%],F:2.2%,M:7.0%,n:1440
Trinity:   C:89.2%[S:27.9%,D:61.3%],F:3.5%,M:7.3%,n:1440
BinPacker: C:82.1%[S:29.4%,D:52.7%],F:7.4%,M:10.5%,n:1440
rnaSPAdes: C:86.4%[S:46.5%,D:39.9%],F:5.3%,M:8.3%,n:1440

Transcripts assembled:

rnaSPAdes: 282755
BinPacker: 392277
Trinity:   494047
RNA-Bloom: 550102
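For anyone comparing these summary strings programmatically, a small helper (not part of BUSCO or any of the assemblers; the notation is C=complete, S=single-copy, D=duplicated, F=fragmented, M=missing, n=total BUSCOs searched):

```python
import re

def parse_busco(summary):
    """Parse a BUSCO short-summary string such as
    'C:90.8%[S:15.1%,D:75.7%],F:2.2%,M:7.0%,n:1440' into a dict."""
    fields = {}
    for key, val in re.findall(r"([CSDFMn]):([\d.]+)%?", summary):
        fields[key] = int(val) if key == "n" else float(val)
    return fields

scores = parse_busco("C:90.8%[S:15.1%,D:75.7%],F:2.2%,M:7.0%,n:1440")
print(scores["D"])  # 75.7
```

Note that C + F + M sums to 100%, and S + D sums to C, which makes sanity checks easy.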

The reason I asked about the minimap2 step was that the duplication (based on BUSCO) was still pretty high; the NR step performed with minimap2 barely threw out any duplicates.

I have since performed NR reduction using the EvidentialGene pipeline, which reduced the assembly quite a bit. It seems that multiple paths were observed in the assembly, and RNA-Bloom reports all of them. Anyway, the BUSCO results after EvidentialGene are below:

RNA-Bloom: C:89.7%[S:75.3%,D:14.4%],F:2.9%,M:7.4%,n:1440
BinPacker: C:80.4%[S:68.9%,D:11.5%],F:7.8%,M:11.8%,n:1440
Trinity:   C:87.9%[S:73.0%,D:14.9%],F:4.1%,M:8.0%,n:1440
rnaSPAdes: C:85.6%[S:72.2%,D:13.4%],F:5.8%,M:8.6%,n:1440

Remaining transcripts:

BinPacker: 130985
RNA-Bloom: 137049
rnaSPAdes: 131129
Trinity:   157050

BUSCO results from concatenating the four raw assemblies above and running EvidentialGene:

pooled_4: C:90.5%[S:74.8%,D:15.7%],F:2.0%,M:7.5%,n:1440 (190149 transcripts)

I also looked at the proportion of transcripts contributed by each assembler; of the 190149 transcripts retained after NR with EvidentialGene:

rnaSPAdes: 39733
BinPacker: 42410
Trinity:   60633
RNA-Bloom: 47373

I still have to try the asm5, asm10, and asm20 presets, though; I'll look into that over the weekend, as I'm currently slammed with a long list of manuscript work :/

Re: Lucario, yes, yes it is! It's been a long time since I got that image, so I don't really recall where it came from :) but I think a reverse image search should be able to find it!

kmnip commented 4 years ago

Thanks for reporting your findings!

I think BUSCO is best for evaluating the reconstruction of transcripts from different genes. While some level of duplication is expected in the assemblies, I think alternative isoforms and transcripts from gene families are the main reasons why BUSCO detects a lot of "duplicates".

I think minimap2's ava-ont mode does a decent job of identifying duplicate sequences. I noticed that when I was too aggressive in removing similar sequences (i.e., allowing lower percent sequence identity, larger indels, etc.), isoform reconstruction suffered in my rnaQUAST evaluations.