Output doubts single-cell RNA-seq assembly

MartaBenegas commented 3 years ago

version of RNA-Bloom with java -jar RNA-Bloom.jar -version

root@0d73be8b2e5c:/data/output# rnabloom -version
RNA-Bloom v1.3.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018

version of java with java -version

root@0d73be8b2e5c:/data# java -version
openjdk version "1.8.0_275"
OpenJDK Runtime Environment (build 1.8.0_275-8u275-b01-0ubuntu1~18.04-b01)
OpenJDK 64-Bit Server VM (build 25.275-b01, mixed mode)

exact command used to run RNA-Bloom

root@0d73be8b2e5c:/datat# rnabloom -pool readslist.txt -revcomp-right -ntcard -mergepool -outdir output/

Hi RNA-Bloom Team, I've made a first tiny test with your assembler and I have a few questions regarding the output:

I've noticed that, for each cell, it generates 4 fasta files: cell.transcripts.fa, cell.transcripts.nr.fa, cell.transcripts.nr.short.fa, and cell.transcripts.short.fa. What is the difference between them?
What are all the .nbits files generated?
What I'm trying to do is establish a workflow to analyze single-cell RNA-seq data for non-model organisms. I want to reconstruct the whole transcriptome by pooling the reads from all cells and assembling them, so that I can later perform gene prediction and use the result as a reference transcriptome to obtain the gene expression matrix. Can I use the output of the -mergepool option for this analysis, or was it meant for something else? I don't fully understand what this option does: does it reconstruct the transcriptome from the reads of all cells together, or does it only merge the transcriptomes assembled for each cell independently?

I hope that I explained my questions clearly! Marta.

kmnip commented 3 years ago

Hi @MartaBenegas ,

The 4 files are generated at the last stage of assembly for each cell. transcripts.fa and transcripts.short.fa are transcript sequences extended from reconstructed fragments. The difference between them is that transcripts.short.fa contains sequences shorter than the minimum length threshold (default is 200 bp), whereas transcripts.fa contains those longer than or equal to the length threshold. To reduce redundancy (e.g. duplicated sequences with slight mismatches) in these FASTA files, I extract non-redundant sequences, hence the "nr" in the file names transcripts.nr.fa and transcripts.nr.short.fa.
The .nbits files contain reconstructed fragments from read pairs. They are binary files rather than FASTA to reduce disk usage.
What people typically do is assemble the reads from all cells together as if they were bulk RNA-seq data. The problem with this approach is that the assembly might not be very precise. So, I recommend using the -pool option together with -mergepool. Basically, you still generate an assembly for each cell, but -mergepool adds a step that collapses the assembled transcripts from all cells into a single file, which you could use as a reference transcriptome.
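
To make the length-based split in the first answer concrete, here is a minimal Python sketch. It is not RNA-Bloom's code: `parse_fasta` and `split_by_length` are hypothetical helpers written for illustration, and the only detail taken from the explanation above is the default 200 bp threshold, with sequences at or above it counted as "long".

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 200  # default minimum length threshold mentioned above

Record = Tuple[str, str]  # (name, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(
    records: Iterable[Record], min_length: int = MIN_LENGTH
) -> Tuple[List[Record], List[Record]]:
    """Partition records into (long, short) lists.

    Sequences with length >= min_length go to the first list (the
    counterpart of transcripts.fa); shorter ones go to the second
    (the counterpart of transcripts.short.fa).
    """
    long_records: List[Record] = []
    short_records: List[Record] = []
    for name, seq in records:
        if len(seq) >= min_length:
            long_records.append((name, seq))
        else:
            short_records.append((name, seq))
    return long_records, short_records
```

Running the same split on a non-redundant set of sequences would give the counterparts of the two "nr" files.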

Hope that helps! Ka Ming

MartaBenegas commented 3 years ago

Yes, that helped a lot! Thank you very much for your explanation :)

Marta.

MartaBenegas commented 3 years ago

Well, I would like to ask one last thing! I was searching for single-cell RNA-seq assemblers for the purpose I mentioned, so I was kind of surprised to see that the default is to provide the assembled transcriptome for each cell independently. I was wondering what that could be used for: maybe isoform studies at the single-cell level, or something more? I would really appreciate it if you could shed some light on the matter.

Thank you, Marta.

kmnip commented 3 years ago

Yes, the individual assemblies could be used to investigate differential usage of splice-junctions, for example.

It was for legacy reasons that the -mergepool option wasn't turned on by default. Prior to that option being implemented, we used custom scripts to merge the assemblies. If you still find duplicates in your merged assembly (from -mergepool), then I recommend using the dedupe script from BBMap: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/
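
As a rough illustration of what removing duplicates from a merged assembly means, the Python sketch below drops exact duplicates, treating a sequence and its reverse complement as the same transcript. This is a deliberately simplified stand-in and not BBMap's dedupe, which, per its guide, can also handle contained and near-identical sequences. All function names here are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str]  # (name, sequence)

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Strand-independent key: the smaller of the sequence and its reverse complement."""
    forward = seq.upper()
    return min(forward, reverse_complement(forward))


def drop_exact_duplicates(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each sequence, ignoring strand and case."""
    seen: Set[str] = set()
    kept: List[Record] = []
    for name, seq in records:
        key = canonical(seq)
        if key in seen:
            continue
        seen.add(key)
        kept.append((name, seq))
    return kept
```

Because only exact matches are collapsed, a filter like this leaves near-duplicates (for example, transcripts differing by a few mismatches) untouched.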

MartaBenegas commented 3 years ago

Okay, I will take it into account. Thank you very much!

MartaBenegas commented 3 years ago

Hi again! I followed your approach and obtained an assembled, merged transcriptome of nearly 2 million transcripts (from 192 cells). That's a really big number, so the transcriptome is probably fragmented or contains duplicates (I'm running some analyses to check). That made me wonder whether this approach is suitable for my purpose (assembling the transcriptome and mapping reads back against it). To try to solve this, I came up with the idea of pooling all reads into one file and passing it to RNA-Bloom as if it were one big cell. What do you think?
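
The pooling idea above could be sketched in Python roughly as follows, assuming each cell has one left and one right read file. `pool_reads` is a hypothetical helper, not part of RNA-Bloom. The one point that matters is that left and right files are concatenated in the same cell order, so read pairs stay in sync between the two pooled files.

```python
import shutil
from typing import Iterable, Tuple


def pool_reads(
    cells: Iterable[Tuple[str, str]], out_left: str, out_right: str
) -> int:
    """Concatenate per-cell (left, right) read files into two pooled files.

    Files are appended byte for byte, in the order given, to both outputs.
    Returns the number of cells pooled.
    """
    pooled = 0
    with open(out_left, "wb") as left_out, open(out_right, "wb") as right_out:
        for left_path, right_path in cells:
            with open(left_path, "rb") as src:
                shutil.copyfileobj(src, left_out)
            with open(right_path, "rb") as src:
                shutil.copyfileobj(src, right_out)
            pooled += 1
    return pooled
```

Byte-level concatenation assumes every input file ends with a newline and that all inputs are either plain text or gzip-compressed, not a mix of both.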

Regards, Marta.

kmnip commented 3 years ago

Yes, I think you can give that a shot and see how different the assemblies are. Did you try running bb-tools' dedupe script on the merged assembly?

MartaBenegas commented 3 years ago

Hi! I've run RNA-Bloom on the "pooled" files I told you about. I've also run dedupe and CD-HIT on both the original dataset (with 2 fastq files per cell) and the pooled dataset, and here are the results:

RNA-Bloom original

1,893,307 transcripts
506,935 after CD-HIT (97% similarity)

RNA-Bloom pooled

1,061,324 transcripts
1,058,202 after dedupe
367,864 after CD-HIT (97% similarity)

I didn't run dedupe on the "original" dataset because it removed very few sequences from the "pooled" dataset. As you can see, the recovered transcriptomes seem to contain a lot of duplicated sequences. However, I've also run rnaQUAST to evaluate them and, despite the duplicates, they seem quite good ;) I thought you would like to know!
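
For a quick sense of scale, the short Python snippet below turns the counts reported above into percentages of the assembled transcripts that remain after each step. The numbers are copied from the lists above. The dictionary layout and the `percent_retained` helper are just for illustration.

```python
# Transcript counts as reported above.
counts = {
    "original": {"assembled": 1_893_307, "after_cd_hit": 506_935},
    "pooled": {
        "assembled": 1_061_324,
        "after_dedupe": 1_058_202,
        "after_cd_hit": 367_864,
    },
}


def percent_retained(before: int, after: int) -> float:
    """Percentage of the starting transcripts that remain after a step."""
    return 100.0 * after / before


for dataset, steps in counts.items():
    assembled = steps["assembled"]
    for step, n in steps.items():
        if step == "assembled":
            continue
        pct = percent_retained(assembled, n)
        print(f"{dataset:8s} {step:13s} {n:>9,d} ({pct:.1f}% of assembled)")
```

With these numbers, CD-HIT retains roughly 27% of the transcripts from the original run and 35% from the pooled run, whereas dedupe removes only about 0.3% of the pooled transcripts.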

Regards, Marta.

kmnip commented 3 years ago

Thanks for reporting your results! :)

If I remember correctly, CD-HIT groups transcripts at the CDS level into gene-like clusters, but it keeps only the transcript with the longest CDS for each cluster. So, many alternative isoforms are removed. That's why you see a large reduction in the number of transcripts. Based on my colleagues' experience, they prefer EvidentialGene over CD-HIT: http://arthropods.eugenes.org/EvidentialGene/trassembly.html

Thanks, Ka Ming

MartaBenegas commented 3 years ago

Hi Ka Ming, I think that for my purpose it's okay to use the CD-HIT results, because I want to use the transcriptome as a reference to quantify expression at the gene level rather than the isoform level. However, it's good to know the limitations of CD-HIT. To be honest, I hadn't realized that, so I'm really thankful for your advice! I didn't know about EvidentialGene either and it seems very interesting. I'm certainly going to look into it, as it could be really useful in many other scenarios.

Thank you very much, Marta.

bcgsc / RNA-Bloom

Output doubts single-cell RNA-seq assembly #16