Closed MartaBenegas closed 3 years ago
Hi @MartaBenegas ,
The 4 files are generated at the last stage of assembly for each cell. `transcripts.fa` and `transcripts.short.fa` contain transcript sequences extended from reconstructed fragments. The difference between them is that `transcripts.short.fa` contains sequences shorter than the minimum length threshold (default is 200 bp), whereas `transcripts.fa` contains those longer than or equal to the threshold. To reduce redundancy (e.g. duplicated sequences with slight mismatches) in these FASTA files, I extract non-redundant sequences, hence the "nr" in the file names `transcripts.nr.fa` and `transcripts.nr.short.fa`.
The `nbits` files are reconstructed fragments from read pairs. They are binary files instead of FASTA to reduce disk usage.
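To make the length split concrete, here is a minimal Python sketch of partitioning FASTA records around a minimum length threshold. It only illustrates the rule described above (shorter than the threshold goes to the "short" set, longer or equal goes to the main set); it is not RNA-Bloom's actual code, and the function names `read_fasta` and `split_by_length` are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=200):
    """Partition records into (long, short) lists around min_length.

    Sequences with length >= min_length go to the first list,
    shorter ones go to the second list.
    """
    long_records = []
    short_records = []
    for header, seq in records:
        if len(seq) >= min_length:
            long_records.append((header, seq))
        else:
            short_records.append((header, seq))
    return long_records, short_records
```

With an open file handle `fh`, `split_by_length(read_fasta(fh))` would return the two lists using the 200 bp default mentioned above.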
What people typically do is assemble the reads from all cells together as if they were bulk RNA-seq data. The problem with this approach is that the assembly might not be very precise. So, I recommend using the `-pool` option together with `-mergepool`. Basically, you still generate an assembly for each cell, but `-mergepool` does an additional step of collapsing the assembled transcripts from all cells into a single file, which you could use as a reference transcriptome.
Hope that helps! Ka Ming
Yes, that helped a lot! Thank you very much for your explanation :)
Marta.
Well, I would like to ask one last thing! I was searching for single-cell RNA-seq assemblers for the purpose I mentioned, so I was kind of surprised to see that the default is to provide the assembled transcriptome for each cell independently. So I was wondering what that could be used for: maybe isoform studies at the single-cell level, or something more? I would really appreciate it if you could shed some light on the matter.
Thank you, Marta.
Yes, the individual assemblies could be used to investigate differential usage of splice-junctions, for example.
It was for legacy reasons that the `-mergepool` option wasn't turned on by default. Prior to that option being implemented, we used custom scripts to merge the assemblies. If you still find duplicates in your merged assembly (from `-mergepool`), then I recommend using the `dedupe` script from BBMap: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/
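Before running a dedicated tool, a quick check for exact duplicates in a merged assembly could look like the Python sketch below. It is a deliberately simplified stand-in, not what `dedupe` does: it only catches sequences that are identical, or identical after reverse-complementing, and it does nothing about contained or near-identical sequences. The function names are made up for this example.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Return a strand-independent key: the smaller of seq and its reverse complement."""
    upper = seq.upper()
    return min(upper, reverse_complement(upper))


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence, ignoring strand and case.

    records is an iterable of (header, sequence) tuples.
    """
    seen = set()
    kept = []
    for header, seq in records:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept
```

Comparing the record count before and after gives a lower bound on the redundancy; a large remaining count does not mean the assembly is free of near-duplicates.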
Okay, I will take it into account. Thank you very much!
Hi again! I followed your approach and obtained an assembled, merged transcriptome of nearly 2 million transcripts (from 192 cells). That's a really big number, so the transcriptome is probably fragmented or duplicated (I'm running some analyses to check). That made me wonder whether this approach is suitable for my purpose (to assemble the transcriptome and map reads back against it). To try to solve it, I came up with the idea of pooling all reads into one file and passing it to RNA-Bloom as if it were one big cell. What do you think?
Regards, Marta.
Yes, I think you can give that a shot and see how different the assemblies are. Did you try running bb-tools' dedupe script on the merged assembly?
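For reference, pooling the per-cell reads as described above can be sketched in Python as a plain concatenation. This assumes paired-end data with one left and one right file per cell, and it writes one pooled file per mate (an assumption made here so that mates stay in the same order in both files). The function name `pool_reads` and the file layout are hypothetical.

```python
import shutil


def pool_reads(pairs, pooled_left, pooled_right):
    """Concatenate per-cell read files into one pooled file per mate.

    pairs is a list of (left_path, right_path) tuples, one per cell.
    Files are copied byte for byte in the same cell order for both mates,
    so the i-th left read still matches the i-th right read.
    Plain-text inputs must end with a newline; gzip inputs can also be
    concatenated this way because a multi-member gzip file is valid.
    """
    with open(pooled_left, "wb") as out_left, open(pooled_right, "wb") as out_right:
        for left_path, right_path in pairs:
            with open(left_path, "rb") as src:
                shutil.copyfileobj(src, out_left)
            with open(right_path, "rb") as src:
                shutil.copyfileobj(src, out_right)
```

Do not mix compressed and uncompressed inputs in one pooled file, since the result would not be readable as either.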
Hi! I've run RNA-Bloom with the "pooled" files I told you about. I've also run dedupe and CD-HIT on both the original dataset (with 2 FASTQ files for each cell) and the pooled dataset, and here are the results:
RNA-Bloom original
RNA-Bloom pooled
I didn't run dedupe on the "original" dataset because I saw that it didn't remove many sequences in the "pooled" dataset. As you can see, the recovered transcriptomes seem to have a lot of duplicated sequences. However, I've also run rnaQUAST to evaluate them and, despite the duplicates, they seem quite good ;) I thought you would like to know!
Regards, Marta.
Thanks for reporting your results! :)
If I remember correctly, CD-HIT groups transcripts at the CDS level into gene-like clusters, but it keeps only the transcript with the longest CDS for each cluster. So, many alternative isoforms are removed. That's why you see a large reduction in the number of transcripts. Based on their experience, my colleagues prefer EvidentialGene over CD-HIT: http://arthropods.eugenes.org/EvidentialGene/trassembly.html
Thanks, Ka Ming
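The effect described above (one representative kept per cluster, so alternative isoforms disappear) can be illustrated with a toy Python sketch. This is not CD-HIT's clustering algorithm: the cluster assignments are supplied by hand through a hypothetical `cluster_of` mapping, and whole-sequence length stands in for CDS length.

```python
def keep_longest_per_cluster(records, cluster_of):
    """Keep only the longest sequence from each cluster.

    records is an iterable of (header, sequence) tuples.
    cluster_of maps each header to a cluster label (hypothetical input;
    a real tool would compute the clusters from sequence similarity).
    On ties, the first record seen is kept.
    """
    best = {}
    for header, seq in records:
        cluster = cluster_of[header]
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (header, seq)
    return list(best.values())
```

If a gene has three assembled isoforms in one cluster, two of them are dropped here, which is fine for gene-level quantification but loses isoform-level information.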
Hi Ka Ming, I think that for my purpose it's okay to use the results of CD-HIT because what I want is to use the transcriptome as a reference to quantify the expression at gene-level rather than isoform-level. However, it's good to know the limitations of CD-HIT. To be honest I hadn't realized that, so I'm really thankful for your advice! I didn't know about EvidentialGene either and it really seems very interesting, I'm certainly going to take a look into it as it could be really useful in many other scenarios.
Thank you very much, Marta.
version of RNA-Bloom with `java -jar RNA-Bloom.jar -version`
version of java with `java -version`
exact command used to run RNA-Bloom
`root@0d73be8b2e5c:/datat# rnabloom -pool readslist.txt -revcomp-right -ntcard -mergepool -outdir output/`
Hi RNA-Bloom Team, I've made a first tiny test with your assembler and I have a few questions regarding the output:
`-mergepool` option to perform this analysis, or was it meant for other things? I don't really understand what this option is doing: is it reconstructing the transcriptome taking into account the reads coming from all cells, or is it only merging the transcriptomes obtained from each cell independently?
I hope that I explained my questions clearly! Marta.