bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

About multi-tissue assembly #80

Open YIGUIz opened 1 week ago

YIGUIz commented 1 week ago

Hi, I want to assemble a final set of transcripts from multiple tissues, but I have a question. Should I first assemble transcripts for each sample and then merge the transcripts from the same tissue? Or should I merge all the FASTQ files from a tissue first and then perform the assembly?

Finally, is it correct to merge all transcripts from different tissues to get the final transcript assembly?

Thank you for your assistance!

Qi

kmnip commented 1 week ago

It depends on what you want to do with the transcripts.

To assemble multi-sample data, you can follow the instructions here: https://github.com/bcgsc/RNA-Bloom?tab=readme-ov-file#b-assemble-multi-sample-rna-seq-data-with-pooled-assembly-mode

If you don't really care about the tissue specificity of assembled transcripts, then you can simply pass all FASTQ files as input. You do not need to merge the files. For example:

java -jar RNA-Bloom.jar \
-left sampleA_1.fastq sampleB_1.fastq sampleC_1.fastq \
-right sampleA_2.fastq sampleB_2.fastq sampleC_2.fastq \
-revcomp-right -t THREADS -outdir OUTDIR
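
With many samples, typing every file name by hand is error-prone. A minimal sketch that builds the same command from a list of sample prefixes is shown here. The prefixes and the `_1.fastq`/`_2.fastq` naming are assumptions taken from the example above, and the script only prints the command so it can be checked before running.

```shell
# Hypothetical sample prefixes; replace with your own.
samples="sampleA sampleB sampleC"

left=""
right=""
for s in $samples; do
  # Keep left and right files in the same sample order.
  left="$left ${s}_1.fastq"
  right="$right ${s}_2.fastq"
done

# THREADS and OUTDIR are placeholders, as in the example above.
cmd="java -jar RNA-Bloom.jar -left$left -right$right -revcomp-right -t THREADS -outdir OUTDIR"

# Dry run: print the command for review instead of executing it.
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the terminal or run with `eval "$cmd"`.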

YIGUIz commented 1 week ago

Thank you very much.

My data consist of cDNA long-read (LR) RNA-seq and strand-specific paired-end short-read (SR) RNA-seq. I noticed that the software doesn't support multi-sample assembly for LR RNA-seq data, so I am considering two strategies:

1. Assemble by sample, then merge the GTF files from the same tissue.

2. Merge the long-read FASTQ files into a single large FASTQ file. I also have paired-end short-read RNA-seq data, and I have merged all of the clean FASTQ files into two large files (R1.fq and R2.fq). Then perform the assembly on the merged files.

I'm not sure whether this approach will work.

I have another question about the SR RNA-seq data. It is strand-specific paired-end data (fr-firststrand), so my command is:

rnabloom -t 20 -ntcard -artifact -long ${LR_clean_fq} -sef ${SR_fq1} ${SR_fq2} -fpr 0.005 -indel 20 -p 0.75 -Q 15 -overlap 100 -length 150

I'm confused because -sef is described as the path to one single-end forward read file.

Thank you for your assistance again!

kmnip commented 1 week ago

  1. RNA-Bloom doesn't generate GTF files.

  2. You don't need to merge or concatenate read files for -long, -ser, and -sef. You can specify multiple file paths separated by spaces.

  3. If your long-read data is not direct RNA-seq or not strand specific, then you should not use the -strand option because the strand of your short reads does not matter. So, you can specify both forward and reverse short read files for -sef.

  4. I don't recommend using the -artifact option. You will end up with a lot of incorrect assemblies.
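
Applying points 2 to 4 to the command from the question might look like the sketch below. The file names are hypothetical, only flags already mentioned in this thread are used, and the remaining tuning options from the original command are left out for brevity. The script prints the command rather than running it.

```shell
# Hypothetical per-sample read files; replace with your own paths.
long_reads="LR_sample1.fq LR_sample2.fq"
short_reads="SR_sample1_R1.fq SR_sample1_R2.fq SR_sample2_R1.fq SR_sample2_R2.fq"

# Multiple files are passed space-separated, so no concatenation is needed.
# -artifact is omitted, and R1 and R2 both go under -sef because the
# cDNA long reads are not strand specific.
cmd="rnabloom -t 20 -ntcard -long $long_reads -sef $short_reads"

# Dry run: print the command for review instead of executing it.
echo "$cmd"
```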

YIGUIz commented 6 days ago

Thank you very much. Due to the sample size, I have to assemble transcripts by sample and then merge them. Without GTF files, how can I generate the final set of transcripts (that is, remove redundant transcripts)? I also need to merge the transcripts from different tissues. Thank you for your help. I have just started this work, so I have a lot of questions.

kmnip commented 5 days ago

There could be much better ways to do this, but here is what I did in the past.

For each tissue:

  1. Align the assembled transcripts against the reference genome with minimap2 to generate a PAF.
    minimap2 -c -x splice reference_genome.fasta rnabloom.transcripts.fa | gzip -c > rnabloom.transcripts.paf.gz
  2. Generate a GTF from the PAF file with this script from RNA-Scoop: https://github.com/bcgsc/RNA-Scoop/blob/master/scripts/make_gtf.py
    python make_gtf.py rnabloom.transcripts.paf.gz rnabloom.transcripts.gtf

Merge the GTFs from all tissues with gffcompare: https://ccb.jhu.edu/software/stringtie/gffcompare.shtml
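
The steps above can be sketched as a loop over tissues. The tissue names, the one-directory-per-tissue layout, and the `merged` output prefix are assumptions for illustration. The script prints the commands so they can be reviewed first; it does not run minimap2, make_gtf.py, or gffcompare itself.

```shell
# Hypothetical tissue directories, each holding rnabloom.transcripts.fa.
tissues="liver brain heart"
genome="reference_genome.fasta"

gtfs=""
for t in $tissues; do
  fa="$t/rnabloom.transcripts.fa"
  paf="$t/rnabloom.transcripts.paf.gz"
  gtf="$t/rnabloom.transcripts.gtf"
  # Step 1: spliced alignment of assembled transcripts to the genome.
  echo "minimap2 -c -x splice $genome $fa | gzip -c > $paf"
  # Step 2: convert the PAF to a GTF with the RNA-Scoop script.
  echo "python make_gtf.py $paf $gtf"
  gtfs="$gtfs $gtf"
done

# Final step: merge the per-tissue GTFs; "merged" is an assumed prefix.
merge_cmd="gffcompare -o merged$gtfs"
echo "$merge_cmd"
```

When gffcompare is given several input GTFs with `-o merged`, the combined, non-redundant transcript set should be written to `merged.combined.gtf`.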

YIGUIz commented 5 days ago

Thank you very much😊. I'll try it.