bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads

About multi-tissue assembly #80

Open YIGUIz opened 2 months ago

YIGUIz commented 2 months ago

Hi, I want to assemble a final set of transcripts from multiple tissues, but I have a question. Should I first assemble transcripts for each sample and then merge the transcripts from the same tissue? Or should I merge all the FASTQ files from a tissue first and then perform the assembly?

Finally, is it correct to merge all transcripts from different tissues to get the final transcript assembly?

Thank you for your assistance!

Qi

kmnip commented 2 months ago

It depends on what you want to do with the transcripts.

To assemble multi-sample data, you can follow the instructions here: https://github.com/bcgsc/RNA-Bloom?tab=readme-ov-file#b-assemble-multi-sample-rna-seq-data-with-pooled-assembly-mode

If you don't really care about the tissue specificity of assembled transcripts, then you can simply pass all FASTQ files as input. You do not need to merge the files. For example:

java -jar RNA-Bloom.jar \
-left sampleA_1.fastq sampleB_1.fastq sampleC_1.fastq \
-right sampleA_2.fastq sampleB_2.fastq sampleC_2.fastq \
-revcomp-right -t THREADS -outdir OUTDIR
YIGUIz commented 2 months ago

Thank you very much.

My data is cDNA long-read (LR) RNA-seq and strand-specific paired-end short-read (SR) RNA-seq data, and I noticed that the software doesn't support multi-sample assembly for LR RNA-seq data. So, I considered two strategies:

1. Assemble by sample, then merge the GTF files from the same tissue.

2. Merge the long-read RNA-seq FASTQ files into a single large FASTQ file. I also have paired short-read RNA-seq data, for which I've merged all clean FASTQ files into two large FASTQ files (R1.fq and R2.fq). Then perform the assembly.

I'm not sure whether this approach will work.

I have another question about the SR RNA-seq data. It is strand-specific paired-end data (fr-firststrand). So my command is:

rnabloom -t 20 -ntcard -artifact -long ${LR_clean_fq} -sef ${SR_fq1} ${SR_fq2} -fpr 0.005 -indel 20 -p 0.75 -Q 15 -overlap 100 -length 150

I'm confused because -sef is described as the path to one single-end forward read file.

Thank you for your assistance again!

kmnip commented 2 months ago
  1. RNA-Bloom doesn't generate GTF files.

  2. You don't need to merge or concatenate read files for -long, -ser, and -sef. You can specify multiple file paths separated by space.

  3. If your long-read data is not direct RNA-seq or is not strand-specific, then you should not use the -strand option, because the strand of your short reads does not matter. So, you can specify both forward and reverse short-read files for -sef.

  4. I don't recommend using the -artifact option. You will end up with a lot of incorrect assemblies.
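
Putting points 2 to 4 together, a revised command might look like the sketch below. The sample file names and OUTDIR are placeholders that do not come from this thread; the flags are the ones already used above, with -artifact dropped. The `run` helper is a hypothetical convenience that only prints the command unless DRY_RUN=0 is set.

```shell
# Sketch only: file names and OUTDIR are placeholders (assumptions).
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Several files per option, separated by spaces; no concatenation needed.
# Both R1 and R2 short-read files go to -sef because -strand is not used.
# -artifact is left out, as recommended above.
run rnabloom -t 20 -ntcard \
    -long sampleA_LR.fq sampleB_LR.fq \
    -sef sampleA_R1.fq sampleA_R2.fq sampleB_R1.fq sampleB_R2.fq \
    -outdir OUTDIR
```

Run it once as is to check the printed command, then rerun with DRY_RUN=0 to start the assembly.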

YIGUIz commented 2 months ago

Thank you very much. Due to the sample size, I have to assemble transcripts by sample and then merge them. So, without GTF files, how can I generate the final set of transcripts (with redundant transcripts removed)? I also need to merge the transcripts from different tissues. Thank you for your help. I have just started this work, so I have a lot of questions.

kmnip commented 2 months ago

There could be much better ways to do this, but here is what I did in the past.

For each tissue:

  1. Align the assembled transcripts against the reference genome with minimap2 to generate a PAF.
    minimap2 -c -x splice reference_genome.fasta rnabloom.transcripts.fa | gzip -c > rnabloom.transcripts.paf.gz
  2. Generate a GTF from the PAF file with this script from RNA-Scoop: https://github.com/bcgsc/RNA-Scoop/blob/master/scripts/make_gtf.py
    python make_gtf.py rnabloom.transcripts.paf.gz rnabloom.transcripts.gtf

Merge the GTFs from all tissues with gffcompare: https://ccb.jhu.edu/software/stringtie/gffcompare.shtml
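
The steps above could be scripted roughly as follows. This is a sketch under assumptions that are not from this thread: each tissue has its own RNA-Bloom output directory named assembly_<tissue>/, make_gtf.py sits in the current directory, and minimap2 and gffcompare are on PATH. The function names and the "merged" output prefix are hypothetical.

```shell
# Sketch only. Assumed layout: assembly_<tissue>/rnabloom.transcripts.fa,
# make_gtf.py in the current directory, minimap2 and gffcompare on PATH.

# Align one tissue's transcripts to the genome and convert the PAF to a GTF.
# $1 = tissue name, $2 = reference genome FASTA
tissue_to_gtf() {
    tissue="$1"
    genome="$2"
    dir="assembly_${tissue}"
    minimap2 -c -x splice "$genome" "$dir/rnabloom.transcripts.fa" \
        | gzip -c > "$dir/rnabloom.transcripts.paf.gz"
    python make_gtf.py "$dir/rnabloom.transcripts.paf.gz" "$dir/rnabloom.transcripts.gtf"
}

# Build a GTF for every tissue, then merge them with gffcompare,
# which writes <prefix>.combined.gtf (here: merged.combined.gtf).
# $1 = reference genome FASTA, remaining arguments = tissue names
merge_tissues() {
    genome="$1"
    shift
    gtfs=""
    for tissue in "$@"; do
        tissue_to_gtf "$tissue" "$genome" || return 1
        gtfs="$gtfs assembly_${tissue}/rnabloom.transcripts.gtf"
    done
    # $gtfs is intentionally unquoted so it splits into one argument per file.
    gffcompare -o merged $gtfs
}

# Example call (tissue names are placeholders):
# merge_tissues reference_genome.fasta liver brain muscle
```

Because of the unquoted $gtfs, this sketch assumes tissue names without spaces.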

YIGUIz commented 2 months ago

Thank you very much😊. I'll try it.

YIGUIz commented 1 month ago

I'm sorry to ask again, but can this program handle 400 long-read and short-read RNA-seq datasets simultaneously by specifying multiple file paths separated by spaces? If not, how can I obtain a complete BAM file to ensure the program works? I would get a 15 TB BAM file when using samtools to merge them. Would that work?

Thank you!

kmnip commented 1 month ago

I don't understand what a "complete BAM" is. RNA-Bloom is primarily a reference-free assembly tool. It does not generate any BAM files against any reference.

Regarding too many input files, you can put the paths of the read files in a text file, one path per line. You can then specify the list file by prefixing its path with @.

Example:

List file for short reads short_read_files.txt:

/path/to/short_reads_01.fastq
/path/to/short_reads_02.fastq
/path/to/short_reads_03.fastq

List file for long reads long_read_files.txt:

/path/to/long_reads_01.fastq
/path/to/long_reads_02.fastq
/path/to/long_reads_03.fastq

Example command for the list files:

java -jar RNA-Bloom.jar \
-sef @/path/to/short_read_files.txt \
-long @/path/to/long_read_files.txt \
...
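
With hundreds of samples, the list files themselves can be generated rather than typed by hand. A minimal sketch, assuming the reads sit under one directory per read type and share a common file extension (the helper name, directories, and patterns are hypothetical):

```shell
# Sketch only: the directory layout and file name patterns are assumptions.

# Write the paths of all files under a directory that match a pattern,
# one path per line, sorted so the list is reproducible.
# $1 = directory, $2 = file name pattern, $3 = output list file
make_read_list() {
    find "$1" -type f -name "$2" | sort > "$3"
}

# Example calls (paths are placeholders):
# make_read_list /path/to/short_reads '*.fastq' short_read_files.txt
# make_read_list /path/to/long_reads  '*.fastq' long_read_files.txt
# java -jar RNA-Bloom.jar -sef @short_read_files.txt -long @long_read_files.txt ...
```

Passing an absolute directory to find yields absolute paths in the list, matching the examples above.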