bcgsc / RNA-Bloom

:hibiscus: reference-free transcriptome assembly for short and long reads
Other
96 stars 7 forks source link

Can output be used for DE analysis? #67

Closed kzudock closed 10 months ago

kzudock commented 10 months ago

Hi RNA-Bloom Team,

I have a bunch of ONT cDNA data from a time series experiment which I ran through RNA-Bloom using the long read assembly with short read polishing. The assembled transcripts look solid, but I was hoping to use the output for downstream gene identification/pathway analysis as well as basic differential expression analysis between the different time points in my time series. Alas, I can't find transcript counts anywhere--is it not possible to use the RNA-Bloom output for differential expression analysis? Apologies if this is obvious and I just totally missed something.

Thanks for your help!

kmnip commented 10 months ago

Hello @kzudock ,

RNA-Bloom does not perform differential expression analysis, but I think it can be done with some work. Here is my proposed workflow:

Step 1: Generate a combined transcript set

You need a single set of transcripts for your entire time-series experiment. This can be achieved by either:

  1. Assemble reads from all time points together. This is the easiest approach, but it will require quite a bit of RAM and longer runtime depending on how many reads you have in total.
  2. Assemble reads from each time point separately, then merge all assembled transcripts together and remove redundant transcripts. I am not aware of an out-of-the-box tool that does this for long-read assemblies. Dedupe works for short-read assemblies, but I haven't tested it for long-read assemblies.

Step 2: Align reads to transcript set

For each time point, align the corresponding reads to the transcript set with minimap2, e.g.

minimap2 -x map-ont -c rnabloom.transcripts.fa time_1_reads.fastq -t 8 -N 100 | gzip -c > time_1_alignments.paf.gz
minimap2 -x map-ont -c rnabloom.transcripts.fa time_2_reads.fastq -t 8 -N 100 | gzip -c > time_2_alignments.paf.gz
minimap2 -x map-ont -c rnabloom.transcripts.fa time_3_reads.fastq -t 8 -N 100 | gzip -c > time_3_alignments.paf.gz

Step 3: Generate transcript expression levels

For each time point, use NanoSim's nanopore_transcript_abundance.py script to estimate transcript expressions for each time point:

python nanopore_transcript_abundance.py -i time_1_alignments.paf.gz > time_1_expression.tsv
python nanopore_transcript_abundance.py -i time_2_alignments.paf.gz > time_2_expression.tsv
python nanopore_transcript_abundance.py -i time_3_alignments.paf.gz > time_3_expression.tsv

Now, you have the transcript expressions for each time point, and you can perform differential expression analysis (in R).

kzudock commented 10 months ago

This is incredibly helpful--thank you so much!!!