m-kouhsar commented 2 years ago

Hi, I have two RNA-seq datasets from brain samples, as follows:

A: mRNA Sequencing data (polyA-RNAseq), Paired-end
B: Total RNA Sequencing data (Ribosomal depletion), Paired-end

I tried to quantify circRNAs in both of them with DCC using this command:

DCC @samplesheet \
 -mt1 @mate1 \
 -mt2 @mate2 \
 -T 30 \
 -D \
 -R /mnt/data1/Morteza/RNA-Seq/DCC/combine_repeat.gtf \
 -an /mnt/data1/Morteza/RNA-Seq/references/genome_anno/gencode.v38.primary_assembly.annotation.gff3 \
 -Pi \
 -F \
 -M \
 -Nr 1 1 \
 -fg \
 -G \
 -A /mnt/data1/Morteza/RNA-Seq/references/genome_fasta/GRCh38.primary_assembly.genome.fa \
 -B @bam_files
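
For readers reproducing the command above: the `@samplesheet`, `@mate1`, `@mate2` and `@bam_files` arguments are plain text files listing one path per line, in the same sample order. The sketch below shows one possible way to generate them. The directory layout (`<root>/<sample>/both/`, `<root>/<sample>/mate1/`, `<root>/<sample>/mate2/`) and the function name `write_dcc_lists` are assumptions made for illustration and are not part of the original post; only the STAR output file names `Chimeric.out.junction` and `Aligned.sortedByCoord.out.bam` are standard.

```shell
#!/usr/bin/env bash
# Sketch: write the four list files that DCC reads via its @file syntax.
# ASSUMPTION (hypothetical layout): STAR was run with prefixes
#   <root>/<sample>/both/  <root>/<sample>/mate1/  <root>/<sample>/mate2/
write_dcc_lists() {
  local root=$1 out=$2 d

  # Start from empty list files.
  : > "$out/samplesheet"
  : > "$out/mate1"
  : > "$out/mate2"
  : > "$out/bam_files"

  # Globs expand in sorted order, so every list uses the same sample order.
  for d in "$root"/*/; do
    [ -d "$d" ] || continue   # skip the literal pattern when nothing matches
    d=${d%/}
    printf '%s\n' "$d/both/Chimeric.out.junction"         >> "$out/samplesheet"
    printf '%s\n' "$d/mate1/Chimeric.out.junction"        >> "$out/mate1"
    printf '%s\n' "$d/mate2/Chimeric.out.junction"        >> "$out/mate2"
    printf '%s\n' "$d/both/Aligned.sortedByCoord.out.bam" >> "$out/bam_files"
  done
}

# Example call (paths are placeholders):
#   write_dcc_lists /path/to/star_output .
```

Keeping the sample order identical across the four files matters because DCC matches the joint and per-mate junction files by position in each list.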

Before that, I ran the STAR alignment on both mates jointly and on each mate separately (as in the tutorial) with these commands:

both mates:

STAR --runThreadN 16 \
 --genomeDir $genome_index_dir \
 --outSAMtype BAM SortedByCoordinate \
 --outFileNamePrefix $out_dir_both \
 --readFilesIn $R1 $R2 \
 --readFilesCommand zcat \
 --outReadsUnmapped Fastx \
 --outSJfilterOverhangMin 15 15 15 15 \
 --alignSJoverhangMin 15 \
 --alignSJDBoverhangMin 15 \
 --seedSearchStartLmax 30 \
 --outFilterMultimapNmax 20 \
 --outFilterScoreMin 1 \
 --outFilterMatchNmin 1 \
 --outFilterMismatchNmax 2 \
 --chimSegmentMin 15 \
 --chimScoreMin 15 \
 --chimScoreSeparation 10 \
 --chimJunctionOverhangMin 15 \
 --genomeLoad LoadAndKeep \
 --limitBAMsortRAM 50000000000 \
 --outTmpDir $temp_dir

mate 1

STAR --runThreadN 16 \
 --genomeDir $genome_index_dir \
 --outSAMtype None \
 --outFileNamePrefix $out_dir_mate1 \
 --readFilesIn $R1 \
 --readFilesCommand zcat \
 --outReadsUnmapped Fastx \
 --outSJfilterOverhangMin 15 15 15 15 \
 --alignSJoverhangMin 15 \
 --alignSJDBoverhangMin 15 \
 --seedSearchStartLmax 30 \
 --outFilterMultimapNmax 20 \
 --outFilterScoreMin 1 \
 --outFilterMatchNmin 1 \
 --outFilterMismatchNmax 2 \
 --chimSegmentMin 15 \
 --chimScoreMin 15 \
 --chimScoreSeparation 10 \
 --chimJunctionOverhangMin 15 \
 --genomeLoad LoadAndKeep \
 --limitBAMsortRAM 50000000000 \
 --outTmpDir $temp_dir

mate 2

STAR --runThreadN 16 \
 --genomeDir $genome_index_dir \
 --outSAMtype None \
 --outFileNamePrefix $out_dir_mate2 \
 --readFilesIn $R2 \
 --readFilesCommand zcat \
 --outReadsUnmapped Fastx \
 --outSJfilterOverhangMin 15 15 15 15 \
 --alignSJoverhangMin 15 \
 --alignSJDBoverhangMin 15 \
 --seedSearchStartLmax 30 \
 --outFilterMultimapNmax 20 \
 --outFilterScoreMin 1 \
 --outFilterMatchNmin 1 \
 --outFilterMismatchNmax 2 \
 --chimSegmentMin 15 \
 --chimScoreMin 15 \
 --chimScoreSeparation 10 \
 --chimJunctionOverhangMin 15 \
 --genomeLoad LoadAndKeep \
 --limitBAMsortRAM 50000000000 \
 --outTmpDir $temp_dir

My main problem is that the number of circRNAs in dataset A is very low compared with dataset B: the average number of circRNAs per sample is 357 for dataset A and 6373 for dataset B. What is the reason for this huge difference between the two datasets? Does this mean we should use total RNA-seq data to quantify circRNAs, and that we cannot detect a high number of circRNAs in poly-A data?

Best wishes, Morteza

tjakobi commented 2 years ago

Hi @m-kouhsar,

poly-A RNA-seq data is usually more or less depleted of circRNAs, since circRNAs do not have a poly-A tail.

Some circRNAs have internal poly-A sites that allow them to be captured, but this should be a very rare event.

Thus, the outcome of your two DCC runs is expected: total RNA-seq data should be used for circRNA detection, and poly-A data is not recommended.
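
If you want to reproduce the per-sample numbers for each dataset, a minimal sketch along these lines may help. It assumes the DCC count table (`CircRNACount`) is tab-separated with a header row, a few leading coordinate columns, and then one count column per sample. The number of coordinate columns differs between descriptions of the format, so it is passed in explicitly and should be checked against the header of your own file. The function name `count_detected` is made up for this example, and "detected" is taken here to mean a count greater than zero.

```shell
#!/usr/bin/env bash
# Sketch: count circRNAs with at least one supporting read, per sample.
# ASSUMPTION: tab-separated table, one header line, the first <ncoord>
# columns are coordinates and every remaining column is one sample.
count_detected() {
  local table=$1 ncoord=$2
  awk -F'\t' -v n="$ncoord" '
    NR == 1 {                       # header: remember the sample names
      for (i = n + 1; i <= NF; i++) name[i] = $i
      cols = NF
      next
    }
    {                               # data row: tally non-zero counts
      for (i = n + 1; i <= NF; i++) if ($i + 0 > 0) hits[i]++
    }
    END {                           # print "sample<TAB>number of circRNAs"
      for (i = n + 1; i <= cols; i++) printf "%s\t%d\n", name[i], hits[i] + 0
    }
  ' "$table"
}

# Example call (4 is an assumed number of coordinate columns):
#   count_detected CircRNACount 4
```

Averaging the second output column over the samples of each dataset gives a figure comparable to the per-sample averages quoted in the question.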

Cheers,

Tobias

m-kouhsar commented 2 years ago

Thank you very much for your reply, Tobias.

dieterich-lab / DCC

Low number of circRNA in predicted in mRNA Seq data #100
