Xinglab / rmats-turbo

Other
219 stars 53 forks source link

Issue:Should we use the mapped Bam or the deduplicated Bam file? #358

Open Squirrelity opened 8 months ago

Squirrelity commented 8 months ago

Thanks for your tool! I tested it and compared the Bam file with Star. The variable number of cuts found using rmat is: EventType EventTypeDescription TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SigEventsJCSample2HigherInclusion SignificantEventsJCEC SigEventsJCECSample1HigherInclusion SigEventsJCECSample2HigherInclusion SE skipped exon 686 757 323 176 147 360 197 163 A5SS alternative 5' splice sites 111 131 35 13 22 50 20 30 A3SS alternative 3' splice sites 161 175 50 27 23 54 29 25 MXE mutually exclusive exons 84 98 28 16 12 36 21 15 RI retained intron 264 377 89 43 46 148 68 80

If I use Samtool rmdup to duplicate the Bam file and use rmat to find the variable number of cuts, what is: EventType EventTypeDescription TotalEventsJC TotalEventsJCEC SignificantEventsJC SigEventsJCSample1HigherInclusion SigEventsJCSample2HigherInclusion SignificantEventsJCEC SigEventsJCECSample1HigherInclusion SigEventsJCECSample2HigherInclusion SE skipped exon 26551 27213 1562 917 645 1562 925 637 A5SS alternative 5' splice sites 2744 2820 173 105 68 179 110 69 A3SS alternative 3' splice sites 3867 3923 198 107 91 235 123 112 MXE mutually exclusive exons 2832 2919 123 65 58 140 72 68 RI retained intron 3783 3978 209 118 91 235 129 106

The difference in the number of variable cuts found is a bit large, so I would like to ask about the impact of removing duplicates on rmat and which one is more reliable. Thank you for your reply.

EricKutschera commented 8 months ago

I'm not familiar with rmdup, but I would expect it to reduce the number of reads that rMATS can use. However, in the summaries you posted the TotalEventsJC actually increased after running rmdup. I don't think rMATS should detect more events after removing reads. I'm not sure what happened, but it should be fine to run rMATS without using rmdup

Squirrelity commented 8 months ago

This is my first time using Samtools to deduplicate code: samtools markdup -r ${align_bam} ${output_bam}

This is my second using gatk to deduplicate code: gatk MarkDuplicates --REMOVE_DUPLICATES true -I ${id}Aligned.sortedByCoord.out.bam -O ${output_dir}/${i}_dedup.bam -M ${output_dir}/${i}_dedup.metrics.txt samtools view -bh -@ 10 -q 30 ${output_dir}/${i}_dedup.bam | samtools sort -@ 10 -o ${output_dir}/${i}_dedup_q30.bam samtools index -@ 15 ${output_dir}/${i}_dedup_q30.bam

The number of variable shear events obtained these two times is consistent.

This is my rmats code: for i in A2; do output=./${i}_A1/ tmp=./${i}_A1/tmp mkdir -p ${output} mkdir -p ${tmp} python3 rmats.py \ --b1 ./${i}_bam.txt \ --b2 ./A1_bam.txt \ --gtf/Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gtf \ -t single \ --readLength 150 \ --nthread 4 \ --od ${output} \ --tmp ${tmp}

done

I'm not sure how much the result of deduplication affects the result of variable cropping. Should I still use the compared Bam file instead of the deduplicated Bam file? I have reviewed your paper and found that the exit inclusion level ψ It is indeed related to the count count, so I am not sure how much the result of deduplication affects the detection of variable cropping. I should trust which result. Thank you for your answer.

sunta3iouxos commented 7 months ago

I just seen this post and I am almost certain that if your data are mRNA, then it is not appropriate to remove replicates. The reason is that the true seq kit and other similar protocols, in my knowledge, do not sonicate or do anything similar to the RNA/cDNA so you would expect to have multiple similar events in for example close to the TSS or begining of exons, etc. I know that there is a PCR bias, but if you have not included UMIs it is not advisable to remove the dublicates, since there is no way to figure the true dublicates reads from the PCR derived.