FredHutch / SEACR

SEACR: Sparse Enrichment Analysis for CUT&RUN
GNU General Public License v2.0
104 stars 45 forks source link

Compare my CUT&TAG data between macs2 and SEACR #35

Open leafyeszjl opened 4 years ago

leafyeszjl commented 4 years ago
Hi, I tried running SEACR and Macs2 on my CUT&Tag data using different parameters。The following table 1 were the results that SEACR with or not with IgG sample S42, norm or non parameters。I am confused that why the peaks number were increasing a lot when using IgG sample as control ? The following table 2 were the results that using Macs2 with the commond 'macs2 callpeak -t S11.bam -g hs -f BAMPE -n S11 -q 0.05 --keep-dup=all -B --SPMR' . It is obvious that when using IgG sample as control and 'norm' option ,the peaks number were much more than that of Macs2 . Just as the article described SEACR could avoid false-positive peaks, I am wondering why the peaks number were much more that Macs2. Can you help me solve my doubts? (1) Table 1 : SEACR test results Sample Stringent_IgG0_non Stringent_IgG0_norm Stringent_IgGS42_non Stringent_IgGS42_norm
S11 12507 12507 9426 146194
S38 14257 14257 133433 136190
S39 11848 11848 138728 129764
S42 5 34    
S5 12058 12058 125419 127042
S6 13634 13634 115781 113710

(2) Table 2:Macs2 test results

Sample Macs2
S11 674
S38 21800
S39 34541
S42 0
S5 48738
S6 54388
mpmeers commented 4 years ago

Hi,

To what read depth were the two IgG control samples sequenced? The fact that the S42 experimental sample yields very few peaks suggests there may be an issue with the S42 sample itself, and I have found that there can be issues with SEACR when the control sample read depth is much lower than the target sample read depth, for instance.

Mike

leafyeszjl commented 4 years ago

Hi Mike,

Thank you very much. The reads pair in the S42 sample was indeed very low. Do you know what read depth or read pairs are appropriate for peak calling ?

Sample S11 S38 S39 S42 S5 S6
Clean reads pair 2521993 197705953 7745098 4360 8416998 6707812
Mapped reads pair 2367278 184952503 7042634 3995 8032654 6505496
Mapping rate(%) 93.87 93.55 90.93 91.63 95.43 96.98
Duplicate reads 929968 244465240 2043174 292 1469808 1289332
Duplicate rate(%) 19.64 66.09 14.51 3.65 9.15 9.91

Jialin

leafyeszjl commented 4 years ago

Hi Mike, Do you think it is necessary to cut the same bases or reads to call peak for sample comparing? Jialin

leafyeszjl commented 4 years ago

Hi Mike, I am sorry for troubling you so many times. I have changed my CUT&Tag data for this analysis. In this experiment, L9 was the IgG sample, and L1\L2\L3\L4 sample were just different in cells input (100,1k,1w and 10 w respectively). Other experimental conditions are the same. The alignment results was as table 1 , the peaks number calling by SEACR was as table 2 and the TSS heatmap was showing as Fig1. I am surprised that peaks number of L3 was so high. Why? Thank you for your reply. (1)Table 1: Alignment results

Sample L1 L2 L3 L4 L9
Clean reads pair 23352501 14016997 8588937 12006095 7104387
Mapped reads pair 21165340 12074669 7180872 11589712 6476417
Mapping rate(%) 90.63 86.14 83.61 96.53 91.16
Duplicate reads 40234544 22294654 10839854 6596992 11982356
Duplicate rate(%) 95.05 92.32 75.48 28.46 92.51
chrMT reads 570956 401178 397724 431596 1452286
chrMT rate(%) 1.35 1.66 2.77 1.86 11.21
(2)Table 2: Peaks number calling by SEACR, with the command SEACR_1.3.sh L4_fragments.bedgraph L9_fragments.bedgraph norm stringent Sample peaks_number
L1 4167
L2 4255
L3 255226
L4 1698

(3) Fig1:TSS heatmap 图片1

jialin

mpmeers commented 4 years ago

Hi Jialin,

It's hard for me to know exactly why it may have called so many peaks without having a look at the data myself, but in general SEACR gives the best results when the IgG is from the same conditions as the target experiment (same cell type, cell number, similar read depth, etc.), since the IgG is meant to provide an estimate of the natural background under the same conditions in which the experiment was conducted. This is relevant to your first example with S42, since that sample (and I presume the S42 IgG you used) was dramatically undersequenced. As for the second example, the extremely high duplicate rate across most samples makes it difficult to know whether your issue is with peak calling or with underlying data quality. I'm a little confused by your numbers too since you're reporting duplicate reads, but the number of duplicate paired end fragments (i.e. where both pairs of reads are identical) is the information that's more relevant to SEACR's function. It may be worth trying to filter out duplicates fragments (not duplicate individual reads) and rerunning it, but it's tough for me to recommend anything else without seeing a sample of the data.

Mike