Compare my CUT&TAG data between macs2 and SEACR

leafyeszjl commented 4 years ago

Hi, I tried running SEACR and Macs2 on my CUT&Tag data using different parameters。The following table 1 were the results that SEACR with or not with IgG sample S42, norm or non parameters。I am confused that why the peaks number were increasing a lot when using IgG sample as control ? The following table 2 were the results that using Macs2 with the commond 'macs2 callpeak -t S11.bam -g hs -f BAMPE -n S11 -q 0.05 --keep-dup=all -B --SPMR' . It is obvious that when using IgG sample as control and 'norm' option ,the peaks number were much more than that of Macs2 . Just as the article described SEACR could avoid false-positive peaks, I am wondering why the peaks number were much more that Macs2. Can you help me solve my doubts？ (1) Table 1 : SEACR test results Sample	Stringent_IgG0_non	Stringent_IgG0_norm	Stringent_IgGS42_non	Stringent_IgGS42_norm
S11	12507	12507	9426	146194
S38	14257	14257	133433	136190
S39	11848	11848	138728	129764
S42	5	34
S5	12058	12058	125419	127042
S6	13634	13634	115781	113710

(2) Table 2：Macs2 test results

Sample	Macs2
S11	674
S38	21800
S39	34541
S42	0
S5	48738
S6	54388

mpmeers commented 4 years ago

Hi,

To what read depth were the two IgG control samples sequenced? The fact that the S42 experimental sample yields very few peaks suggests there may be an issue with the S42 sample itself, and I have found that there can be issues with SEACR when the control sample read depth is much lower than the target sample read depth, for instance.

Mike

leafyeszjl commented 4 years ago

Hi Mike,

Thank you very much. The reads pair in the S42 sample was indeed very low. Do you know what read depth or read pairs are appropriate for peak calling ?

Sample	S11	S38	S39	S42	S5	S6
Clean reads pair	2521993	197705953	7745098	4360	8416998	6707812
Mapped reads pair	2367278	184952503	7042634	3995	8032654	6505496
Mapping rate(%)	93.87	93.55	90.93	91.63	95.43	96.98
Duplicate reads	929968	244465240	2043174	292	1469808	1289332
Duplicate rate(%)	19.64	66.09	14.51	3.65	9.15	9.91

Jialin

leafyeszjl commented 4 years ago

Hi Mike, Do you think it is necessary to cut the same bases or reads to call peak for sample comparing? Jialin

leafyeszjl commented 4 years ago

Hi Mike, I am sorry for troubling you so many times. I have changed my CUT&Tag data for this analysis. In this experiment, L9 was the IgG sample, and L1\L2\L3\L4 sample were just different in cells input (100，1k,1w and 10 w respectively). Other experimental conditions are the same. The alignment results was as table 1 , the peaks number calling by SEACR was as table 2 and the TSS heatmap was showing as Fig1. I am surprised that peaks number of L3 was so high. Why? Thank you for your reply. (1)Table 1: Alignment results

Sample	L1	L2	L3	L4	L9
Clean reads pair	23352501	14016997	8588937	12006095	7104387
Mapped reads pair	21165340	12074669	7180872	11589712	6476417
Mapping rate(%)	90.63	86.14	83.61	96.53	91.16
Duplicate reads	40234544	22294654	10839854	6596992	11982356
Duplicate rate(%)	95.05	92.32	75.48	28.46	92.51
chrMT reads	570956	401178	397724	431596	1452286
chrMT rate(%)	1.35	1.66	2.77	1.86	11.21

(2)Table 2: Peaks number calling by SEACR, with the command `SEACR_1.3.sh L4_fragments.bedgraph L9_fragments.bedgraph norm stringent` Sample	peaks_number
L1	4167
L2	4255
L3	255226
L4	1698

(3) Fig1:TSS heatmap

jialin

mpmeers commented 4 years ago

Hi Jialin,

It's hard for me to know exactly why it may have called so many peaks without having a look at the data myself, but in general SEACR gives the best results when the IgG is from the same conditions as the target experiment (same cell type, cell number, similar read depth, etc.), since the IgG is meant to provide an estimate of the natural background under the same conditions in which the experiment was conducted. This is relevant to your first example with S42, since that sample (and I presume the S42 IgG you used) was dramatically undersequenced. As for the second example, the extremely high duplicate rate across most samples makes it difficult to know whether your issue is with peak calling or with underlying data quality. I'm a little confused by your numbers too since you're reporting duplicate reads, but the number of duplicate paired end fragments (i.e. where both pairs of reads are identical) is the information that's more relevant to SEACR's function. It may be worth trying to filter out duplicates fragments (not duplicate individual reads) and rerunning it, but it's tough for me to recommend anything else without seeing a sample of the data.

Mike

FredHutch / SEACR

Compare my CUT&TAG data between macs2 and SEACR #35