cfce / chilin

ChIP-seq DC and QC Pipeline
Other
34 stars 12 forks source link

Can't repeat the result of Cistrome Data Browser #10

Closed QianzhaoJ closed 5 years ago

QianzhaoJ commented 5 years ago

Thanks for developing Cistrome Data Browser , it is a very useful tool.

I'm found SIRT3 ChIP-seq in Homo sapiens using this Browser. And I'm very interesting in this protein. I got BED peaks from your Browser, meanwhile I downloaded the raw data from GSM1006653 and processed using my own pipeline(bowtie2+samtools+MACS2),but I found I can't repeat the result of Cistrome Data Browser (the browser's BED peaks contain 100 peaks but mine contain 3 peaks). Then I found Cistrome Data Browser using your chilin pipeline (bwa+samtools+MACS2), and I used similar options with source code , but I failed to repeat the result too(my peak file contain 10 peaks).

Here are the parameters I used , Could you give me some suggestions? Thanks in advance !

Mapping&&Sam2bam bwa aln -q 5 -l 32 -k 2 -t 20 $index $clean1 > ${sample}.sai bwa samse $index ${sample}.sai $clean1 >${sample}.sam awk '$3!="chrM" && $3!="chrY"' $sam | samtools view -@ 20 -bS -q 1 > $unique_bam samtools sort -@ 20 -l 9 $unique_bam -o $srt_bam Calling peak macs2 callpeak --SPMR -t $sample_t -c $sample_c -f BAM -g hs -B -q 0.01 -n $sample --outdir $out

And before these steps , I used FastQC and trim_galore with default setting

cfce commented 5 years ago

GSM3394755 is not SIRT3 ChIP-seq. If you search Cistrome DB for SIRT3 ChIP-seq, there are 4 samples all from Iwahara T, et al. SIRT3 functions in the nucleus in the control of stress-related gene expression. Mol. Cell. Biol. 2012 and they all have ~100 peaks.

GSM3394755 is LMNB1 ChIP-seq from GEO with an unpublished study (GSE109924, although none of the 34 samples under this GSE are related to SIRT3). LMNB1 has broad enrichment in heterochromatin, so its QC metric will not meet the standard TF ChIP-seq data.

Thanks!

X. Shirley Liu Department of Data Science Center for Functional Cancer Epigenetics Dana-Farber Cancer Institute and Harvard University CLS11022, 3 Blackfan Circle, Boston 02115 300a Science Center, 1 Oxford Street, Cambridge, 02138

On Wed, Jul 3, 2019 at 7:44 AM QianzhaoJ notifications@github.com wrote:

    External Email - Use Caution

Thanks for developing Cistrome Data Browser , it is a very useful tool. I'm found SIRT3 ChIP-seq in Homo sapiens using this Browser. And I'm very interesting in this protein. I got BED peaks from your Browser, meanwhile I downloaded the raw data from SRA and processed using my own pipeline(bowtie2+samtools+MACS2),but I found I can't repeat the result of Cistrome Data Browser (the browser's BED peaks contain 100 peaks but mine contain 3 peaks). Then I found Cistrome Data Browser using your chilin pipeline (bwa+samtools+MACS2), and I used similar options with source code , but I failed to repeat the result too(my peak file contain 10 peaks). Here are the parameters I used , Could you give me some suggestions? Thanks in advance !

Mapping&&Sam2bam bwa aln -q 5 -l 32 -k 2 -t 20 $index $clean1 > ${sample}.sai bwa samse $index ${sample}.sai $clean1 >${sample}.sam awk '$3!="chrM" && $3!="chrY"' $sam | samtools view -@ 20 -bS -q 1 > $unique_bam samtools sort -@ 20 -l 9 $unique_bam -o $srt_bam Calling peak macs2 callpeak --SPMR -t $sample_t -c $sample_c -f BAM -g hs -B -q 0.01 -n $sample --outdir $out

And before these steps , I used FastQC and trim_galore with default setting

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cfce/chilin/issues/10?email_source=notifications&email_token=AB6LHCUS6EAYKMYGMGU7BMLP5SGJPA5CNFSM4H5E5KBKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G5DSN3A, or mute the thread https://github.com/notifications/unsubscribe-auth/AB6LHCSUYBTIS3Y5HRY54KDP5SGJPANCNFSM4H5E5KBA .

QianzhaoJ commented 5 years ago

Thanks for your prompt reply ! I'm sorry I typed the wrong GEO numbers, it should be GSM1006653, also be the article you mentioned. About GSM1006653 , I used the scripts mentioned above but I failed to repeat the result of Cistrome Data Browser. Could you give me some suggestions on these scripts? Thanks a lot !

qinqian commented 5 years ago

Thanks for the feedback. We have fully reproduced the results for GSM1006653. There are some details to clarify: first, we did not use input or control samples for calling peaks, and we analyze each of the biological replicates independently. Please follow the instructions below to reproduce the peaks downloaded from our website.

The implementation details include the usage of the ensemble (or ncbi) primary assembly fasta and build the genome index, the software version we used (bwa 0.7.10, macs2 2.1.0.20140616), software parameters (samtools mapping quality 1, macs2 should remove duplicated reads).

wget -c ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna//Homo_sapiens.GRCh38.dna.toplevel.fa.gz
   zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | sed -e "s/>/>chr/g" > Homo_sapiens.GRCh38.dna.toplevel.fa 
   bwa index -a bwtsw Homo_sapiens.GRCh38.dna.primary_assembly.fa

bwa aln -q 5 -l 32 -k 2 -t 24 Homo_sapiens.GRCh38.dna.primary_assembly.fa test3_treat_rep1.fastq > test3_treat_rep1_all.sai

bwa samse Homo_sapiens.GRCh38.dna.primary_assembly.fa test3_treat_rep1_all.sai test3_treat_rep1.fastq > test3_treat_rep1.sam

samtools view -bS -q 1 test3_treat_rep1.sam > test3_treat_rep1.tmp.bam 
samtools sort -m 4000000000 test3_treat_rep1.tmp.bam test3_treat_rep1
macs2 callpeak --SPMR -t test3_treat_rep1.bam -f BAM -g hs -B -q 0.01  --keep-dup 1 -n test --outdir output

The overall QC of this dataset is bad, use it with caution. The current version of ChiLin recommends using modeled extension size for calling narrow peaks, we used a fixed extension size for both narrow and broad peaks for large-scale dataset collections.