ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License

Questions about output files from ATAC-seq pipeline #299

Open Kyung-TaeLee opened 3 years ago

Kyung-TaeLee commented 3 years ago

Hi, first of all, thank you for providing such a wonderful tool. I ran the ATAC-seq pipeline on the data described below:

  1. control -> no replicate
  2. sample -> 2 biological replicates

The analysis finished successfully, and I have some questions about the output files it generated.

Q1. Which output file can be used to analyze differential promoter usage between the control and the sample? The control was run without replicates, and the sample was run with 2 biological replicates.

Q2. In the section "ATAC-seq Data Standards and Processing Pipeline" on the ENCODE website, the Current Standards section states: "The number of peaks within an IDR peak file should be >70,000, though values >50,000 may be acceptable". Can you explain what an "IDR peak file" is? Is this number related to the numbers given for "N optimal" or "N conservative" in the "Reproducibility QC and peak detection statistics" table? If not, can you please explain what the "N optimal" and "N conservative" numbers in that table mean? (table below)

[Screenshot: "Reproducibility QC and peak detection statistics" table]

Thank you, and I look forward to your reply.

leepc12 commented 3 years ago

Sorry for the late response.

Q1. How did you run the pipeline for the control? Unlike our ChIP-seq pipeline, the ATAC-seq pipeline does not support controls.

Q2. The pipeline calls peaks (with MACS2) on each replicate, and then IDR analysis is run on every pair of MACS2 peak files (e.g. rep1.narrowPeak.gz vs. rep2.narrowPeak.gz). This is also done for pooled replicates. Among the resulting IDR peak sets, the best one is chosen according to different criteria (optimal/conservative).
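To make "every pair" concrete, here is a minimal Python sketch that lists the per-replicate peak files that would be compared. It is an illustration only, not the pipeline's own code: the function name `idr_pairs` is hypothetical, the file naming simply follows the `rep1.narrowPeak.gz` pattern above, and the pooled comparison is not shown.

```python
from itertools import combinations


def idr_pairs(replicates):
    """Return every unordered pair of per-replicate peak files.

    Hypothetical helper: the names follow the repN.narrowPeak.gz
    pattern used in this thread, not an actual pipeline API.
    """
    files = [f"{rep}.narrowPeak.gz" for rep in replicates]
    # combinations() yields each pair exactly once, in input order.
    return list(combinations(files, 2))


for left, right in idr_pairs(["rep1", "rep2", "rep3"]):
    print(f"{left} vs {right}")
```

With two replicates there is a single comparison (rep1 vs. rep2); with three replicates there are three.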

For an unreplicated experiment, peaks are called on each pseudo-replicate (the original reads are randomly shuffled and split into 2 pseudo-replicates), and then IDR analysis is run on the two resulting peak files (rep1-pr1.narrowPeak.gz vs. rep1-pr2.narrowPeak.gz). In that case Nt and Np are always zero and N1 is the final IDR peak set, since there is only one IDR peak set for an unreplicated experiment.
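The shuffle-and-split step can be sketched roughly as follows. This is a simplified illustration, assuming the reads fit in memory as a Python sequence; the function name `split_pseudoreplicates` and the `seed` parameter are hypothetical, and the real pipeline works on alignment files rather than Python lists.

```python
import random


def split_pseudoreplicates(reads, seed=0):
    """Randomly shuffle reads and split them into two pseudo-replicates.

    Hypothetical helper for illustration; `seed` only makes the
    shuffle reproducible.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    # Every read lands in exactly one of the two halves.
    return shuffled[:half], shuffled[half:]


pr1, pr2 = split_pseudoreplicates([f"read{i}" for i in range(10)])
print(len(pr1), len(pr2))
```

Peaks would then be called separately on each half, which gives the two peak files that IDR compares.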