ENCODE-DCC / chip-seq-pipeline2

ENCODE ChIP-seq pipeline
MIT License
241 stars 123 forks source link

Changing MACS2 output from earlier pipeline version #291

Open clarabakker opened 1 year ago

clarabakker commented 1 year ago

I am currently updating an adaptation of this pipeline from a version forked from v1.1.2 to v2.1.6. Very minor changes were made in both adaptations for the sake of metadata/display, without modifying the analyses performed. Given that, however, my question is a bit more general.

The phenomenon I am observing is that the optimal and conservative peaks generated by MACS2 are often identical for pipeline version 2.1.6. I read in the documentation that the optimal peaks are selected as the set with the most peaks, while the conservative set is selected as the set of true replicate consistent peaks, and can see that these are the same set. However, I was not expecting that the true replicate peaks would be the largest set.

To give a concrete example, running the old and new pipeline on the same set (from fastq files), the choice of conservative peaks changes from the ppr to rep1-rep2 set from v1.1.1 to v2.1.6. The full QC metrics are available here (v1.1.1 and v2.1.6) and the dataset is available at https://data.4dnucleome.org/experiment-set-replicates/4DNESSNWXHXK/. The control experiment was similarly processed starting from fastq files with old and new pipeline versions, respectively.

All this is to ask if there is some change (perhaps in the method of generating/number of pseudoreplicates) that might explain this difference in output, or if it is more likely there is a flaw in the new implementation. I know v1.1.1 is quite old at this point, which I apologize for. I would be grateful for any insight you might have and happy to provide more information should it be helpful.

leepc12 commented 1 year ago

Looking at these two QC reports. I need to see QC metrics for alignment steps too. Also, please post your input JSON file for those two runs. Also, input JSON for the alignment process too if possible. We changed the way how Strand cross-correlation measures (xcor step) estimates fragment length. I think this might be the major reason for the difference.

clarabakker commented 1 year ago

Thank you so much--sorry for the delay! Here are the two QCs for alignment for this set using v1.1.1 (1 and 2) and the two using v2.1.6 (3 and 4).

The input JSONs for the two runs in the original question: v1.1.1 post-alignment JSON and v2.1.6 post-alignment JSON.

The input JSONs for v1.1.1 alignment are here: 1 and 2. However, the v2.1.6 alignment input JSON is a bit difficult to retrieve at the moment. I can update with it if it would be helpful--I just don't want to take too much of your time.