Does the calculation of sval in encode_task_macs2_signal_track_chip.py take into account the difference between single-end and paired-end data?

ezioljj commented 1 year ago

I am in the process of using your innovative scripts for ChIP-seq data processing (many datasets containing both single-end and paired-end). While going through the pipeline, I noticed that there doesn't appear to be a clear distinction made between single-end ChIP-seq data and paired-end ChIP-seq data during sval calculation (encode_task_macs2_signal_track_chip.py, line 145). In my understanding, the tagAlign files generated by this pipeline might differ between single-end and paired-end data.

For single-end tagAlign files, the read count should be equal to the total number of lines in the file, while for paired-end tagAlign files, the read count should be half of the total number of lines in the file. However, I'm not entirely certain if my interpretation is correct. I would greatly appreciate your assistance in clarifying this matter. Thank you once again for creating such valuable tools.

akundaje commented 1 year ago

We process SE and PE reads as SE with MACS2. i.e. for PE, we use both ends of the fragment as separate reads. For SE, just the one end. MACS2 is not run in paired-end mode for PE reads. Both ends of the fragment are present in the tagAlign and presented to MACS2 in SE mode.

So sval for SE and PE data would based on counting all reads (for PE both ends of each frag and for SE one end).

sval = min(no. of chip reads, no. of control reads) / 1e-6

On Fri, Sep 1, 2023 at 1:57 AM ezioljj @.***> wrote:

I am in the process of using your innovative scripts for ChIP-seq data processing (many datasets containing both single-end and paired-end). While going through the pipeline, I noticed that there doesn't appear to be a clear distinction made between single-end ChIP-seq data and paired-end ChIP-seq data during sval calculation (encode_task_macs2_signal_track_chip.py, line 145). In my understanding, the tagAlign files generated by this pipeline might differ between single-end and paired-end data.

For single-end tagAlign files, the read count should be equal to the total number of lines in the file, while for paired-end tagAlign files, the read count should be half of the total number of lines in the file. However, I'm not entirely certain if my interpretation is correct. I would greatly appreciate your assistance in clarifying this matter. Thank you once again for creating such valuable tools.

— Reply to this email directly, view it on GitHub https://github.com/ENCODE-DCC/chip-seq-pipeline2/issues/298, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDWEMQ2PZDM6VQKA3MKB3XYGPQDANCNFSM6AAAAAA4HIANUY . You are receiving this because you are subscribed to this thread.Message ID: @.***>

ezioljj commented 1 year ago

Thank you for your explanation. I have two additional questions I would like to ask:

All paired-end (PE) and single-end (SE) data will be treated as single-end (SE). What is the reason for this? Is it because you intend to streamline the analysis pipeline for both SE and PE data in a larger project that contains both SE and PE datasets? I also assume that during certain processing steps, such as removing reads in a blacklist or retaining reads within mappable regions, some reads may be discarded or retained without maintaining their paired relationships (maybe this situation occurs rarely, so we just do not consider this as a problem?)
In your opinion, what would be a better or more suitable approach for calling peaks with MACS2: using fragment or reads? In my understanding, using reads can get two peaks (need merge) and using fragments only get one peak for each binding position.

ENCODE-DCC / chip-seq-pipeline2

Does the calculation of sval in encode_task_macs2_signal_track_chip.py take into account the difference between single-end and paired-end data? #298