ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License
383 stars 171 forks source link

Interpreting result files and selecting files for visualization and DAR analysis #413

Open Rafaelsoler13 opened 1 year ago

Rafaelsoler13 commented 1 year ago

Dear developers,

Firstly, I would like to express my appreciation for the detailed ATAC-seq pipeline that you developed. As a PhD student, I find this tool extremely helpful in analyzing ATAC-seq data. However, as someone who is still not 100% familiarized with all the terms and concepts in this field, I have a few questions that I would appreciate your help with.

Firstly, I am struggling to interpret some of the result files. Is there a tutorial or documentation that can explain each result file in more detail?

In addition, I am also unsure about which files are the final ones for visualization purposes, such as bigWigs or bam files, and which ones are ideal for posterior analysis using csaw. I would greatly appreciate it if you could provide some guidance on this matter.

Lastly, I recently ran the example subset data test ENCSR356KRQ_subsampled to check if the pipeline is running without any issues. Could you provide an updated html as a positive control to validate that everything worked fine? Or is there an output message that confirms that everything went okay? This is my QC file.

qc.html.txt

Best,

Rafael

sufyazi commented 1 year ago

Not one of the developers but I can try helping out. First of all, you have to run croo on the metadata.json output. It helps a lot in sieving through the opaque, various output into digestible, even visualizable bits. The README.md has a section that you can follow along.

After running croo you will find four major directories in the output, namely align, peak, qc, and signal. They are descriptive and self-explanatory: the bam file in align is the aligned reads file you would use for any downstream analysis, the narrowPeak.gz files in peak (multiple ones depending on whether you turn idr analysis on or off - if I am not mistaken this is possible by customizing the config JSON) are the output of MACS2 and represent the statistically called regions of interest (in this case, open chromatin) and what you will use for any downstream analysis like with csaw. qc is just for QC visualization, and signal contains the BigWig files, which are the continuous format data files for use in IGV to visualize read pileups. Think of it as the processed version of bam files, I guess.

I am not sure if there is any kind of 'best practice' validation step post-pipeline run, but I personally just trust the Cromwell std err output, so if I see the line Workflow: id=27336d51-bc86-46a6-8d59-bff543d8d73d, status=Succeeded (the id value here will be random every run as it is the workflow UUID) and Cromwell finished successfully. I consider the pipeline run a success. It's always good to do some sanity check by visualizing the output bigwig files and checking the QC html (just open the file on your browser of choice) for things that are out of the ordinary (which I won't get into here - you can read papers like the one from Yan et al. 2020. You asked for a 'positive control' html example, but I don't think such thing is possible because there isn't really one 'correct' atac-seq pipeline output.

By the way, you can find the visualization of the workflow in an svg file generated by croo. Below is one I grabbed from my run to demonstrate how helpful it is.

example visualization of the workflow

As for which IDR-processed output files (narrowPeaks) to use for your downstream analysis, that's something you might need to decide yourself depending on how many biological replicates you did, the nature of the experiment/study, etc. Might be worth discussing with your supervisor. You can also glean what 'overlap', conservative and optimal set means in the pipeline context by checking out the specification here.

Hope this helps at least.

Rafaelsoler13 commented 1 year ago

Thank you for the detailed explanation. It's always a pleasure read answers like this one =)

Rafaelsoler13 commented 10 months ago

Sorry to reopen this issue, but I was still wondering which file I should use for DAR analysis. Should I use the narrowPeak file in the peak/ folder? Or should I use Overlap Peaks / IDR Peaks? Or are they just to check the reproducibility of the peaks? @sufyazi

Thank you!

sufyazi commented 10 months ago

I think the narrowPeak files in the peak folder should be fine to use, though I personally just merge the overlap and IDR peak files and use these instead in my analysis. They are just more stringent peaks. But it might depend on what you want to do/see in your analysis so your mileage may vary and you might be fine just using the less stringent, background-filtered (bfilt) peak set file.