ATAC QC Filtering Tiling Regions

astarr97 commented 3 years ago

Hello,

We did a very shallow sequencing run of ~8 million reads (before mapping etc.) spread over 20 bulk ATAC samples from 5 different cell types with 3 "treatments" each (notably, our Qubit quantification seems to have been quite off, so the reads are not very evenly distributed across the samples). I have been using ChrAccR to do some quality control on the data. Things look decent (though nowhere near as nice as the provided T-cell data in terms of peak distribution) except for one glaring difference between the two. In my data, only 5710 50 bp tiling regions were retained after filtering (compared to ~53,000 in the T-cell data). In addition, the initial filtering step (with the 95% cutoff) discarded only 77% of regions in my dataset (compared to ~99% in the T-cell dataset). The code I am using to run both datasets is identical except for the column names in the samples.tsv file. Because the sequencing depths likely differ and the T-cells probably have much more similar chromatin than our cell types do, a direct comparison seems tricky. Essentially I am wondering: is this level of tiling region retention in line with what you would expect? This is the lab's first time doing ATAC-seq, so we don't have much of a reference point; any help would be much appreciated.

Best, Alex.

PS So far my experience with ChrAccR has been very good, surprisingly painless to install and get running.
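For a sense of scale, the depth described above can be worked through as a back-of-the-envelope calculation. This is only a sketch: the read count, sample count, and tile width come from the post, while the genome size is an assumed approximate value for human, and the uniform-spread assumption is deliberately unrealistic (ATAC reads concentrate in open chromatin, which is why some tiles still pass a coverage filter).

```python
# Back-of-the-envelope depth estimate. Not part of ChrAccR; all names are illustrative.
TOTAL_READS = 8_000_000          # from the post: ~8 million reads before mapping
N_SAMPLES = 20                   # from the post
TILE_WIDTH_BP = 50               # tile width as stated in the post
GENOME_SIZE_BP = 3_100_000_000   # ASSUMPTION: approximate human genome size


def mean_reads_per_sample(total_reads, n_samples):
    """Average number of reads per sample if reads were split evenly."""
    return total_reads / n_samples


def mean_reads_per_tile(reads_per_sample, genome_size_bp, tile_width_bp):
    """Average reads per tile if reads were spread uniformly over the genome."""
    n_tiles = genome_size_bp / tile_width_bp
    return reads_per_sample / n_tiles


reads_per_sample = mean_reads_per_sample(TOTAL_READS, N_SAMPLES)
per_tile = mean_reads_per_tile(reads_per_sample, GENOME_SIZE_BP, TILE_WIDTH_BP)

print(f"{reads_per_sample:,.0f} reads per sample on average")
print(f"{per_tile:.4f} reads per tile per sample under a uniform spread")
```

Under these assumptions the average is about 400,000 reads per sample, which is well under 0.01 reads per tile, so any filter that requires reads in most samples will be sensitive to both the shallow depth and the uneven split between samples.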

demuellae commented 3 years ago

Hi Alex, Thanks for using ChrAccR. Without seeing the actual data, it's hard to tell whether what you are observing is to be expected. Here's a non-exhaustive list of questions you might ask in order to judge this better in your case: What do the QC statistics tell you? Do you have sufficient coverage? Do you see a lot of duplicate reads? What do the TSS enrichment plots and fragment size distribution tell you?
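Two of the checks listed above, duplicate reads and the fragment size distribution, can be summarised with very little code. The following is a minimal sketch rather than ChrAccR's own QC: the function names are hypothetical, and the 150 bp and 300 bp cutoffs separating nucleosome-free, mono-nucleosomal, and larger fragments are assumed round numbers, not values taken from the package.

```python
def duplicate_rate(n_total, n_duplicates):
    """Fraction of reads flagged as duplicates."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_duplicates / n_total


def fragment_size_fractions(lengths, nfr_max=150, mono_max=300):
    """Split fragment lengths (bp) into three classes and return their fractions.

    ASSUMPTION: nfr_max and mono_max are illustrative cutoffs.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    n = len(lengths)
    nfr = sum(1 for x in lengths if x < nfr_max)
    mono = sum(1 for x in lengths if nfr_max <= x < mono_max)
    return {
        "nucleosome_free": nfr / n,
        "mono_nucleosomal": mono / n,
        "larger": (n - nfr - mono) / n,
    }


# Toy input only; real lengths would come from the aligned fragments.
toy_lengths = [50, 80, 120, 200, 250, 400]
fractions = fragment_size_fractions(toy_lengths)
```

A very small mono-nucleosomal fraction relative to the nucleosome-free fraction would correspond to a weak first nucleosomal peak in the fragment size plot.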

astarr97 commented 3 years ago

Thanks for the quick response! We will probably go ahead with sequencing, but I have a few more questions. Are the TSS enrichment values that ChrAccR computes directly comparable to the ENCODE guidelines (where 5-7 is considered acceptable for alignment to hg38)? Also, what is (roughly) the expected library size for a good ATAC-seq run? Should we be concerned about libraries with a very large size/low number of duplicates? Probably most importantly, our data generally shows the first nucleosomal peak only weakly (see attached image). My understanding is that we should be able to call open chromatin relatively well from this, but not do things like TF footprinting or determining nucleosome positioning. Is this true in your experience? Finally, what's the best resource for understanding the methods used by ChrAccR to compute QC metrics?
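For reference when comparing numbers, the ENCODE-style TSS enrichment score can be sketched as below. This follows the ENCODE description as commonly stated (aggregate coverage in a window of 2000 bp on either side of each TSS, normalised by the mean coverage in the outermost 100 bp on each side, with the peak fold change taken as the score). Whether ChrAccR computes its value the same way is exactly the open question here, so this should not be read as the package's implementation; the function name and the toy profile are hypothetical.

```python
def tss_enrichment(profile, flank_bp=100):
    """ENCODE-style TSS enrichment from an aggregate coverage profile.

    profile: per-base coverage summed over all TSS-centred windows,
             e.g. 4001 values for 2000 bp on either side of the TSS.
    flank_bp: number of bases at each end used as background.
    """
    if len(profile) <= 2 * flank_bp:
        raise ValueError("profile is too short for the requested flanks")
    flanks = list(profile[:flank_bp]) + list(profile[-flank_bp:])
    background = sum(flanks) / len(flanks)
    if background == 0:
        raise ValueError("background coverage is zero")
    # Fold change over background at every position; the peak is the score.
    return max(x / background for x in profile)


# Toy profile: flat background of 2.0 with a plateau of 12.0 around the TSS.
toy_profile = [2.0] * 4001
for i in range(1950, 2051):
    toy_profile[i] = 12.0

score = tss_enrichment(toy_profile)
```

Differences in window size, flank definition, or smoothing between tools can shift the score, so thresholds such as 5-7 are only directly comparable when the computation matches.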

