GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

Questions about differential peak calling #470

Closed aselewa closed 3 years ago

aselewa commented 3 years ago

Hi,

I was wondering how exactly differential accessibility testing is performed. The documentation (as far as I can see) is somewhat lacking in this respect. I tried reading the code, but it's not clear to me as I am not an expert in R.

In particular, I am wondering whether getMarkerFeatures tests one cell type against all the other cell types per peak, or whether it does differential peak testing between every pair of clusters and picks the best. It would be nice to know how the Log2FC is calculated. Also, is the Wilcoxon test done on single-cell measurements, or are they aggregated in some way? If it is done on single cells, I imagine there are many zeros due to sparsity; how do you break ties in the rank-sum test?

Thanks!

rcorces commented 3 years ago

This section from the manuscript should answer some of your questions. Let us know if not.

ArchR Methods – Marker Peak Identification

ArchR allows for identification of features that are highly specific to a given group/cluster to elucidate cluster-specific biology. ArchR can identify these features for any of the matrices that are created with ArchR (stored in the Arrow files). ArchR identifies marker features while accounting for user-defined known biases that might confound the analysis (defaults are the TSS enrichment score and the number of unique nuclear fragments). For each group/cluster, ArchR identifies a set of background cells that match for the user-defined known biases and weights each equivalently using quantile normalization. Additionally, when selecting these bias-matched cells ArchR will match the distribution of the other user-defined groups. For example, if there were 4 equally represented clusters, ArchR will match the biases for a cluster to the remaining 3 clusters while selecting cells from the remaining 3 groups equally. By selecting a group of bias-matched cells, ArchR can directly minimize these confounding variables during differential testing rather than using modeling-based approaches. ArchR allows for binomial testing, Wilcoxon testing (via presto, https://github.com/immunogenomics/presto/), and two-sided t-testing for comparing the group to the bias-matched cells. These p-values are then adjusted for multiple hypothesis testing and organized across all group/clusters. This table of differential results can then be used to identify marker features based on user-defined log2(Fold Change) and FDR cutoffs.
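On the question of ties: the excerpt does not say how ties are handled, so the following is only a sketch of the textbook treatment, not ArchR's or presto's actual code. In the standard rank-sum test, tied values (for example the many zeros in sparse single-cell data) share the average of the ranks they span, and the variance of the statistic is reduced by a tie-correction term. The function names here (`midranks`, `rank_sum_test`, `benjamini_hochberg`) are made up for illustration, the p-value uses a normal approximation without continuity correction, and Benjamini-Hochberg is assumed as the multiple-testing adjustment because the excerpt mentions FDR cutoffs without naming a method.

```python
import math


def midranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def rank_sum_test(group, background):
    """Two-sided rank-sum test (normal approximation, tie-corrected variance)."""
    n1, n2 = len(group), len(background)
    n = n1 + n2
    ranks, tie_sizes = midranks(list(group) + list(background))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Ties shrink the variance; with no ties this term is zero.
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u <= 0:
        return u, 1.0  # every value is tied, so there is no evidence either way
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy per-peak accessibility values for one cluster and its background cells.
u_stat, p_value = rank_sum_test([1, 1, 0, 2, 1, 0], [0, 0, 0, 1, 0, 0])
```

With many zeros, most cells fall into one large tie group; the test stays valid, but the shrunken variance means that sparse peaks need a larger shift in the non-zero cells to reach significance.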

aselewa commented 3 years ago

Thanks, this clarifies things a little. So is it correct to say that the log2FC is done between all cells in my target cluster, and all the matched background cells? Also if the clusters are not equal in number of cells, are the background cells sampled proportionally from each background cluster?

rcorces commented 3 years ago

> So is it correct to say that the log2FC is done between all cells in my target cluster, and all the matched background cells?

Well, not necessarily all of them: the maxCells parameter to getMarkerFeatures() dictates how many cells from the group are used.

> Also if the clusters are not equal in number of cells, are the background cells sampled proportionally from each background cluster?

Yes - I believe so.
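To make the two answers above concrete, here is a minimal sketch of what capping the target group and sampling the background proportionally could look like. It is an illustration under assumptions, not ArchR's implementation: `select_cells` and `max_cells` are hypothetical names (the latter only mirrors the role of maxCells), each cell carries a single numeric bias value instead of several quantile-normalized ones, and matching is a simple greedy nearest-bias pick.

```python
import random
from collections import defaultdict


def select_cells(cells, target, max_cells=500, seed=1):
    """Pick foreground cells and a bias-matched, proportionally sampled background.

    cells: list of (cell_id, cluster, bias) tuples, with bias a single number.
    """
    rng = random.Random(seed)

    # Cap the target cluster at max_cells by random subsampling.
    foreground = [c for c in cells if c[1] == target]
    if len(foreground) > max_cells:
        foreground = rng.sample(foreground, max_cells)

    # Pool the remaining cells by cluster.
    pools = defaultdict(list)
    for cell in cells:
        if cell[1] != target:
            pools[cell[1]].append(cell)
    total = sum(len(pool) for pool in pools.values())

    background = []
    for cluster, pool in pools.items():
        # Each background cluster contributes in proportion to its size.
        quota = min(len(pool), round(len(foreground) * len(pool) / total))
        available = list(pool)
        # For a random subset of foreground cells, take the unused
        # background cell whose bias value is closest.
        for _, _, fg_bias in rng.sample(foreground, quota):
            nearest = min(available, key=lambda c: abs(c[2] - fg_bias))
            available.remove(nearest)
            background.append(nearest)
    return foreground, background


# Toy data: cluster A is the target; B is three times the size of C.
toy_rng = random.Random(0)
toy_cells = [
    (f"{cluster}{i}", cluster, toy_rng.uniform(4, 12))
    for cluster, size in (("A", 10), ("B", 30), ("C", 10))
    for i in range(size)
]
fg, bg = select_cells(toy_cells, target="A", max_cells=8)
```

In this toy run, the 8 retained target cells are paired with 6 background cells from B and 2 from C, which reflects the 3:1 size ratio of the two background clusters.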