GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)

MIT License

Running addDeviations but constrained within groups, rather than using a background across all groups. #530

Closed markphillippebworth closed 3 years ago

markphillippebworth commented 3 years ago

Describe the problem that your feature request would address. It would be great to be able to run addDeviations in a group-wise fashion, for example in a pairwise manner for treated vs. untreated mouse lines, or across longitudinal data from one individual. Right now, I'd have to subset each grouping I want to analyze, which would mean completely recopying my ArchRProject. That takes an incredible amount of hard drive space and is computationally very slow.

Describe the solution you'd like. The alternative would be to give it a metadata column with the groupings to constrain within. For each group, addBgdPeaks would be run, and the deviations calculated for each Arrow file within the group using group-matched bgdPeaks. This would let us leverage the experimental design to control for individual variation in motifs, and see only the longitudinal or treatment effect on motif usage. Essentially, I get to normalize by individual this way.

Describe alternatives you've considered. I'm going to have to create a new ArchRProject for each individual, run addDeviations, and export the motif matrix, then combine them for visualization. This is going to take forever to copy because I'm working with over 30 individuals and 3 time points per individual.

rcorces commented 3 years ago

Hi @markphillippebworth. Thanks for posting and using the Issue Template. I'm not sure I fully understand the use case, so I'll ask some clarifying questions.

First, the deviations are dependent only on the peakAnnotation. So in the case of the tutorial code, we add a peakAnnotation object for cis-bp motifs. The deviations obtained for this peakAnnotation are calculated on a per-cell basis by the addDeviationsMatrix() function and are independent of any comparisons you would like to do downstream. I don't believe that a "treated vs untreated" comparison is relevant here. The function addDeviationsMatrix() only depends on the peakAnnotation object.

Perhaps some of my confusion is from your reference to "addDeviations", which isn't a function in ArchR. Maybe you meant to refer to something else? Can you clarify? I'm not sure why you need to create a new ArchRProject rather than just re-running the individual steps on the same ArchRProject. Sorry if I'm missing something.

markphillippebworth commented 3 years ago

Hi @rcorces, Thank you for responding so quickly!

Yes, I am talking about addDeviationsMatrix(). I thought addDeviationsMatrix() relied on addBgdPeaks() and generated z-scores by comparison to matched peaks across the entire project.

If you run addDeviationsMatrix() on an entire PBMC dataset, then it'll pull background peaks across multiple cell types for comparison to each arrow, which also contains multiple cell types. Lots of variation.

Now imagine that an ArchRProject is subset to only a specific cell type, like NK cells. addBgdPeaks will now pull from peaks that are only accessible in NK cells, and most variability in peak accessibility (aside from stochasticity) should come from diseased vs. control status. Furthermore, each arrow will be either a diseased sample of NK cells or a healthy sample of NK cells. So we're effectively calculating motif deviations of an arrow file with only diseased (or only healthy) NK cells against a background set of peaks from healthy and diseased NK cells.

Does this make sense, or did I misunderstand something?

rcorces commented 3 years ago

Thanks for clarifying. I understand now.

Yes - you are correct that addDeviationsMatrix() depends on the background peaks which in turn depends on your peakSet object and the accessibility across different cells in those peaks.

I give my 2 cents below but @jgranja24 is really the one who should answer this.

I am not the most familiar with this part of the code, but I think one way to do this would be to change your peakSet to an NK-specific peak set, then recompute bgdPeaks, and then re-run addDeviationsMatrix() for each comparison you want to do. This wouldn't be exactly the same as what you are doing (because the bgdPeaks would still be computed using all cells instead of just NK cells). One option would be to add an argument to addBgdPeaks() that lets you use only a subset of the cells in the project. In that case, I think we would be recapitulating exactly your subset workflow, right? The bigger the change to the code base, the less likely it is to be implemented given bandwidth etc., so if you can think of easy ways to make this happen, let us know.

I'd be curious to know how much this affects the results at the end of the day. Have you checked?

@jgranja24 - any thoughts?

markphillippebworth commented 3 years ago

@rcorces Yes, that'd be pretty close actually. Being able to choose which arrows to run addDeviationsMatrix() on would also be great. Then I could manually match bgdPeaks to addDeviationsMatrix() and iterate over groups myself, instead of asking you to add that implementation to the code base.

I haven't checked by running it manually. I'm working with patient data, so I'll need to see how significantly different background peaks are between patients, and how the motifs change.

I'd appreciate @jgranja24's thoughts too.

jgranja24 commented 3 years ago

Hi @markphillippebworth,

I see what you are asking, but I think you may be a bit confused about how chromVAR works. The background peaks are matched for GC content and average accessibility. The chromVAR "z-score" is independent for each cell (for a given cell it represents the (observed - expected) / expected accessibility). Therefore, I am not following how variation across samples would affect it. The only major thing that would affect chromVAR is the selection of peaks being used, but the biological result shouldn't change tbh. Maybe I am still misunderstanding your question, but I would just calculate the deviations using all cells and do your comparisons afterwards. The variability ranking in chromVAR is simply the rowVars of the z-score matrix, so you can just subset cells after the chromVAR analysis to do this ranking. I hope that helps! Please let us know if we are misinterpreting your goal!
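The post-hoc subsetting suggested above can be illustrated with a toy sketch. This is plain Python rather than ArchR code: the z-score values, the motif and cell names, and the helper `rank_by_variability` are all made up for illustration, and the per-motif sample variance stands in for the rowVars step.

```python
from statistics import variance

# Toy deviation z-scores: one row per motif, one column per cell.
# All names and numbers are hypothetical.
cells = ["c1", "c2", "c3", "c4", "c5", "c6"]
cell_type = {"c1": "NK", "c2": "NK", "c3": "NK", "c4": "B", "c5": "B", "c6": "B"}
z_scores = {
    "motif_A": [2.0, 2.1, 1.9, -2.0, -2.1, -1.9],  # separates NK from B cells
    "motif_B": [-1.5, 0.0, 1.5, 0.1, 0.0, -0.1],   # varies mostly within NK cells
}

def rank_by_variability(z, columns):
    """Rank motifs by the sample variance of their z-scores over the chosen cells."""
    var = {motif: variance([row[j] for j in columns]) for motif, row in z.items()}
    return sorted(var, key=var.get, reverse=True)

all_idx = list(range(len(cells)))
nk_idx = [j for j, c in enumerate(cells) if cell_type[c] == "NK"]

ranking_all = rank_by_variability(z_scores, all_idx)  # ranking over every cell
ranking_nk = rank_by_variability(z_scores, nk_idx)    # ranking over NK cells only
print(ranking_all)
print(ranking_nk)
```

In this toy data the ranking flips once only the NK columns are kept, without recomputing any z-scores. Whether the z-scores themselves would differ under a different background is a separate question that this sketch does not address.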

markphillippebworth commented 3 years ago

@jgranja24 - Thank you for responding. I guess my goal is to change the background peakset to control for specific variables.

Z-scores are calculated for each cell (independently of other cells), but they still depend on the background peakset, which is generated from the PeakMatrix, which in turn depends on the cell types within a given ArchR project. When that background peakset has a broad set of peaks with many different motifs across multiple cell types, any given cell will show motif enrichment for cell-type-specific, process-specific, or condition-specific TF usage. If I limit that background peakset to only regions common to NK cell processes, then even if you choose many GC-matched peaksets, they will contain motifs important to NK cells (if enough peaks are sampled).

In other words, if the background peakset is drawn from a subset of peaks common to NK cells (after GC-matching), would that remove biology related to background NK processes? The expected accessibility for a given NK-cell-type motif would be close to the observed motif accessibility in each NK cell when given an NK-cell-specific background. In contrast, a TF motif relevant to an NK cell's response to disease would be significantly enriched in an NK cell responding to disease when compared to a background set of peaks from NK cells in general (or a peakset taken from disease and healthy conditions). My current understanding is that the background peaksets are calculated using the PeakMatrix, which is dependent on the cell types in the project. Please let me know if I have a leap in logic here. You've spent more time with chromVAR than I have.
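To make the dependence on background peaks concrete, here is a minimal sketch of a chromVAR-style z-score for one motif, based on my understanding of the method: a raw deviation (observed - expected) / expected is computed for the motif's peaks and for each background peak set, and the z-score normalizes the motif deviation by the mean and standard deviation of the background deviations. It is a simplified toy in plain Python, not the chromVAR or ArchR implementation. The counts, the peak indices, and both background collections are made up, and real background peaks are matched on GC content and average accessibility rather than chosen by hand.

```python
from statistics import mean, stdev

# Toy counts matrix, counts[peak][cell]; all values are hypothetical.
counts = [
    [4, 0, 1],
    [3, 1, 0],
    [1, 2, 2],
    [0, 3, 1],
    [2, 2, 2],
    [1, 1, 3],
]
n_peaks, n_cells = len(counts), len(counts[0])

# Expected counts: each peak's share of all reads, scaled by each cell's depth.
total = sum(map(sum, counts))
peak_frac = [sum(row) / total for row in counts]
depth = [sum(counts[p][c] for p in range(n_peaks)) for c in range(n_cells)]

def raw_deviation(peaks, cell):
    """(observed - expected) / expected over a set of peaks, for one cell."""
    observed = sum(counts[p][cell] for p in peaks)
    expected = sum(peak_frac[p] * depth[cell] for p in peaks)
    return (observed - expected) / expected

def z_score(motif_peaks, background_sets, cell):
    """Motif deviation normalized by the deviations of the background peak sets."""
    bg = [raw_deviation(peaks, cell) for peaks in background_sets]
    return (raw_deviation(motif_peaks, cell) - mean(bg)) / stdev(bg)

motif_peaks = [0, 1]                            # peaks containing the motif
background_a = [[2, 3], [4, 5], [2, 5]]         # one hypothetical background
background_b = [[2, 3], [3, 4], [3, 5]]         # a different hypothetical background

z_a = [z_score(motif_peaks, background_a, c) for c in range(n_cells)]
z_b = [z_score(motif_peaks, background_b, c) for c in range(n_cells)]
print(z_a)
print(z_b)
```

Each cell's z-score is computed without reference to the other cells' z-scores, yet swapping the background sets changes the numbers. In this toy example the sign for the first cell stays positive under both backgrounds. How much the choice matters on real data is the empirical question raised in the thread.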

rcorces commented 3 years ago

I think your logic is fine, but the argument that Jeff is making (and that I also raised) is that the difference in background peaks is unlikely to change your result. Perhaps you can test this and report back, and also test the suggestion I gave you. I have a feeling that you won't see much difference.

markphillippebworth commented 3 years ago

Ok. Will do.