GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

Using Peak-Gene Assignments to Calculate Gene Score #546

Closed emmawwinchester closed 3 years ago

emmawwinchester commented 3 years ago

A problem we've been running into with our data is that the default distance from the TSS used to calculate gene scores genome-wide is not reliable in non-terminally differentiated cells, such as embryonic stem cells. The default works fine for some genes but not for others. Even changing the parameters to different distances flanking the TSS isn't reliable, because every gene lies in a different genomic context: it is surrounded by other genes at varying distances that may or may not also be accessible and may also have active enhancers nearby.
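To make the limitation concrete, here is a minimal Python sketch of a purely distance-based score. It assumes an exponential-decay weight of the form exp(-|x|/5000) + exp(-1), which is my understanding of ArchR's default geneModel; the function names and toy numbers are hypothetical, and ArchR's real calculation also involves gene boundaries and gene-size weights that are not modelled here.

```python
import math

def distance_weight(distance_bp, scale_bp=5000.0):
    """Weight for signal at `distance_bp` from the gene (assumed decay form)."""
    return math.exp(-abs(distance_bp) / scale_bp) + math.exp(-1)

def distance_gene_score(tiles):
    """Sum accessibility counts weighted only by distance.

    `tiles` is a list of (distance_bp, count) pairs.
    """
    return sum(count * distance_weight(d) for d, count in tiles)

# Toy data: the same promoter signal, with and without a strong peak 80 kb away.
near_only = [(0, 4), (2000, 2)]
with_distal = [(0, 4), (2000, 2), (80000, 10)]

print(distance_gene_score(near_only))
print(distance_gene_score(with_distal))
```

The distal peak contributes the same amount whether or not it actually regulates this gene, because the weight depends on distance alone. That is the genomic-context problem described above.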

We've fiddled with the various settings, and we think the best way to overcome these problems would be to first assign peaks to genes (either using ArchR, or using ABCenhancergene or a similar program), then use these assignments/loops as a factor in the prediction of the gene score, instead of relying only on accessibility within a certain distance of the gene. This would take into account accessibility at the TSS of the gene in addition to known biological connections between regulatory elements and the TSS.

The idea would be to use addGeneScoreMatrix, with the option geneModel=usePeakToGene(project), or something along those lines.
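As a rough illustration of the proposal (not an existing ArchR option), the sketch below scores each gene from its TSS peak plus the peaks linked to it. Everything here is an assumption made for the example: the function name, the data layout, the use of the link correlation as a weight, and the 0.45 cutoff.

```python
from collections import defaultdict

def link_based_gene_scores(peak_counts, tss_peaks, links, min_correlation=0.45):
    """Score genes from TSS accessibility plus linked-peak accessibility.

    peak_counts: {peak_id: accessibility count}
    tss_peaks:   {gene: peak_id overlapping that gene's TSS}
    links:       list of (peak_id, gene, correlation) peak-to-gene assignments
    """
    scores = defaultdict(float)
    # Start from accessibility at each gene's own TSS.
    for gene, peak in tss_peaks.items():
        scores[gene] += peak_counts.get(peak, 0.0)
    # Add linked peaks, weighted by link strength; skip weak links and
    # the TSS peak itself so it is not counted twice.
    for peak, gene, corr in links:
        if corr >= min_correlation and tss_peaks.get(gene) != peak:
            scores[gene] += corr * peak_counts.get(peak, 0.0)
    return dict(scores)

peak_counts = {"p1": 4, "p2": 10, "p3": 6}
tss_peaks = {"GENE_A": "p1", "GENE_B": "p3"}
links = [("p2", "GENE_A", 0.8), ("p2", "GENE_B", 0.2), ("p1", "GENE_A", 0.9)]

scores = link_based_gene_scores(peak_counts, tss_peaks, links)
print(scores)
```

In this toy case the enhancer peak p2 raises the score of GENE_A only, since its link to GENE_B falls below the cutoff, regardless of which gene is closer.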

Is this something that would be possible to add? We've looked into adding our own patches to rig it up for our own uses, but haven't been able to thus far. Thank you all in advance.

jgranja24 commented 3 years ago

Hi @emmawwinchester, the main utility of Gene Scores is to identify biological labels associated with clusters based on known marker genes. This method isn't perfect, but it does work surprisingly well for lots of marker genes. Regarding your question, it really isn't possible to use it in this manner at the moment because of how the implementation is set up. The best thing I can imagine is splitting the peaks into groups (based on the linked gene assignment) and computing module scores (see https://github.com/GreenleafLab/ArchR/issues/308). Screenshot from that issue:

[Image: Screen Shot 2021-02-23 at 8 18 10 PM]
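The suggestion of splitting peaks into groups by linked gene and scoring each group could look roughly like the Python sketch below. It is a simplification under stated assumptions: the function names and data layout are hypothetical, and the score is just a depth-normalized mean per group, whereas a real module-score implementation may also subtract background features.

```python
def group_peaks_by_gene(links):
    """Turn (peak_id, gene) assignments into {gene: [peak_id, ...]}."""
    groups = {}
    for peak, gene in links:
        groups.setdefault(gene, []).append(peak)
    return groups

def module_scores(cell_counts, groups):
    """Per-cell mean accessibility of each peak group, divided by cell depth.

    cell_counts: {cell: {peak_id: count}}
    Returns {cell: {gene: score}}.
    """
    scores = {}
    for cell, counts in cell_counts.items():
        depth = sum(counts.values()) or 1  # avoid dividing by zero for empty cells
        scores[cell] = {
            gene: sum(counts.get(p, 0) for p in peaks) / len(peaks) / depth
            for gene, peaks in groups.items()
        }
    return scores

links = [("p1", "GENE_A"), ("p2", "GENE_A"), ("p3", "GENE_B")]
cell_counts = {
    "c1": {"p1": 2, "p2": 2, "p3": 0},
    "c2": {"p1": 0, "p2": 1, "p3": 3},
}

groups = group_peaks_by_gene(links)
result = module_scores(cell_counts, groups)
print(result)
```

Each gene's peak group then behaves like a per-cell feature that can be plotted or compared across clusters in place of a distance-based gene score.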