GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

GeneActivityScore region calculation for overlapping genes #1535

Open bnprks opened 2 years ago

bnprks commented 2 years ago

ArchR's .addGeneScoreMat function miscalculates the flanking region to include in a gene's score when one gene fully overlaps (encloses) another.

This issue affects about 16% of genes in the default ArchR Hg38 annotation. (I suspect this high count is due to the inclusion of many miRNAs in the default annotation.)

An example for debugging is the gene SYN2, which fully encloses the gene TIMP4 in ArchR's Hg38 gene annotation.

To illustrate what happens, consider a pair of genes, one inside of the other:

                  |-------------------------|
                                  |---|

ArchR will extend these gene regions as follows:

        ++++++++++|-------------------------|+
                                 +|---|++++++++++

This is because .addGeneScoreMat finds neighboring genes by sorting on start coordinate. So the outer gene sees the inner gene as its right-hand neighbor, and the inner gene sees the outer gene's right-hand neighbor as its own right-hand neighbor.
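To make the behavior described above concrete, here is a minimal sketch in plain Python. It is a toy model, not ArchR's actual R code. The names `MIN_EXT`, `MAX_EXT`, `clamp`, and `sort_based_flanks` are hypothetical, coordinates are made up, strand is ignored, and I am assuming each gap is measured from a gene's edge to the adjacent edge of its neighbor in start-sorted order and then clamped to the allowed extension range.

```python
# Toy model of sort-based neighbor finding (NOT ArchR's real implementation).
MIN_EXT = 1    # hypothetical minimum flank, drawn as a single "+" above
MAX_EXT = 10   # hypothetical maximum flank, drawn as "++++++++++" above


def clamp(gap):
    """Limit a gap to the allowed extension range [MIN_EXT, MAX_EXT]."""
    return max(MIN_EXT, min(MAX_EXT, gap))


def sort_based_flanks(genes):
    """genes: list of (name, start, end) on one chromosome.

    Returns {name: (left_extension, right_extension)}, treating the
    previous/next gene in start-sorted order as the left/right neighbor.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    flanks = {}
    for i, (name, start, end) in enumerate(ordered):
        # Gap to the end of the previous gene in start order (assumption).
        left_gap = start - ordered[i - 1][2] if i > 0 else MAX_EXT
        # Gap to the start of the next gene in start order.
        right_gap = ordered[i + 1][1] - end if i + 1 < len(ordered) else MAX_EXT
        flanks[name] = (clamp(left_gap), clamp(right_gap))
    return flanks


# "inner" sits entirely inside "outer"; "left" and "right" are far away.
genes = [("left", 0, 5), ("outer", 30, 60), ("inner", 48, 52), ("right", 100, 110)]
flanks = sort_based_flanks(genes)
print(flanks["outer"])  # (10, 1): right side cut short by the enclosed gene
print(flanks["inner"])  # (1, 10): right side extends past the outer gene
```

In this model the outer gene's right-hand gap is negative (its "neighbor" starts before it ends), so it collapses to the minimum, while the inner gene's right-hand gap is measured to a gene far outside the outer one.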

What to do in the case of overlapping genes is a bit of a judgement call, but I'd propose extending as follows:

        ++++++++++|-------------------------|++++++++++
                                 +|---|+

This could be implemented by removing fully enclosed genes before running the existing sort-based neighbor-finding logic, then assigning every fully enclosed gene the minimum flanking extension on both sides.
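A rough sketch of that proposal, again in plain Python rather than ArchR's R code. All names here (`MIN_EXT`, `MAX_EXT`, `find_enclosed`, `proposed_flanks`, and the toy `sort_based_flanks` standing in for the existing logic) are hypothetical, and the sketch assumes a single chromosome and ignores strand.

```python
# Sketch of the proposed fix under stated assumptions (NOT ArchR's real code).
MIN_EXT = 1    # hypothetical minimum flank
MAX_EXT = 10   # hypothetical maximum flank


def clamp(gap):
    return max(MIN_EXT, min(MAX_EXT, gap))


def sort_based_flanks(genes):
    """Toy stand-in for the existing start-sorted neighbor logic."""
    ordered = sorted(genes, key=lambda g: g[1])
    flanks = {}
    for i, (name, start, end) in enumerate(ordered):
        left_gap = start - ordered[i - 1][2] if i > 0 else MAX_EXT
        right_gap = ordered[i + 1][1] - end if i + 1 < len(ordered) else MAX_EXT
        flanks[name] = (clamp(left_gap), clamp(right_gap))
    return flanks


def find_enclosed(genes):
    """Names of genes lying entirely within another gene.

    Sorting by (start, -end) guarantees any enclosing gene comes first, so a
    gene is enclosed exactly when its end does not exceed the largest end
    seen so far. Of two identical intervals, the later one counts as enclosed.
    """
    enclosed = set()
    max_end = None
    for name, start, end in sorted(genes, key=lambda g: (g[1], -g[2])):
        if max_end is not None and end <= max_end:
            enclosed.add(name)
        else:
            max_end = end
    return enclosed


def proposed_flanks(genes):
    """Drop enclosed genes, run the existing logic, then give each
    enclosed gene the minimum extension on both sides."""
    enclosed = find_enclosed(genes)
    flanks = sort_based_flanks([g for g in genes if g[0] not in enclosed])
    for name in enclosed:
        flanks[name] = (MIN_EXT, MIN_EXT)
    return flanks


genes = [("left", 0, 5), ("outer", 30, 60), ("inner", 48, 52), ("right", 100, 110)]
fixed = proposed_flanks(genes)
print(fixed["outer"])  # (10, 10)
print(fixed["inner"])  # (1, 1)
```

Partially overlapping genes are left to the existing logic here; only full enclosure is special-cased, which matches the scope of the proposal.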

rcorces commented 2 years ago

Hi @bnprks! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
Before we help you, you must respond to the following questions unless your original post already contained this information:
1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved?
2. Can you recapitulate your error using the tutorial code and dataset? If so, provide a reproducible example.
3. Did you post your log file? If not, add it now.
4. Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.

rcorces commented 2 years ago

Thanks for the thorough report. We will fix this.