buenrostrolab / FigR

Functional Inference of Gene Regulation
https://buenrostrolab.github.io/FigR/
MIT License

Relevance of dorcK value #37

Closed valentinaOpazo closed 6 months ago

valentinaOpazo commented 7 months ago

Hello there, thanks a lot for developing this useful package and the user-friendly tutorial. I recently ran FigR on my dataset and would like to know the relevance of the 'dorcK' value in the runFigRGRN function. I found a total of 319 DORCs and used the default value (30), whereas the tutorial suggests setting it to ~3 percent of the total DORCs determined.

Additionally, do you think the 100 kb window around the TSS (in the peak-gene association step) covers a large enough proportion of intergenic peaks? Would it be a good idea to increase the window to cover a larger proportion of peaks?

vkartha commented 7 months ago

Hi there - this parameter determines how many DORC genes are used to pool peaks per DORC (nearest neighbors based on DORC accessibility) prior to running motif enrichment. The underlying assumption is that DORCs that are very close in overall accessibility space are likely regulated by similar enhancer/TF programs. Pooling also gives you a larger number of peaks for motif enrichment, which is hard to do on an individual DORC basis, especially if a DORC has few peaks.

If you're using k=30 for 319 total DORCs, that is closer to 10%, so you just need to be aware that you might end up with more overlap in the enrichment results (since neighbors are more prone to overlap). How many peaks did you use as a cutoff to define what DORCs are? You could potentially be more flexible there depending on the "knee" in your DORC peak # plot (i.e. if, say, 5 peaks is reasonable instead of 7 or 10, as was used in the paper).

I will add that we didn't formally include tests for this parameter as part of our work; the 3% was tailored based on some ground truth expectations for specific TFs in certain cell types. You can run k=5, 10 and 20, for example, and see how robust the results are. Again, a k that is too small or too large has implications: if k is too small, you may not be able to detect enrichments because there are too few peaks, and if it is too large, you may "borrow" peak neighbors from DORCs that aren't the best neighbors to begin with (this is the drawback of having a flat [k]).
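To make the pooling idea concrete, here is a minimal sketch in Python (not FigR's actual R implementation). It uses a random toy DORC-by-cell accessibility matrix and random per-DORC peak sets, finds each DORC's k nearest neighbors by Euclidean distance, and pools the neighbors' peaks. The names `knn_indices`, `pooled_peaks`, `dorc_mat` and `dorc_peaks` are hypothetical, and counting the DORC itself as one of its k neighbors is an assumption made for illustration.

```python
import numpy as np

def knn_indices(dorc_mat, k):
    """For each DORC (row), return indices of its k nearest rows (self included)."""
    sq = (dorc_mat ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * dorc_mat @ dorc_mat.T  # squared distances
    np.fill_diagonal(d2, -1.0)  # guarantee each DORC ranks itself first
    return np.argsort(d2, axis=1)[:, :k]

def pooled_peaks(dorc_peaks, neighbors):
    """Union of the peak sets of every neighbor, one pooled set per DORC."""
    return [set().union(*(dorc_peaks[j] for j in row)) for row in neighbors]

# Toy data (assumption): 319 DORCs, 200 cells, 5-14 peaks per DORC.
rng = np.random.default_rng(0)
n_dorcs, n_cells = 319, 200
dorc_mat = rng.random((n_dorcs, n_cells))
dorc_peaks = [
    set(rng.choice(5000, size=rng.integers(5, 15), replace=False).tolist())
    for _ in range(n_dorcs)
]

summary = {}
for k in (5, 10, 20, 30):
    nn = knn_indices(dorc_mat, k)
    pools = pooled_peaks(dorc_peaks, nn)
    mean_pool = float(np.mean([len(p) for p in pools]))
    summary[k] = (k / n_dorcs, mean_pool)
    print(f"k={k:2d}  fraction of DORCs={k / n_dorcs:.1%}  mean pooled peaks={mean_pool:.1f}")
```

With 319 DORCs, 3% works out to roughly k=10, while k=30 is about 9.4% of all DORCs. Larger k gives bigger peak pools for enrichment, but also more shared neighbors between DORCs.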

Regarding your second question (which is a good one) - we tested this to some extent. The chosen window is based on both prior experimental work (CRISPR perturbations to map enhancer half-lives) and our own estimates from truly paired ATAC/RNA data from the same single cells (see the text in Ma et al. related to Figure 3). That is to say, you may not gain much by increasing this window, and doing so may lead to false positive associations, in addition to taking more compute time.
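As a rough illustration of why a wider window means more candidate peak-gene pairs to test, the Python sketch below counts the peaks that fall within a padding distance on either side of one TSS. The coordinates are made up, everything is assumed to lie on a single chromosome, and `peaks_in_window` is a hypothetical helper rather than a FigR function. A pad of 50,000 bp corresponds to a 100 kb total window if "100kb" is read as total width, which is an assumption here.

```python
from bisect import bisect_left, bisect_right

def peaks_in_window(tss, sorted_peak_centers, pad):
    """Return peak centers within +/- pad bp of the TSS (input must be sorted)."""
    lo = bisect_left(sorted_peak_centers, tss - pad)
    hi = bisect_right(sorted_peak_centers, tss + pad)
    return sorted_peak_centers[lo:hi]

# Toy coordinates (assumption): one gene and six peaks on the same chromosome.
peak_centers = sorted([12_000, 48_000, 95_000, 140_000, 210_000, 400_000])
tss = 100_000

counts = {
    pad: len(peaks_in_window(tss, peak_centers, pad))
    for pad in (50_000, 100_000, 250_000)
}
print(counts)
```

Every extra candidate peak is one more correlation to test against background, so the number of tests (and the opportunity for false positives) grows with the window.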

valentinaOpazo commented 7 months ago

Thanks for your answer. I re-ran the runFigRGRN function with k=10, and in both cases the knee in my plot was at n=~5 and the number and type of DORCs were very similar.
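One simple way to put a number on "very similar" across runs is a Jaccard index over the sets of results from each k. The Python sketch below is only an illustration: the TF-DORC pairs are invented placeholders, not real output, and `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Placeholder (TF, DORC) hits from two hypothetical runs with different k.
hits_k10 = {("TF_A", "GENE_1"), ("TF_A", "GENE_2"), ("TF_B", "GENE_3")}
hits_k30 = {("TF_A", "GENE_1"), ("TF_B", "GENE_3"), ("TF_C", "GENE_4")}

overlap = jaccard(hits_k10, hits_k30)
print(f"Jaccard overlap between runs: {overlap:.2f}")
```

Values near 1 would suggest the results are robust to the choice of k, while low values would suggest the neighbor pooling is driving the calls.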