databio / GenomicDistributions

Calculate and plot distributions of genomic ranges
http://code.databio.org/GenomicDistributions
Other
25 stars 10 forks source link

Add calcNearestGenes Function #168

Closed nleroy917 closed 2 years ago

nleroy917 commented 2 years ago

A function that will accept a query, and annotate said query by adding columns with additional information:

  1. Nearest gene;
  2. Distance to that gene;
  3. Type of region (promoter, enhancer, etc).

The function is speedy. Lots of issue however with query regions not existing in an annotation set, so I created a flag to remove "unknown" query regions at the end before dumping to a datatable.

nsheff commented 2 years ago

Hi @NLeRoy917 can you ensure this passes R CMD Check and BiocCheck?

also you will need to:

nleroy917 commented 2 years ago

@nsheff I made a plotting function to supplement the calc function as well. Here is a picture of the output:

image

I updated the vignette as well, however, I'm not sure how the vignette can build... The annotations file I am using is quite large (~100mb zipped) so I don't think it makes much sense to include that in extdata. As such, I am unsure of how to write the vignette such that a plot appears or the vignettes don't fail to build.

Once I solve that I can probably complete the rest of the required tasks.

nsheff commented 2 years ago

I'm not sure I see the value of the plot, for 2 reasons:

  1. the log distance is hard to interpret. See comments here: #61
  2. more importantly, I don't really follow what the plot is showing. what exactly do you mean by 'gene type' ? do you mean 'component of a gene annotation'? I think you're showing distance to gene components, but I don't really understand why you would want to look at that.

So if this is showing the output of the calc function, then I think the calc function results may not be relevant

nsheff commented 2 years ago

I updated the vignette as well, however, I'm not sure how the vignette can build... The annotations file I am using is quite large (~100mb zipped) so I don't think it makes much sense to include that in extdata. As such, I am unsure of how to write the vignette such that a plot appears or the vignettes don't fail to build.

The vignette must be able to build; otherwise you will need to include it in a long vignette that is unbuilt. The example for the introductory vignette has to be runnable with a small example dataset, ideally one that's already included in the repository.

nleroy917 commented 2 years ago

I'm not sure I see the value of the plot, for 2 reasons:

  1. the log distance is hard to interpret. See comments here: Ideas for visualizing clustering of regions in set #61
  2. more importantly, I don't really follow what the plot is showing. what exactly do you mean by 'gene type' ? do you mean 'component of a gene annotation'? I think you're showing distance to gene components, but I don't really understand why you would want to look at that.

So if this is showing the output of the calc function, then I think the calc function results may not be relevant

The idea in the plot was that I am showing the distance to the nearest gene/region by gene/region type (i.e. distribution of distances to nearest region where the nearest region is exon...)

The idea of the plot was to visualize what was going on in the function... so it sounds like I need to re-evaluate then the execution of the function.

nsheff commented 2 years ago

Here's some tools you can use to build and test the R package:

packageName="GenomicDistributions"

packageFolder=paste0("~/code/", packageName) # local path
setwd(packageFolder)
devtools::document(packageFolder)
devtools::install(packageFolder, upgrade="never")
devtools::test(packageFolder)
devtools::run_examples(packageFolder)
devtools::check(packageFolder)
devtools::build_vignettes(packageFolder)

You should run all of these to make sure the package is in good shape.

nleroy917 commented 2 years ago

@nsheff I updated the function. Simplified and did away with the plot. In the vingette I am using data internal to the package (TSS_hg19).

_(Note: in order to leverage the system.file() function to access the data, I copied the TSS_hg19.rda file to inst/extdata, which feels wrong, but this was the only way to access that annotation file when running/building the vignettes)._

To roxygenize I am running:

roxygen2::roxygenise()

Which runs fine.

Then, when running R CMD check I receive the following output for my function specifically:

> checking R code for possible problems ... NOTE
  calcNearestGenes: no visible global function definition for ‘nearest’
  calcNearestGenes: no visible global function definition for
    ‘distanceToNearest’
  calcNearestGenes: no visible binding for global variable ‘gene_id’
  Undefined global functions or variables:
    distanceToNearest gene_id nearest

Both nearest and distanceToNearest are GenomicRanges S4 methods. Documented here. It doesn't seem like they get added to NAMESPACE when I roxygenize. Is there something else I need to do to explicitly define these imports/dependencies?

nsheff commented 2 years ago

@kkupkova how are you recommending people access data included in the package? with data() rather than system.file ?

https://github.com/databio/GenomicDistributions/blob/79ef44f438356b67e73c51503178d99893291fd2/R/data.R#L23

nsheff commented 2 years ago

For the NOTEs, see here I think this is what you're running into:

https://github.com/databio/GenomicDistributions/blob/master/R/package.R#L45-L68

kkupkova commented 2 years ago

@kkupkova how are you recommending people access data included in the package? with data() rather than system.file ?

https://github.com/databio/GenomicDistributions/blob/79ef44f438356b67e73c51503178d99893291fd2/R/data.R#L23

I think that most of our examples are as data()

nleroy917 commented 2 years ago

@nsheff I think this might be wrapped up. Just waiting for the code review from @kkupkova. We can't auto-merge yet since my feature branch is based on master (my mistake) and not dev, and the NAMESPACE files are different there. I think the dev branch needs to be merged to master, I'll pull changes to my branch, then the merge conflicts will resolve.

I might be able to git rebase dev... I think. I don't want to blow away any changes I made, however.

kkupkova commented 2 years ago

I don't think the function does what it's supposed to do - why are you looking for overlaps when you are trying to find the nearest gene? I would recommend to look at calcFeatureDist function and base this function on that. Also, I don't think that current function can process GRangesList.

nleroy917 commented 2 years ago

I have an updated function that's based off the dev branch I will push here soon. I'm going to close this and open a new PR for that.