Bioconductor / Organism.dplyr

https://bioconductor.org/packages/Organism.dplyr

Filters #3

Open mtmorgan opened 7 years ago

mtmorgan commented 7 years ago

There are filter concepts in S4Vectors, ensembldb, and now here. Shouldn't we have just one? One thing that drove us to implement our own filters rather than re-using those in ensembldb was the ability to generate them programmatically, whereas the filters in ensembldb are all 'hand-crafted'.
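To illustrate what "generating filters programmatically" could look like, here is a minimal sketch in base R (methods package only). It is not the code from Organism.dplyr or ensembldb: the class name `DemoBasicFilter`, the helpers `filterName()` and `makeFilter()`, and the field names are all hypothetical and chosen for illustration.

```r
library(methods)

# Hypothetical virtual base class shared by all generated filters.
setClass("DemoBasicFilter",
         representation("VIRTUAL", value = "character", condition = "character"))

# Assumed database column names, in the snake_case style of a TxDb-like schema.
fields <- c("symbol", "tx_id", "gene_id")

# Turn a column name such as "tx_id" into a class name such as "TxIdFilter".
filterName <- function(field) {
    parts <- strsplit(field, "_", fixed = TRUE)[[1]]
    camel <- paste0(toupper(substring(parts, 1, 1)), substring(parts, 2),
                    collapse = "")
    paste0(camel, "Filter")
}

# Define one S4 class per field and return a constructor function for it.
makeFilter <- function(field) {
    cls <- filterName(field)
    setClass(cls, contains = "DemoBasicFilter")
    function(value, condition = "==") {
        new(cls, value = as.character(value), condition = condition)
    }
}

# Generate all filter classes and collect their constructors by class name.
constructors <- list()
for (field in fields) {
    constructors[[filterName(field)]] <- makeFilter(field)
}

# Usage: build a filter without ever having written its class by hand.
f <- constructors$TxIdFilter("uc001aaa.3")
```

Adding a filter in this sketch only means adding a string to `fields`; a hand-crafted approach needs a class definition and constructor per filter.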

jorainer commented 7 years ago

I agree, that's a very elegant solution. I'll check if I could use a similar implementation in ensembldb.

jorainer commented 7 years ago

OK, it should be easy to re-use the concept from Organism.dplyr in ensembldb; actually, I could just import all of the filters I need.

Two things however:

mtmorgan commented 7 years ago

I agree that they should be upstream. AnnotationDbi seems like a very heavy package. Is it sensible to introduce a Filters or AnnotationFilters package?

I also don't like the snake_Camel notation; probably it is staying too close to the original (TxDb) schema.

And any hope, @lawremi, of using the S4Vectors Filter stuff? My reservations are that it seems a little heavy for the current use, and that I have sometimes found myself in a place (sorry for the vagueness) where I could not easily implement my own filter (something about evaluation environments?) and had to dig up a years-old email from you for direction.

jorainer commented 7 years ago

An AnnotationFilters package would be great! Might also be very helpful for users so that they have a central entry point to the filters.

> I also don't like the snake_Camel notation; probably it is staying too close to the original (TxDb) schema.

Yes, it would be nice to drop all underscores from the database column names when generating the names of the filter objects. To map them back I see two options. The heavy one, which I'm currently using in ensembldb, is a dedicated column method that returns the correct database column. The second, more lightweight option is a function that uses a character vector mapping database column names to filter object names.
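The lightweight option could look roughly like the sketch below. The mapping vector `filterColumns` and the function `columnFor()` are hypothetical names, not part of ensembldb or Organism.dplyr, and the columns listed are only examples.

```r
# Hypothetical mapping: names are filter class names, values are database columns.
filterColumns <- c(
    SymbolFilter = "symbol",
    TxIdFilter   = "tx_id",
    GeneIdFilter = "gene_id"
)

# Return the database column for a filter, given either its class name
# or a filter object (in which case the first element of its class is used).
columnFor <- function(filter) {
    key <- if (is.character(filter)) filter else class(filter)[1]
    column <- filterColumns[key]
    if (anyNA(column))
        stop("no database column registered for filter '", key, "'")
    unname(column)
}

# Usage: recover the original column name from the underscore-free filter name.
columnFor("TxIdFilter")
```

Compared with a dedicated column method per filter class, the whole mapping lives in one vector, so a new filter only needs one extra entry.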

Regarding the S4Vectors FilterRules: I only had a quick glance at them and did not see a simple way to use them in ensembldb.

jorainer commented 7 years ago

@mtmorgan I really like the idea of an AnnotationFilters package that provides BasicFilter and some additional default filters that could be reused in Organism.dplyr and ensembldb. I think now might also be the best time to start implementing the package; later there might be too many changes to make in Organism.dplyr. As it is now, loading Organism.dplyr and ensembldb together breaks functionality in both. If you want, I can also contribute to that package.

mtmorgan commented 7 years ago

@jotsetung I started a package and invited you as a collaborator with admin rights: github.com/Bioconductor/AnnotationFilters.

@lawremi Should we be paying attention to S4Vectors::Filter*, or is that too ambitious?