Genentech / FacileDataSet

A fluent API for accessing multi-assay high-throughput genomics data.
MIT License
4 stars 0 forks source link

first go at making a FacileInterface #10

Closed phaverty closed 6 years ago

phaverty commented 6 years ago
lianos commented 6 years ago

More or less "Yes", except:

  1. Codebase has been 2 spaces indent instead of 4 ;-)
  2. Since default_assay() in (at least) the fetch_assay_data() S3 function, I think we need to promote at least the following to S3/API level:
    • default_assay
    • assay_names
  3. Given that you are calling this "FacileInterface" now (fine by me), does this mean we shouldreplace line 113 in the FacileDataSet constructor from this:
    class(out) <- c("FacileDataSet", "FacileDataStore")

    to this?

    class(out) <- c("FacileDataSet", "FacileInterface")

Related to (1) should we use styler with a proper style guide (for this codebase, something close to the tidyverse style guide would be first preference, although it is violated already in original codebase in a few places that will be hard to change (ie. the as.matrix, subset.threshold etc parameters in fetch_assay_data (and perhaps not worth while to change in the legacy part)) ... just a thought

lianos commented 6 years ago

Just to reiterate my feeling for a need to put "FacileInterface" or "FacileDataStore" or something in the class hierarchy because I'm pretty sure it will often be handy to have a quick way to see if some object is "facile" in nature (ie. supporting the facile api).

... in this way, maybe it makes sense to think of the FacileInterface/API being more like a Java abstract class as opposed as an implements interface ... know what I mean?

lianos commented 6 years ago

@jonocarroll you say:

my only concern (due to lack of experience using it) is that it might be possible to end up trying to use some functions on an object which does not formally inherit from FacileDataSet

I think there's a bit of confusion here, but I also think this confusion speaks a bit to my point 3 in this comment and my follow-up comment/clarification

Which is to say that I think you're saying that "sometimes you want to know if the object you have at least advertises itself as supporting the 'Facile API'"

As a point of clarification, though: nothing will likely inherit from FacileDataSet, you should rather think of the FacileDataSet as a reference implementation of a class that either:

  1. Completely implements the entirety of the "Facile API"; or
  2. A non-abstract subclass of some abstract public class Facile(DataStore|Interface) (in the Java sense).

Imagine a FacileBioc package that defines decorator classes like: FacileSummarizedExperiment, FacileMultiAssayExperiment, FacileDGEList, etc.

I expect having class(aFacileSummarizedExperiment) or class(aFacileDataSet), etc. return something like c("FacileSummarizedExperiment", "FacileInterface") and c("FacileDataSet", "FacileInterface") so that some code somewhere can just ask is(x, "FacileInterface") will prove useful.

This is also why I slightly prefer thinking of the "FacileInterface" more like a Java-"abstract class", and calling "FacileInterface" something like "FacileDataStore" because is(x, "FacileDataStore") just reads better to me.

jonocarroll commented 6 years ago

The confusion may be mine, but I was thinking of a scenario more along the lines of some function calling e.g. dplyr on a FacileDataSet-classed object and returning an object which is not of class FacileDataSet, then this object being used in a call to a foo.FacileDataSet method. In that case, the S3 dispatch will fail. That would be the fault of the first function not returning the right class, but we're forcing that restriction if we only have a foo.FacileDataSet method with a disabled generic.

I suggested S4 earlier as a more formal OO system, but if it's reference semantics you want then R6 is right there (😱).

I have no problem with having an additional class FacileInterface in the heirachy but it presumes you're extending some API, or would potentially have some methods with limited capability amongst these more fundamental objects. Since I only know of FacileDataSet on top of this so far, an example of either a second extension class or a limited generic might help it make more sense. In the case of your FacileSummarizedExperiment suggestion, what would be a common method to these which would not be specific to their primary class? TL;DR: what is the contact surface unique to FacileInterface? What would you do with is(x, "FacileInterface") that you wouldn't otherwise check for is(x, "FacileDataSet")?

lianos commented 6 years ago

@jonocarroll

I'll start by addressing your last TL;DR question:

TL;DR: what is the contact surface unique to FacileInterface? What would you do with is(x, "FacileInterface") that you wouldn't otherwise check for is(x, "FacileDataSet")?

There is no "unique contact surface". If "I" see a "FacileInterface" somewhere in an objects class hierarchy, it's just a step-0 sign that suggests to "me" (some code in some function) that this object supports the Facile API ... that's it.

To this end, a "FacileSummarizedExperiment" will be a simple decorator class around a SummarizedExperiment which will have all of the facile api functions defined (and working) over SummarizedExperiment assay storage container. This object will also have injected somewhere in its class() hierarchy "Facile(DataStore|Interface)".

If you're asking me "well what's the use of that?" Then I can only say that "interfaces" and/or "abstract classes" are just a useful tool to have in any OO language, and R doesn't officially have it is all. There is, of course, a big drawback in injecting some API or abstract class into the class hiearachy of "Facile data stores", which is that there are no runtime guarantees that the object, in fact, does implement all functions, but we've got our hands tied here ...

Back to your first paragraph

Could you give a pseudo-code snippet that mocks up the scenario that provides an example of the scenario you are thinking of?

To clarify part of a comment I made earlier, raw" dplyr verbs (like filter, arrange, etc.) won't be called on a FacileDataSet object, but may be called on a data.frame returned from a facile api function call (like fetch_assay_data). We will also need to add something to the class() hierarchy of such data.frames that come out of these functions (perhaps facileframe) and then have the FacileData package override the dplyr verbs to simply ensure we set_fds() on the outgoing dplyr result, eg. the FacileData package would have functions like:

   filter.facileframe <- function(.data, ...) {
     orig.classes <- class(.data)
     .fds <- fds(.data)
     class(.data) <- setdiff(orig.classes, "facileframe")
     res <- filter(.data, ...)
     class(res) <- orig.classes
     set_fds(res, .fds)
   }

(... how's that for a painful improv-implementation ...)

I pretty strongly believe that we don't need S4 or R6, so if you could provide some pseudo-code that outlines your scenario here, maybe we can figure this out a bit better?

To your last paragraph (FacileInterface hierarchy ...)

We are simply trying to enable multiple data storage implementations to be used as "drop in replacements" for functions in the "facileverse". A scenario would be something like this, let's take:

a. the FacileTCGADataSet I've built which has all of the rnaseq data extracted from the RangedSummarizedExperiment you can download from the TCGA dataset as well as other assay data from the TCGA. Let's call this object tcga.fds

b. Let's also take that same recount2 RangedSummarizedExperiment object itself and call this object tcga.se

Further imagine that both have "sample covariates" named "sex", which can be either "m" or "f". For the tcga.se object, this would be found in colData(tcga.se)$sex.

We'd like these function calls to both work, and (more or less) return equivalent data:

fds.samples <- filter_samples(tcga.fds, sex == "f")
se.samples <- filter_samples(tcga.se, sex == "f")

I'm pretty sure you get that though, what's confusing me is when you say:

I have no problem with having an additional class FacileInterface in the heirachy but it presumes you're extending some API

Sorry for being pedantic, but we're not extending an API, but rather just implementing the same "Facile API" for different data storage containers, and injecting an API-like (or "abstract class"-like) "thing" in the class hierarchy simply tells "whoever is asking" that this "thing" (a FacileDataSet, or a FacileSummarizedExperiment, or a FacileMultiAssayExperiment, or a FacileDGEList) does, in fact, support the facile api.

... of course the program can realize that some object "FacileSomething" doesn't support the API when it errors-out because "FacileSomething" doesn't have a fetch_assay_data.FacileSomething function defined, but I feel like it's handy to get a quick/dirty way to check that up front in different scenarios.

Does that help at all? If not, I think maybe we're just on the wrong frequency and will need another way to chat about this one ;-)

EDIT: actually, another example that might help motivate this is to think of some of the higher level functions that combine results from lower level ones. We have some functions like "assays_over_samples" or something (sorry, already closed my laptop for the night :-). As long as some of the lower level facile api functions are defined and return the results in the correct data.frame format, this function can be sure to work on anything that is a "FacileDataStore" ... does that help, hopefully?

jonocarroll commented 6 years ago

Okay, I think I get the motivation behind adding FacileInterface to the hierarchy, though it seems redundant until you have some 'Facile API' stuff that isn't 100% covered by a higher-up class (and perhaps you will very soon). I could see that something like summary.FacileInterface could provide a more general method to both FacileDataSet and FacileSummarizedExperiment. Using is.FacileInterface(x)/is(x, "FacileInterface") doesn't strike me as more useful than s/Interface/DataSet/ until you have at least one foo.FacileInterface method.

In your TCGA example I would expect a filter_samples.FacileDataSet and a filter_samples.SummarizedExperiment. If you're implying that these could both be handled by filter_samples.FacileInterface then I can only guess that it involves moving the logic to branching within the method, which probably means I've misunderstood.

My 'extends' comments were in the train of thought of tbl_df 'extending' data.frame in that there are some functions specific to tbl_df, but otherwise if methods aren't specifically implemented then they can fall back to a data.frame method. In this sense tbl_df extends the data.frame API. Following that logic, FacileInterface would be the more general, and FacileDataSet the more specific. At least that's how I understand S3 classing.

In our case we have a single method and a disabled generic, so I wanted to be careful about what we would be falling back to if the dispatch was not correct (if, through error or mishandling by the user, set_fds did not occur).

I suspect a lot more of my misunderstandings can be solved by me buying you a few beers, so I'll leave it there.

phaverty commented 6 years ago

Glad you guys are excited about this! Please let me clarify my terminology:

I'm happy with making FacileDataStore the parent class of any of our FacileXXX classes. But, I think we can do SummarizedExperiment and such without decorator classes. The only code that should care what type of object we have are the methods in FacileInterface. And, they already know what class you are because they are methods and you got into them via dispatch. Let's see where that gets us. If we have to do FacileSummarizedExperiment, so be it.

phaverty commented 6 years ago

Oh, and by AbstractFacileXXX I just meant AbstractFacileDataStore or AbstractFacileBackend or something. They can be only one.

phaverty commented 6 years ago

Also, did anyone get my "Fetch" joke in the comments?

VRouilly commented 6 years ago

If I may, I will join the fun.

@phaverty, I also really like that you got this conversation going on a concrete example of an "R-Interface" for current and future FacileDataStores.

As a side note, because of the is(x, "FacileDataStore"), highlighted by @lianos , I would also vote for the "interface" to be called "FacileDataStore" as opposed to FacileInterface.

In terms of the API methods you illustrate (organism, samples, fetch_XXX, ...), I am not sure to understand why the fallback Facile API functions are on .default. To me, as in the code we will always have a check on is(x, "FacileDataStore"), I would expect the fallback method to be provided by the FacileDataStore class. A .default for me would be the fallback for a method NOT implemented by a FacileDataStore. Does it make sense ?

phaverty commented 6 years ago

I'm happy to have the class for FacileDataSet be c("FacileDataSet","FacileDataStore") let's do that. I hope that we never need todo is(x,"FacileXXX") because we will always get that from dispatch.

The point of the default methods is that there is no default. You have to write a specific method for each class. With the inheritance above, you could fall back to the method on FacileDataStore, but these methods will probably all depend on the specific backend implementation. Where ever we can write generic code, we can put that in the default, or in the FacileDataStore, as low as possible. For example, if we can define fetch_x in terms of only fetch_y and fetch_z, then we win. I really like that pattern and I use it in gneDB and in my julia stuff.

lianos commented 6 years ago

@VRouilly: thank you for chiming in, I was going to poke you via a CC when I got to work this morning.

@phaverty: Pretty sure we're 99.83% driving towards the same destination:

  • FacileInterface, in the java sense, a set of methods a class must implement
  • FacileDataStore, the abstract class in the true class sense, I strongly prefer AbstractFacileXXX

If you feel strongly about sticking the Abstract in front of FacileDataStore, then go for it. If we align our lingo with Java world (which my frame-of-reference is unfortunately stuck around version Java SE 6), then I think that's likely technically the right name for it.

So, if we agree to inject AbstractFacileDataStore into the the root "facile-class-hierarchy" of every object that extends it, are we using "FacileInterface" as anything more than a documentation tool, then? I mean, do you imagine having something dispatch on FacileInterface in the code, or?

Just in the way that R works, I can't figure out how having both "AbstractFacileDataStore"and "FacileInterface" being used in the codebase can change something, so if so: happy to learn what you're thinking,

I totally get that using "FacileInterface" as a documentation "hook" is very useful, though.

I think I was thinking along the same lines as what @VRouilly mentioned re: "stop"-ing on facile *.deafult methods: if AbstractFacileDataStore is in the class hierarchy of "facile data containers", then it might seem more natural to have, say:

fetch_assay_data.FacileDataStore <- function(x, ...) {
  stop("This function needs to be implemented in a storage-specific manner")
}

However, I think @phaverty's motivation for rather stoping on the *.default method is because I think this would make sense if we're trying to "facilitate" things like SummarizedExperiment objects without having to introduce a decorator FacileSummarizedExperiment class first.

Related to this, if we did go the "decorator" route, then is(aSummarizedExperiment) would return something c("FacileSummarizedExperiment", "AbstractFacileDataStore", "SummarizedExperiment", "Vector", "Annotated"), or something along those lines.

@phaverty if you want to take a stab at going the "pure" API route w/o the decorator (not now, I know, just talking about sometime in the future), then I'm all for it.

In conclusion

Although there's a lot of discussion here, like I said before I actually think we're actually all driving towards the same vicinity given the "concluding" remarks from @phaverty and @VRouilly, so I "approve" of this PR in the general sense, because I think small technical details can be iterated over.

@jonocarroll you're right in that we only have one implementation of a FacileDataStore in the FacileDataSet, but S3-izing these functions in this way is just laying down a foundation that will enable us to more easily swap in other data stores that implement the FacileInterface in the (mid(?)) future?

I think @phaverty is excited about leveraging TileDB for instance(?) And we also had spoke briefly somewhere on github about trying to make a FacileDataStore that be queried remotely and push results through some type of network pipe ... imagine some REST-like interface to pull data from a web service, but only faster than that :-)

phaverty commented 6 years ago

FacileInterface is just a label for the set of functions. It will be seen in the Roxygen "family" tags and we can use it when we talk about these functions. It will not appear in the code.

phaverty commented 6 years ago

Yes, I want a few more backends. I'm interested in TileDB. We are re-doing the whole EP backend and I want facile to talk to that. There will be many ...

phaverty commented 6 years ago

I'm hopeful that we can pull this off without making everything inherit from AbstractFacileDataStore. At this point, my primary need is for a list of all the methods that make up the API so I can understand the landscape and think about building abstractions between the layers of the app.

VRouilly commented 6 years ago

As we, @phaverty @lianos @jonocarroll , all seem to be a similar mindset, how about getting practical to feed our discussions ? It would mean to have a go at implementing a first draft of these other backends ? It could helpful to iron-out some of the details of the generic interface.

In terms of backends of immediate interest, do we agree with this list:

Is it the right time to do this ?

ps: could you add me as a reviewer so I can get notifications, thx.

lianos commented 6 years ago

@VRouilly I agree with you that having each of us try and hammer out facile versions of different data stores will help to flesh out the API and its corner cases, but I don't think that right now is the right time to do really do it.

I think laying down the groundwork as we've done here to just enable to do that in the future w/o having to change "much" in the core codebase is sufficient, for now.

Again: this is just my opinion given time we all have to contribute directly to this effort and certain deadlines we (I, at least) would love to meet (with this and other projects).

phaverty commented 6 years ago

TileDB may be part of ExpressionPlot or I may swap it in for HDF5 in FacileDataSet2. For ExpressionPlot, we'll just use their API to ask for a SummarizedExperiment for a given project. I'll probably try that out so I can see if this Interface provides an abstraction that is sufficient to separate FacileExplorer from FacileDataSet.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Tue, Aug 21, 2018 at 11:31 AM, Steve Lianoglou notifications@github.com wrote:

@VRouilly https://github.com/VRouilly I agree with you that having each of us try and hammer out facile versions of different data stores will help to flesh out the API and its corner cases, but I don't think that right now is the right time to do really do it.

I think laying down the groundwork as we've done here to just enable to do that in the future w/o having to change "much" in the core codebase is sufficient, for now.

Again: this is just my opinion given time we all have to contribute directly to this effort and certain deadlines we (I, at least) would love to meet (with this and other projects).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Genentech/FacileDataSet/pull/10#issuecomment-414776759, or mute the thread https://github.com/notifications/unsubscribe-auth/AH02K34cVWCfKmrXBCVKKD2NWKnBZ7yAks5uTFIVgaJpZM4WFFgH .