Genentech / FacileDataSet

A fluent API for accessing multi-assay high-throughput genomics data.
MIT License
4 stars 0 forks source link

Let's drop multi-column pData features! #1

Closed phaverty closed 6 years ago

phaverty commented 6 years ago

If we are good with using Surv objects as pData columns, we can delete all kinds of code required for two-column features.

lianos commented 6 years ago

Are you referring to dropping these everywhere, or just for FacileDataSet creation and/or update?

Although I haven't tested it yet, I really like that you can use a Surv object as a column in a data.frame that you can use when running something like as.FacileDataSet ... I can totally see how this makes lives much easer.

But once the data type is dumped into a FacileDataSet (as right_censored I presume(?), but have to check) bringing it back out as a Surv object and manipulating it "fluently" can get tricky. For instance, subset-ing a data.frame seems to preserve things, but filter-ing it doesn't (the Surv column is simply turned into a double.

Also I reckon this will also translate into funky interactions when using with_sample_covariates, but need to noodle/play with this a bit ...

Just to clarify my thinking, what is giving me the most pause with this statement is how one interacts with a FacileDataSet after its creation, ie. when trying to extract data from it "fluently" ...

jonocarroll commented 6 years ago

I've been looking into this since @phaverty mentioned it was possible because it struck me as quite surprising.

It turns out you can store anything (any class foo) you want in a data.frame column as long as:

That's it. https://gist.github.com/jonocarroll/3a0ff7661f744b5cca25cc3865001272

Surv goes one step further in that it additionally has its own logic in str which makes the nice format appear there too.

This doesn't play nice with dplyr however, since it's merely the formatted-printing which is making it look like a normal data.frame

library(survival)
mysurv <- Surv(lung$time, lung$status)
survdf <- data.frame(n = 1:228, mysurv)
head(survdf)
#>   n mysurv
#> 1 1   306 
#> 2 2   455 
#> 3 3  1010+
#> 4 4   210 
#> 5 5   883 
#> 6 6  1022+
dplyr::filter(survdf, n < 5)
#> Error: Column `mysurv` must be a 1d atomic vector or a list

My vote would be for allowing input of the data as a Surv object (in which case there is no ambiguity over definitions) but as soon as it hits the internal data structure split it back out into time and event columns (the definitions of which we can now trust).

VRouilly commented 6 years ago

Good catch on the incompatibility with dplyr chain ! it is a shame. It looks like a 'Surv' object would still be helpful to streamline data input, and its character representation could still be used in the EAV table. However, we would need to handle a 'Surv' category in FacileData::castcovariate function to decode the 'Surv' character representation into the current 2 columns 'tte' and 'event_'. This asymmetric representation between data-in and data-out does not feel very satisfying though.

phaverty commented 6 years ago

Bummer about the tidyverse. We'll have to dump that. ;-) I already have a work-around for the known bug in bind_rows, but we can't work around everything.

Instead, let's keep the Surv object is its as.character() representation everywhere and have the survival module coerce it to Surv at the last minute. We can still mark it the character column with a "right_censored" Facile class label or the like.

FWIW, base R has special case code for Surv as a data.frame column, so it is officially supported to some degree.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Tue, Apr 24, 2018 at 10:08 PM, Vincent Rouilly notifications@github.com wrote:

Good catch on the incompatibility with dplyr chain ! it is a shame. It looks like a 'Surv' object would still be helpful to streamline data input, and its character representation could still be used in the EAV table. However, we would need to handle a 'Surv' category in FacileData::castcovariate function to decode the 'Surv' character representation into the current 2 columns 'tte' and 'event_'. This asymmetric representation between data-in and data-out does not feel very satisfying though.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Genentech/FacileData/issues/1#issuecomment-384162099, or mute the thread https://github.com/notifications/unsubscribe-auth/AH02K-FdSKKbV0iQYrjvcWA0OuR3ogqHks5tsAS1gaJpZM4TiBUz .

phaverty commented 6 years ago

Now that we have the cSurv class, we could do this. I'm keen to do it just to simplify things. @lianos, if its a feature you may want to use again, let's leave it. (I think it is cleaner to define new S3 vector classes to use as pData columns, but I can mind my own business.)

lianos commented 6 years ago

I don't know ... I'm not sure I'm too keen to drop it ... I reckon it can still be a handy mechanism to have down the road.

We definitely don't have to use it at all ... just put it in its own *.R script and keep the functions internal for now.

phaverty commented 6 years ago

OK, I'll leave it alone. I think my personal goal for this package is just to S3-ize all the functions I'll need to make FE run of alternate backends. I'm always up for deleting code and optimizing stuff, but I'll try to remember to stay with the main thread.

lianos commented 6 years ago

FWIW, I'm all for S3ing all the things

phaverty commented 6 years ago

Do you think we just need to S3 the ones we have? Or do we need everything that as an assert_FacileDataSet() call at the top?

lianos commented 6 years ago

Are you asking "in theory" or "in practice"? :-)

I guessing that having to assert if an object is of the class you expect is a code smell which should suggest that we should be really doing it "the OO" way, ie. s3-izing the functions that start with assert(x, "FacileData(Set|Store)").

So long term, yes? Right now, I am not too bothered by it, and I think the exercise of refactoring the FacileExplorer + adding new backends will better inform what really needs to get done, so: I'll leave that up to you and Vincent to be aggressive there when you feel like you have to since you folks are first-to-touch the FE stuff.

For me, the only new S3 thing I want to add is introducing the facile_frame, ie. add a class to the data.frames/tibbles returned by the fetch_* and which_* functions so the intercept dplyr verbs and other means of selecting/slicing data.frames to simply ensure that the set_fds(result, fds) stuff is automatic.

If that doesn't make complete sense, it will soon enough. This should also be a non-breaking change to the upstream code in the FacileExplorer you folks are playing with.

phaverty commented 6 years ago

I'm OK with the assertions being there. I'm wondering if a class that only implemented the FacileInterface would work with FE.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Mon, Aug 27, 2018 at 11:12 AM, Steve Lianoglou notifications@github.com wrote:

Are you asking "in theory" or "in practice"? :-)

I guessing that having to assert if an object is of the class you expect is a code smell which should suggest that we should be really doing it "the OO" way, ie. s3-izing the functions that start with assert(x, "FacileData(Set|Store)").

So long term, yes? Right now, I am not too bothered by it, and I think the exercise of refactoring the FacileExplorer + adding new backends will better inform what really needs to get done, so: I'll leave that up to you and Vincent to be aggressive there when you feel like you have to since you folks are first-to-touch the FE stuff.

For me, the only new S3 thing I want to add is introducing the facileframe, ie. add a class to the data.frames/tibbles returned by the fetch and which_ functions so the intercept dplyr verbs and other means of selecting/slicing data.frames to simply ensure that the set_fds(result, fds) stuff is automatic.

If that doesn't make complete sense, it will soon enough. This should also be a non-breaking change to the upstream code in the FacileExplorer you folks are playing with.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/Genentech/FacileDataSet/issues/1#issuecomment-416317114, or mute the thread https://github.com/notifications/unsubscribe-auth/AH02K8zu3uX17JkHW21Ti4NikneFv2Xkks5uVDaKgaJpZM4TiBUz .

lianos commented 6 years ago

Unless it turns out turning into some herculean effort, I thought we were driving towards a future (however far away) where the answer to that question is "yes" ... no?