Genentech / FacileDataSet

A fluent API for accessing multi-assay high-throughput genomics data.
MIT License
4 stars 0 forks source link

make fetch_ fully backend-agnostic #21

Open phaverty opened 5 years ago

phaverty commented 5 years ago

To support other backends, I'm trying to confine all of the backend-aware code to the _tbl functions and fetch_assaydata. The other fetch functions sometimes call collect on a data.frame to make sure they do not have a lazy DB query. Can we make these functions not care if they have a lazy DB query? Which dplyr-ish things only work on a real tbl?

Also, the "collect" strips the "fds" attribute, which requires a "force(.fds)" to evaluate an arg like ".fds = fds(x)" before the "fds" falls off of "x". This highlights the original problem, but also the problem with attributes. This is a job for an S4 FacileFrame.

lianos commented 5 years ago

Two quick notes:

  1. I originally wanted to be all agnostic w/ no mention of collect most-anywhere, and the first round of the codebase was like that.

    In theory it should work, but in practice it didn't. A concrete example of one reason why it might not have was that up until very recently the SQLite backend didn't support windowing functions (because SQLite didn't), while the postgres backend would.

    Another thing that I do recall happening was that every now and again you'd get random / irreproducible errors thrown in TFE, and sprinkling in collect here and there seemed to take care of it.

  2. Again, I don't think S4 necessarily needs to be the answer here. The facile_frame S3 class (which the fetch_* function will return, will override the dplyr verbs (and subset, and "normal" bracket indexing), to ensure that they always spank the fds back into an attribute on the way out should also take care of it.

phaverty commented 5 years ago

S3 facile_frame works for me. Bummer about the collect calls. It would be nice to allow the lazy query objects in more places. There are some places with same_source calls that perform not-so-great if you have to pull one of the data frames into memory for a join.

phaverty commented 5 years ago

Maybe bump the RSQLite version (anyway)?

lianos commented 5 years ago

Sorry, but bumping RSQLite just can't be an option ... we need this to work on simple setups (single datasets, single analyst machines) to the *clouds of the world ...

jonocarroll commented 5 years ago

Which dplyr-ish things only work on a real tbl?

n(). The number of rows is not (easily) available without a collect(). This appears to be a design choice.

phaverty commented 5 years ago

Cool, I thought it was more. This is a major pet peeve for Michael, BTW. If you ever want to side-track a meeting ... I really thought there were others.

Pete


Peter M. Haverty, Ph.D. Genentech, Inc. phaverty@gene.com

On Fri, Sep 21, 2018 at 11:59 AM, Jonathan Carroll <notifications@github.com

wrote:

Which dplyr-ish things only work on a real tbl?

n(). The number of rows is not (easily) available without a collect(). This appears to be a design choice.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/Genentech/FacileDataSet/issues/21#issuecomment-423639207, or mute the thread https://github.com/notifications/unsubscribe-auth/AH02KyLlqGvYfIb2WLG_4YR_DHLkU0AFks5udTcFgaJpZM4W0jYo .

jonocarroll commented 5 years ago

Oh, I don't mean that's all of them, but that's a big one.