bedatadriven / activityinfo-R

ActivityInfo R Language Client
https://www.activityinfo.org/support/docs/R/
17 stars 12 forks source link

Eager collection (View(), dplyr verbs, etc.) #77

Closed nickdickinson closed 1 year ago

nickdickinson commented 1 year ago

@akbertram The eager collection for View() and names() does not work because it breaks RStudio. If I implement names() to show colnames the behavior of View() gets weird and breaks. I think this is why it is also not implemented for dbplyr. @Ryo-N7's glimpse() and tibble::view() are work arounds. I will extend view() to work specifically with the remote record object. I think that RStudio puts a wrapper around View() and uses its own implementation that does not try to coerce to data.frame. It will be relatively easy for people to use lowercase view().

nickdickinson commented 1 year ago

I've now explictly exported tibble::view() so that users dont have to do that

Ryo-N7 commented 1 year ago

yup, can confirm: view(), View(), and glimpse() all work with a records lazy df object as intended!

akbertram commented 1 year ago

I spent some quite some time playing with this yesterday, actually used to analyse some data I had stored in ActivityInfo. It works quite nicely!

However, one thing I quickly realized is that a tbl_activityInfoRemoteRecords is not and can't be a perfect replacement for a data.frame or a tibble. I think Nick you were trying to explain this to me but I didn't fully get it.

For example, when I tried to use the $ or [ operators, as in records[1:2, "status"] you don't get any results, and we really wouldn't want to do an HTTP request each time. Same thing if you try to modify with $<- operator. View() doesn't work as expected (even if there are alternatives).

My take away from this is that in the documentation and training, we need to be very clear about using collect(). Auto collecting on unsupported dplyr verbs is very helpful, but still, you need to understand how laziness works in order to use the getRecords effectively.

Another observation is that the vast majority of ActivityInfo forms are relatively small, less than a thousand records, which means that the server can serve them almost directly from memcache and laziness doesn't buy you a performance gain. (Will probably be marginally slower because you're making multiple http requests).

Given the extra cognitive load for new users in understanding how to use lazy tibbles, I am wondering if we should make laziness opt-in ? That make using the function easy for new users, but give you a clear path for managing large datasets when needed. Something as simple as:

getRecords("ceam1x8kq6ikcujg", lazy = TRUE)

Now the downside of this is then there are two ways of getting records. The alternative is that we always include collect() in examples, even trivial ones:

 getRecords("ceam1x8kq6ikcujg") |> collect()

Though in this case I think we may need to attach dplyr so that people don't get tripped up if collect() is not on the search path.

@Ryo-N7 @nickdickinson what do you think?

nickdickinson commented 1 year ago

I like the idea of adding collect() everywhere to the documentation to emphasize that getRecords() is really designed to work with the dplyr verbs and not as a data.frame replacement.

The other place we use remote records objects is when one uses the convenience functions modifying / extracting copies of fromSchema and to use migrateFormField(). For that reason, I prefer to put collect() everywhere. We can still add the lazy argument = TRUE as default. If we set disable laziness by default then we would need to re-implement migrateFormField() and extractSchemaFromFields() to work with our custom tibble that stores the schema in the attributes. This is not difficult but it will allow users to do strange things like try to migrate an R only column to an existing field. Interesting case but maybe we don't want this complexity just yet!

Also, we can export collect() so it is always on the search path without the dplyr library. But I think it would be better to always include dplyr in examples anyway.

Ryo-N7 commented 1 year ago