florianhartig / DHARMa

Diagnostics for HierArchical Regression Models
http://florianhartig.github.io/DHARMa/

Make DHARMa useable for big data / memory management #188

Open florianhartig opened 4 years ago

florianhartig commented 4 years ago

Connected to #187, I wonder if there is any elegant solution to handle the case that simulations do not fit into memory. This is just a reminder for future development; there are no immediate plans to implement anything.

dschoenig commented 3 years ago

Hi Florian,

I think using the ff package would be the best way to implement this feature, as it allows a large simulation matrix to be stored on disk and also comes with a set of apply functions to make the performance hit bearable. It would also increase the dependencies by only one package.
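A minimal sketch of that idea, not DHARMa's actual API: the simulation matrix is backed by a file on disk via `ff`, and is filled in column batches so the full matrix never has to reside in RAM. Here `rnorm()` is just a stand-in for the batches that `simulate()` on a fitted model would produce.

```r
library(ff)  # on-disk storage for large arrays

n_obs <- 1e4   # number of observations
n_sim <- 200   # number of simulations

# simulation matrix backed by a file on disk; only the pages currently
# accessed are mapped into RAM
sims <- ff(vmode = "double", dim = c(n_obs, n_sim))

# fill the matrix in column batches; rnorm() stands in for the output of
# simulate(fittedModel, nsim = length(cols)), generated batch by batch
batch <- 50
for (start in seq(1, n_sim, by = batch)) {
  cols <- start:min(start + batch - 1, n_sim)
  sims[, cols] <- rnorm(n_obs * length(cols))
}
```

The `ff` object can then be indexed like an ordinary matrix, which is what makes the performance hit bearable for downstream computations.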

I decided to give it a try, based on some code I used to compute quantile residuals for very large models. I settled on the following workflow:

Some more comments and gotchas:

Apart from general hardware constraints, I see two use cases (and their combination) for storing the simulations on disk:

  1. One wants to strongly increase the number of simulations to stabilize the residuals (e.g. n = 2000).
  2. One would like to work with very large data sets (millions of observations).

For (1), the solution I outlined works out of the box. But (2) will cause most of the nice functionality of DHARMa, such as plots and tests, to become extremely slow, as computation time mostly scales with the number of observations (or residuals). However, I'm not sure if DHARMa should even accommodate use case (2). There may not be much overlap between people who work with models of that size and those who would like to use DHARMa to evaluate them.
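To illustrate why computation scales with the number of observations: the scaled (quantile) residual for each observation is essentially the position of the observed value within its row of the simulation matrix, so every observation requires a pass over its own row. A chunk-wise version of this (a sketch, not DHARMa's internal code; an in-memory matrix stands in for the on-disk one) keeps memory flat while the runtime still grows with `n_obs`:

```r
set.seed(1)
n_obs <- 1000; n_sim <- 250
sims <- matrix(rnorm(n_obs * n_sim), n_obs, n_sim)  # with ff, read from disk
obs  <- rnorm(n_obs)                                # observed responses

chunk_size <- 200
res <- numeric(n_obs)
for (start in seq(1, n_obs, by = chunk_size)) {
  idx   <- start:min(start + chunk_size - 1, n_obs)
  chunk <- sims[idx, , drop = FALSE]   # only this chunk is held in memory
  # proportion of simulations below the observed value = empirical PIT residual
  res[idx] <- rowMeans(chunk < obs[idx])
}
# under a correctly specified model, res is approximately uniform on [0, 1]
```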

A combination of (1) and (2) would almost certainly run into R's address limit. This is currently the case for some models I'm working with (10^7 observations with 1000 simulations), and it can be dealt with by splitting the simulation matrix column-wise into several objects that are gathered as a list and looped over. But then again, I don't intend to use DHARMa with these models, and I doubt that anybody would in this case -- so implementing this "splitting" functionality is probably not worth the effort.
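The column-wise splitting described above could look roughly like this (purely illustrative, with dummy data): the simulation matrix is held as a list of column blocks, and per-observation counts are accumulated across blocks, giving the same scaled residuals as one big matrix would.

```r
set.seed(1)
n_obs <- 500; n_sim <- 300; n_blocks <- 3

# simulation matrix split column-wise into a list of smaller blocks
blocks <- lapply(seq_len(n_blocks), function(b)
  matrix(rnorm(n_obs * n_sim / n_blocks), nrow = n_obs))
obs <- rnorm(n_obs)  # observed responses

# accumulate, per observation, how many simulations fall below the observed value
below <- numeric(n_obs)
for (b in blocks) below <- below + rowSums(b < obs)
res <- below / n_sim   # identical to computing on the unsplit matrix
```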

I'll set up a pull request so you can have a look at the code.

florianhartig commented 3 years ago

Hi Daniel,

many thanks for this, it looks great at first glance. I have to apologise in advance for probably not being able to respond to this immediately; I am super busy this week, but will try to look at the PR as soon as possible.

Best, Florian