#29 optimize abe_objects

biometrician commented 1 year ago

An abe_object contains at least min(1, length(alpha)) (min(1, length(tau)) num.resamples model fits. If Wallisch2021 is selected, then twice this number. With our examples, this is not a problem. But with large data sets, this might be an issue.

We could refrain from saving $x and $y in each model fit. Instead we just need to keep the subject IDs of each resample and the original data set once. Then if need be, the data of the specific resample can be easily reconstructed.

What do you say? (I will also discuss this with Michael Kammer.)

I do not think, that the data of the resamples are currently needed somewhere in the code.

rokblagus commented 1 year ago

Seems like a really major change, which I don't feel confident doing at this point. However, it is definitively sth which should be discussed in the future if it proves that it can cause issues.

biometrician commented 1 year ago

We discussed yesterday, that we might want to save the seed of abe.objects for reproducibility.

One idea might be to add a seed parameter to abe.resampling. The user can specify it or if it is empty the current random.seed is saved.

Something like that (maybe there are better solution):

x <- .Random.seed
result <- runif(n=10)
attr(result, "seed") <- x

.Random.seed <- attr(result, "seed")
result2 <- runif(n=10)

cbind(result, result2)

Rok and Gregor, what do you think?

I would suggest to collect ideas and then discuss them together - maybe invite Michael Kammer - as he has a lot of experience with this kind of coding.

Michael recommended to do the resampling at the beginning all at once and later do all the computations. Separate things where chance is playing a role with the rest.
The size of abe.objects can get incredible large. For a small data set I got an abe.object of size 4.6GB!Finally, we will have to save only results which are really needed later. For each resampled model one line in a matrix with all relevant information would be optimal. This would require major changes in the package. But it would also simplify extracting results from the object. (Currently we have to extract everything from lists.) As a first start, at least delete $X and $Y. In $model.parameters we can keep a counter how often each id was selected. (sparse matrix)
Currently abe.object$models can be the result of 3 types of resampling methods. Wouldn't it be less prone to errors if we have in the abe.object$bootstrap, abe.object$mnbootstrap, abe.object$subsampling. Depending on the call only one or two of those elements are filled and the rest is empty.

rokblagus commented 1 year ago

Adding the seed (and letting the user to specify it) is important and something that could be implemented rather quickly, we should simply see how it is done in other packages, e.g. they just recently added this option to the glmnet package.
Not sure what is meant here, I guess it refers to moving ids<-sample(...) outside of the for boot loop, which could easily be done and is in fact a good suggestion. As per parallelizing the computation, I don't think this will be problematic, since we would only have to replace for with foreach or similar. We should check how this is done in other similar packages, e.g. glment.
The way how I would approach this, is I would prepare a function, which would only recover the necessary info from a call to abe.boot.x which are called within abe.resampling. But this will then affect all the other functions (print, plot, summary,...), unless we keep the structure of the returned object exactly the same, only omitting the parts that we are absolutely sure that we will not need.
No, currently you only either have one type (any of the 3 resmpling models) or two of these if using Wallisch option. Having 3 as suggested will be in my opinion more problematic, since it would require changing all the helper functions. I also don't see how this would be less prone to errors, I would think the opposite. This would be different if we would allow the user two simultaneously perform more types, but I don't think we should go that far.

rokblagus commented 1 year ago

I made substantial update to abe.resampling so that

id of the selected samples is available in the output (+ the ids are selected prior to for boot loop, hence are the same for all alpha and tau).
we only keep the coefs and terms of the refitted models, which required making changes also to summary, plot and pie functions. Please check if everything still works.

Let me know, how I can add the seed option (or refer me to the relevant code in the relevant package). Let me know, how I can add the option for parallel computation (or refer me to the relevant code in the relevant package).

rokblagus commented 1 year ago

About saving only some parts: we could have an argument: save=c("minimal","all"), where the user could choose to save only the minimal quantities (coefs and terms, default) or everything (as it was before). In the later case we could issue a warning that the size of the object might be too large. What do you think?

biometrician commented 1 year ago

Hi, Thanks a lot for your efforts.

id of the selected samples is available in the output (+ the ids are selected prior to for boot loop, hence are the same for all alpha and tau). ==> My first instinct is, that this is dangerous as we reduce the variability. I will think that through.

Let me know, how I can add the option for parallel computation (or refer me to the relevant code in the relevant package). I am not an expert on parallelization, but I thought of the foreach solution. see e.g. https://nceas.github.io/oss-lessons/parallel-computing-in-r/parallel-computing-in-r.html

@gregorsteiner, what do you think?

biometrician commented 1 year ago

About saving only some parts: we could have an argument: save=c("minimal","all"), where the user could choose to save only the minimal quantities (coefs and terms, default) or everything (as it was before). In the later case we could issue a warning that the size of the object might be too large. What do you think?

Of course, this is an option. @rokblagus, do you think it is really worth the hassle? Is there a case, where one would need all results?

rokblagus commented 1 year ago

It does not cost us anything, so I don't see why we would not offer this option. The latest version has this option, but we can easily remove it if we decide to do so.

rokblagus commented 1 year ago

About: "id of the selected samples is available in the output (+ the ids are selected prior to for boot loop, hence are the same for all alpha and tau). ==> My first instinct is, that this is dangerous as we reduce the variability. I will think that through."

My opinion is the opposite, if I want to see how changing e.g. alpha affects the stability of the model, then it is only fair that different alphas are evaluated using the same data. Can easily be changed back to what we had before though.

About using foreach, I think this could be problematic, since we have a list to which we save the results, not sure how this would work with foreach.

rokblagus commented 1 year ago

I added the option of using foreach, see the notes in the commit. As far as I am concerned this takes care of everything that was discussed here.

rokblagus commented 1 year ago

As per "We discussed yesterday, that we might want to save the seed of abe.objects for reproducibility." If the only goal is to have reproducible results, then there is a very easy way of doing this, which I have now added. Now I think this really addresses everything within this topic. However, since it was a major update, we should check if everything works as it should. @gregorsteiner : we might have some noncompatibility issues with the abe.resampling function, since I think you were simultaneously working on it in your branch, please be careful to take my final version when making the merge.

biometrician / abe

#29 optimize abe_objects #29