Improving speed - caching for adjustment data

matherealize commented 1 year ago

Continues #12

I think another great speed-up can be achieved by optimizing transform_data_step (formerly extract_adjustment_data), which generates the data required to fit the models in each step of the mfp cycles.

This function repeatedly generates the fp transformations for all adjustment variables. It could be optimized by storing the transformed data and re-using them until the powers change. This may be complicated to implement, but could help a bit I guess.

An example: say you have 10 continuous variables x1 to x10, some of them with fp transform (e.g. fp2). Now in each step of a cycle one of these variables is evaluated given the current powers p1 to p10 of the other variables.

In step 1, x1 is evaluated given the other variables x2 to x10 and powers p2 to p10 and a new power p'1 for x1 is found.
In step 2, x2 is evaluated given powers p3 to p10 for x3 to x10, and power p'1 for x1. And so on.

extract_adjustment_data (now transform_data_step) generates these transformations again and again in every step. It may be worth storing the current transformations in a list of lists indexed by variable and power. From this list we can retrieve a transformation if it is available, and update it when a new power is found. In the example above:

In step 1, x1 is evaluated given the other variables x2 to x10 and powers p2 to p10 and a new power p'1 for x1 is found. The list transformed_x is updated to hold all the transformed data of x1 to x10.
In step 2, x2 is evaluated given powers p3 to p10 for x3 to x10, and power p'1 for x1. In this step we can make use of the data in transformed_x and do not have to re-compute anything except x2, since the transformed data for x1 and x3 to x10 were already generated and stored in step 1.

Thus, in each step (except the initial one), we only have to compute one variable at a time, and can retrieve all other data from the list.

Importantly, the list must be updated in each cycle, since in each cycle the variables re-enter the selection procedure (as outlined in e.g. Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society: Series A (Statistics in Society). 1999;162:71-94.)
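
To make the idea concrete, here is a minimal sketch of such a cache. It is written in Python for illustration only and is not the mfp2 implementation: the names fp_transform, TransformCache and adjustment_columns are hypothetical, and the fp transformation is simplified (no shifting or scaling, all values assumed positive). Each cache entry stores the powers it was built with, so a lookup with changed powers recomputes that one variable and overwrites the stale entry.

```python
import math


def fp_transform(x, powers):
    """Simplified FP terms for one variable (assumption: all x > 0).

    Returns one column per power: x**p, log(x) for p == 0, and an extra
    log(x) factor when a power is repeated (e.g. FP2 with p1 == p2).
    """
    cols = []
    for i, p in enumerate(powers):
        if i > 0 and p == powers[i - 1]:
            col = [c * math.log(v) for c, v in zip(cols[-1], x)]
        elif p == 0:
            col = [math.log(v) for v in x]
        else:
            col = [v ** p for v in x]
        cols.append(col)
    return cols


class TransformCache:
    """Hypothetical cache: variable name -> (powers, transformed columns)."""

    def __init__(self, data):
        self.data = data        # dict: variable name -> list of values
        self.store = {}
        self.n_computed = 0     # counts actual (re)computations

    def get(self, name, powers):
        powers = tuple(powers)
        cached = self.store.get(name)
        if cached is not None and cached[0] == powers:
            return cached[1]    # powers unchanged: reuse stored columns
        self.n_computed += 1
        cols = fp_transform(self.data[name], powers)
        self.store[name] = (powers, cols)   # overwrite the stale entry
        return cols

    def adjustment_columns(self, current_powers, exclude):
        """Columns of all variables except the one currently evaluated."""
        out = []
        for name, powers in current_powers.items():
            if name != exclude:
                out.extend(self.get(name, powers))
        return out


data = {"x1": [1.0, 2.0, 3.0], "x2": [2.0, 4.0, 8.0], "x3": [1.5, 2.5, 3.5]}
powers = {"x1": (1,), "x2": (1,), "x3": (-2, 0.5)}
cache = TransformCache(data)

adj1 = cache.adjustment_columns(powers, exclude="x1")  # step 1: computes x2, x3
powers["x1"] = (2,)                                    # new power found for x1
adj2 = cache.adjustment_columns(powers, exclude="x2")  # step 2: computes x1 only
```

In this toy run, step 2 triggers a single new computation (x1 with its new power), while x3 is served from the cache. Because lookups compare the stored powers with the requested ones, variables that re-enter the selection procedure in a later cycle are only recomputed if their powers actually changed.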

EdwinKipruto commented 1 year ago

I agree, Michael. We can improve the speed dramatically. When you come to Freiburg, we can work on this together.

EdwinKipruto commented 1 year ago

acd(x) should be calculated once and used in each cycle since it does not change. This will improve the computation speed.
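
A rough sketch of the "compute once, reuse in every cycle" pattern, in Python for illustration only. The function acd_placeholder is a hypothetical stand-in (scaled ranks) and does not reproduce the actual acd transformation; the point is only that the call happens before the cycles start and the result is reused inside them.

```python
calls = {"n": 0}  # counts how often the transformation is computed


def acd_placeholder(x):
    """Stand-in for acd(x): ranks scaled into (0, 1). Not the real ACD."""
    calls["n"] += 1
    order = sorted(range(len(x)), key=lambda i: x[i])
    out = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        out[i] = rank / (len(x) + 1)
    return out


x = [3.2, 1.1, 2.7, 4.8]
acd_x = acd_placeholder(x)  # computed once, before the cycles start

for cycle in range(3):
    for step in range(5):
        values = acd_x      # reused as-is; no recomputation inside the loops
```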

EdwinKipruto / mfp2

Improving speed - caching for adjustment data #22