EdwinKipruto / mfp2

Improving speed - caching for adjustment data #22

Open matherealize opened 1 year ago

matherealize commented 1 year ago

Continues #12

I think another great speed-up can be achieved by optimizing transform_data_step (formerly extract_adjustment_data), which generates the data required to fit the models in each step of the mfp cycles.

This function regenerates the fp transformations for all adjustment variables every time it is called. It could be optimized by storing the transformed data and reusing them until the powers change. This may be complicated to implement, but could help a bit I guess.

An example: say you have 10 continuous variables x1 to x10, some of them with an fp transformation (e.g. fp2). In each step of a cycle, one of these variables is evaluated given the current powers p1 to p10 of the other variables.

transform_data_step generates these transformations again and again in all steps. It may be worth storing the current transformations as a list of lists indexed by variable and power. From this list, we can retrieve a transformation if it is available, and add it if a new power is found.

In the example above, this means that in each step (except the initial one) we only have to compute the transformation for one variable at a time, and can retrieve all other data from the list.
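The caching idea can be sketched as follows. This is only an illustration in Python, not code from the package: the names fp_transform, TransformCache and adjustment_data are made up for this example, the data are toy values, and the shifting and scaling that a real implementation applies before transforming are left out.

```python
import math


def fp_transform(x, powers):
    """Fractional polynomial terms for positive values x and a tuple of powers.

    Power 0 means log(x); a repeated power multiplies the previous term by log(x).
    """
    columns = []
    previous_power, previous_column = None, None
    for p in powers:
        if p == previous_power:
            column = [v * math.log(xi) for v, xi in zip(previous_column, x)]
        elif p == 0:
            column = [math.log(xi) for xi in x]
        else:
            column = [xi ** p for xi in x]
        columns.append(column)
        previous_power, previous_column = p, column
    return columns


class TransformCache:
    """Hypothetical store of transformed columns keyed by (variable, powers)."""

    def __init__(self, data):
        self.data = data      # dict: variable name -> list of positive values
        self.store = {}       # (name, powers) -> transformed columns
        self.computed = 0     # counts real computations, for illustration only

    def get(self, name, powers):
        key = (name, tuple(powers))
        if key not in self.store:
            self.store[key] = fp_transform(self.data[name], key[1])
            self.computed += 1
        return self.store[key]

    def adjustment_data(self, current_powers, exclude):
        """Transformed columns of all variables except the one being evaluated."""
        return {name: self.get(name, powers)
                for name, powers in current_powers.items() if name != exclude}


data = {"x1": [1.0, 2.0, 4.0], "x2": [1.0, 3.0, 9.0], "x3": [2.0, 2.0, 8.0]}
current_powers = {"x1": (1,), "x2": (0,), "x3": (-2, -2)}
cache = TransformCache(data)

cache.adjustment_data(current_powers, exclude="x1")  # computes x2 and x3
cache.adjustment_data(current_powers, exclude="x2")  # computes x1, reuses x3
current_powers["x2"] = (0.5,)                         # x2 got a new power
cache.adjustment_data(current_powers, exclude="x3")  # computes only the new x2
```

Keying the store on the pair (variable, powers) means an entry never becomes stale: when a variable is assigned new powers in a later step or cycle, a new entry is added, and the old one can still be reused if the same powers are selected again. The cost is memory, which grows with the number of distinct powers tried.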

Importantly, the list must be updated in each cycle, since the variables re-enter the selection procedure in each cycle (as outlined in, e.g., Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. Journal of the Royal Statistical Society: Series A (Statistics in Society). 1999;162:71-94).

EdwinKipruto commented 1 year ago

I agree, Michael. We can improve the speed dramatically. When you come to Freiburg, we can work on this together.

EdwinKipruto commented 1 year ago

acd(x) should be calculated once and used in each cycle since it does not change. This will improve the computation speed.
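A minimal sketch of this, again in Python and only as an illustration: placeholder_acd is a stand-in based on scaled ranks and is not the acd transformation used in the package, and the variable names and the loop over cycles are assumptions made for the example.

```python
calls = {"n": 0}  # counts how often the transformation is actually computed


def placeholder_acd(x):
    """Stand-in for acd(x): scaled ranks in (0, 1). Not the package's acd."""
    calls["n"] += 1
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [r / (len(x) + 1) for r in ranks]


data = {"x1": [3.0, 1.0, 2.0], "x2": [10.0, 30.0, 20.0]}
acd_variables = ["x1"]  # hypothetical: only x1 uses the acd transformation

# Computed once, before the first cycle.
acd_data = {name: placeholder_acd(data[name]) for name in acd_variables}

for cycle in range(5):
    for name in data:
        # Each step reads the stored column instead of recomputing it;
        # variables without an acd transformation get None.
        acd_column = acd_data.get(name)
```

Since acd(x) depends only on x and not on the current powers, the stored result stays valid for every step and every cycle.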