Sm00thix / IKPLS

Fast CPU and GPU Python implementations of Improved Kernel PLS by Dayal and MacGregor (1997) and Shortcutting Cross-Validation by Engstrøm (2024).
https://ikpls.readthedocs.io/en/latest/
Apache License 2.0

Selection of validation indices in fast cross-validation algorithm #15

Closed: Sm00thix closed this issue 4 months ago

Sm00thix commented 5 months ago

Hi @parmentelat,

I have realized that there is a subtle error in my implementation of the fast cross-validation algorithm. In practice, the error has close to no effect, but it does change the theoretical runtime of the algorithms explained in https://arxiv.org/abs/2401.13185 from Theta(NK(K+M)) to Theta(NK(K+M) + NP), where P is the number of cross-validation splits. The affected algorithms are the ones that compute the training-set-wise X^T X (algorithm 2 only) and X^T Y.
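For context, here is a minimal sketch of how the training-set-wise products can be obtained by subtracting a validation split's contribution from the full-data products. It ignores centering and scaling, and the variable names are illustrative rather than the package's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 100, 5, 2
X = rng.uniform(size=(N, K))
Y = rng.uniform(size=(N, M))

XTX = X.T @ X  # computed once for the full data set
XTY = X.T @ Y

val_idx = np.arange(10)  # indices of one validation split
X_val, Y_val = X[val_idx], Y[val_idx]

# Training-set-wise products obtained by removing the validation contribution
# instead of recomputing them from the N - N_val training rows.
XTX_train = XTX - X_val.T @ X_val
XTY_train = XTY - X_val.T @ Y_val

# Sanity check against the naive recomputation.
train_mask = np.ones(N, dtype=bool)
train_mask[val_idx] = False
assert np.allclose(XTX_train, X[train_mask].T @ X[train_mask])
assert np.allclose(XTY_train, X[train_mask].T @ Y[train_mask])
```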

The error arises from the selection of validation indices: for each cross-validation split, I construct a boolean mask of length N specifying the validation set, which takes NP time across all splits. I have implemented a solution that does not construct a boolean mask of length N for each split. Instead, the new solution constructs, for each validation split, a list of length N_val holding the indices into X and Y that make up the validation set for that split. This restores the asymptotic runtime to Theta(NK(K+M)).
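A minimal sketch of the difference, assuming cv_splits assigns one hashable label per sample (the names below are illustrative, not the package's internals):

```python
import numpy as np

cv_splits = ["a", "a", "b", "b", "b", "c", "a", "c", "c", "b"]  # one label per sample
N = len(cv_splits)
unique_splits = list(dict.fromkeys(cv_splits))  # order of first occurrence

# Old approach: one boolean mask of length N per split -> Theta(N * P) overall.
masks = {s: np.array([label == s for label in cv_splits]) for s in unique_splits}

# New approach: a single pass over cv_splits builds an index list of length
# N_val per split -> Theta(N) overall.
val_indices = {}
for i, s in enumerate(cv_splits):
    val_indices.setdefault(s, []).append(i)

# Both describe the same validation sets.
for s in unique_splits:
    assert np.array_equal(np.flatnonzero(masks[s]), np.array(val_indices[s]))
```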

In practice, this change does not really have an impact on the running time due to the practical constants hidden by the Theta-notation. As a practical test, I re-ran the most extreme example in the benchmarks, that is, `time_pls.py -model fastnp2 -n 1000000 -k 500 -m 10 -n_components 30 -n_splits 1000000 -n_jobs -1`, and found that the change in runtime is so small that it is not even visible on the graph.

Additionally, a side effect of my proposed solution is that the `cv_splits` argument to `ikpls.fast_cross_validation.numpy_ikpls.PLS.cross_validate` can now be any iterable of hashables (due to the internal use of dictionaries) instead of just integer arrays. This allows a user to define their cross-validation splits using, e.g., strings if they like. However, the validation splits are no longer sorted. Instead, the resulting metrics are ordered according to the first occurrence of each unique validation split in `cv_splits`.
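A rough usage sketch with string labels follows. The constructor arguments and the exact cross_validate signature here are assumptions for illustration and may differ from the actual API:

```python
import numpy as np

from ikpls.fast_cross_validation.numpy_ikpls import PLS  # class path as referenced above

rng = np.random.default_rng(42)
N, K, M = 90, 50, 2
X = rng.uniform(size=(N, K))
Y = rng.uniform(size=(N, M))

# One hashable label per sample; samples sharing a label form one validation split.
# Metrics come back in order of first occurrence: batch_a, batch_b, batch_c.
cv_splits = ["batch_a"] * 30 + ["batch_b"] * 30 + ["batch_c"] * 30

def mse_per_target(Y_true, Y_pred):
    # Assumed metric_function contract: called once per validation split.
    return np.mean((Y_true - Y_pred) ** 2, axis=0)

# The constructor argument and the positional arguments below are assumptions;
# consult the ikpls documentation for the real interface.
pls = PLS(algorithm=2)
results = pls.cross_validate(X, Y, 30, cv_splits, mse_per_target)
```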

I have updated the tests and doc strings to reflect these changes.

I do apologize for any disruption of your work in relation to the review at https://github.com/openjournals/joss-reviews/issues/6533

My questions are:

  1. Will you accept these changes to the code?
  2. Will you accept the timings.png as is or do you require that I rerun the benchmarks related to the fast cross-validation algorithms?

The changes are reflected in https://github.com/Sm00thix/IKPLS/commit/9efdc1af22049cf6065b15bdf280b1ad761895cb with a small update in https://github.com/Sm00thix/IKPLS/commit/d4b16a6b92a71782e565a5481e8fc4680ccd4318 and https://github.com/Sm00thix/IKPLS/commit/697cc4812af4343b2d3b82606b92f66bc8da1f04

I also wonder if adding an option to return the order in which cross-validation was performed would improve usability. Do you have any comments on this?

Sm00thix commented 5 months ago

Hi again @parmentelat,

In relation to my previous comment in this thread, I just realized that I had been running the timing experiments with the original ikpls package instead of the updated one. Indeed, executing time_pls.py on the updated code would crash due to some issues with timeit and closures. I have fixed this issue in https://github.com/Sm00thix/IKPLS/commit/8dff883504759f024ae3b0f2d3925526511357e7 with a minor update in https://github.com/Sm00thix/IKPLS/commit/ab12af1ce17e31867898d71fb6b900ba720cac18

I just reran `time_pls.py -model fastnp2 -n 1000000 -k 500 -m 10 -n_components 30 -n_splits 1000000 -n_jobs -1` with the updated code, and my changes seem to have improved the runtime from 19.3 minutes to approximately 17 minutes. This is, however, as argued above, the most extreme example. Given the logarithmic scale of the graphs, this is unlikely to visually change anything. However, as previously mentioned, I will rerun the experiments using the fast cross-validation algorithms if you think that is better.

I am very sorry for any potential confusion caused by this thread.

parmentelat commented 5 months ago

My questions are:

  1. Will you accept these changes to the code?
  2. Will you accept the timings.png as is or do you require that I rerun the benchmarks related to the fast cross-validation algorithms?

my answers

  1. yes
  2. it's best if you can make sure to provide a figure that reflects a specific version of the code - and to mention that version in the paper itself

parmentelat commented 5 months ago

I also wonder if adding an option to return the order in which cross-validation was performed would improve usability. Do you have any comments on this?

in general I'd tend to say yes; can you be more specific on how you'd propose to go about doing this?

parmentelat commented 5 months ago

and oops I forgot to link the meta-issue https://github.com/openjournals/joss-reviews/issues/6533

Sm00thix commented 5 months ago

My questions are:

  1. Will you accept these changes to the code?
  2. Will you accept the timings.png as is or do you require that I rerun the benchmarks related to the fast cross-validation algorithms?

my answers

  1. yes
  2. it's best if you can make sure to provide a figure that reflects a specific version of the code - and to mention that version in the paper itself

Thank you for the feedback. I will make sure to do as you suggest in 2.

Sm00thix commented 5 months ago

I also wonder if adding an option to return the order in which cross-validation was performed would improve usability. Do you have any comments on this?

in general I'd tend to say yes; can you be more specific on how you'd propose to go about doing this?

Currently, the cross_validate method returns a `list[Any]` of length P, where P is the number of cross-validation splits, as determined by the number of unique elements passed in the `cv_splits: Iterable[Hashable]` argument to cross_validate. This list contains the result of evaluating metric_function on each validation split.

I was thinking along the lines of having cross_validate return a `dict[Hashable, Any]` whose keys are the P unique elements in `cv_splits` and whose values are the results of evaluating metric_function on each of the P validation splits. A user can then directly use the name of a validation split to look up the results of that split in the returned dict, and the order of `dict.keys()` will be the same as the order in which the validation splits were visited. Does my explanation make sense? Otherwise, I can implement it, and we can roll it back if you do not like it.
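A minimal sketch of the proposed return value, using illustrative names rather than the package's internals (Python dicts preserve insertion order, so the keys follow the order in which the splits were visited):

```python
from typing import Any, Callable, Hashable, Iterable

def collect_metrics(
    cv_splits: Iterable[Hashable],
    evaluate_split: Callable[[Hashable], Any],
) -> dict[Hashable, Any]:
    """Evaluate each unique split once and key the results by its label."""
    results: dict[Hashable, Any] = {}
    for split in cv_splits:
        if split not in results:  # visit each unique split at its first occurrence
            results[split] = evaluate_split(split)
    return results

# Hypothetical usage: look up a split's metrics directly by its label.
metrics = collect_metrics(["train_a", "train_b", "train_a"], lambda s: {"mse": 0.0})
assert list(metrics.keys()) == ["train_a", "train_b"]
print(metrics["train_b"])
```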

parmentelat commented 5 months ago

yes, I believe this proposal can only improve usability, please go ahead with it!