Cross validation workflow

iancze commented 2 years ago

Cross validation is useful for determining optimal (or at least good enough) parameter settings for regularization.

Currently, though, most of the functionality for doing this exists outside of the MPoL package itself. This is partially by design and mirrors the way some PyTorch projects are set up with respect to functionality / optimizers. However, the current K-fold CV workflow is somewhat clunky and there is likely room for improvement.

Describe the solution you'd like

- Catalogue issues in the CV workflow using this issue @briannazawadzki
- Explore potential designs for solutions. I think it makes sense to try to keep the core MPoL package focused on the evaluation of an image relative to interferometric data and have CV routines live in a separate MPoL-dev affiliated package (e.g., the way visread or mpoldatasets do). But there probably are a few changes to MPoL itself that would be useful.

iancze commented 2 years ago

Possibly useful for visualization (in addition to tensorboard): https://napari.org/stable/index.html

iancze commented 1 year ago

On a related but possibly separate note, @jeffjennings also mentioned that it might be worthwhile to ensure that cross-validation blocks always have roughly the same 1D weighted baseline distribution.

jeffjennings commented 1 year ago

I think one aspect of the current cross-val workflow that could be improved is the train/test set division in KFoldCrossValidatorGridded, moving from standard k-fold to stratified k-fold. It would address the following:

- Currently the data are divided into a list of cells using a Dartboard, and this cell list is then split into train/test sets. Because the number of visibilities in cells can vary a lot, the training sets often don't have a similar number of points (same for the test sets). Using a single dataset (of real observations) as a trial case, the size of the test set varies by up to 35% for k=5.
  - In turn the ratio of test to training set size can vary a lot, from 19% to 34% in the trial case.
- I don't think it's best to withhold grouped chunks of (u,v) space (whole dartboard cells). The model should be able to accurately predict data it hasn't seen, but the test data should still be similar to the training data. There might be problematic edge cases too, like a highly asymmetric source.
  - Using dartboard cells also makes it harder to ensure that training sets cover a similar baseline distribution.

A stratified k-fold approach would ensure the training sets have almost exactly the same number of points, including the same number in each of several baseline bins. This also ensures the train:test set size ratio is constant and ~exactly a chosen value.
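
As a rough illustration of the stratified split described above, here is a minimal sketch using only the Python standard library. It is not MPoL code: the function name `stratified_kfold_indices`, its parameters, and the choice of rank-based (quantile) baseline bins are all assumptions made for the example, and the (u,v) points are synthetic. Points are binned by baseline length and then dealt round-robin into folds within each bin, so every fold gets nearly the same number of points from every bin.

```python
import math
import random
from collections import defaultdict


def stratified_kfold_indices(u, v, k=5, n_bins=4, seed=0):
    """Return k lists of test-set indices, stratified by baseline length.

    Hypothetical helper for illustration only; not part of MPoL.
    """
    q = [math.hypot(ui, vi) for ui, vi in zip(u, v)]  # baseline lengths
    n = len(q)

    # Rank-based bins: each bin holds roughly n / n_bins points.
    order = sorted(range(n), key=lambda i: q[i])
    bins = defaultdict(list)
    for rank, i in enumerate(order):
        bins[rank * n_bins // n].append(i)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        # Deal points round-robin. Carrying the offset across bins keeps
        # leftover points from always landing in the first folds.
        for j, i in enumerate(members):
            folds[(offset + j) % k].append(i)
        offset += len(members)
    return folds


# Synthetic (u, v) points, purely for demonstration.
rng = random.Random(1)
u = [rng.gauss(0.0, 300.0) for _ in range(1003)]
v = [rng.gauss(0.0, 300.0) for _ in range(1003)]

folds = stratified_kfold_indices(u, v, k=5, n_bins=4)
print([len(f) for f in folds])  # fold sizes differ by at most one point
```

Each fold serves once as the test set, with the remaining folds forming the training set, so the test:train size ratio stays close to 1:(k-1) for every split. Note that this splits individual visibilities rather than gridded cells, which is a different design from the current cell-based approach.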

briannazawadzki commented 1 year ago

We should implement an easy way to use uniform partitioning for CV, similar to how we implemented Dartboard.

briannazawadzki commented 1 year ago

See below for the forced (not generalized at all) implementation we used for testing in 2021.

Messy random cell cross validation

briannazawadzki commented 1 year ago

KFoldCrossValidatorGridded will need to be generalized or changed, as right now it requires Dartboard and does not allow for other options. We could either rename this to communicate that it's dartboard specific, or we could make a generalized KFoldCrossValidatorGridded which can handle multiple types of partitioning.
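
One possible shape for the generalized option is sketched below, again with the standard library only. Everything here is hypothetical: the class name `KFoldCrossValidator`, the partitioner signature `(n_cells, k, rng)`, and both partitioner functions are invented for illustration and do not reflect the existing MPoL API. The idea is simply that the validator takes the partitioning scheme as an argument instead of requiring a Dartboard.

```python
import random


def random_cell_partition(n_cells, k, rng):
    """Assign occupied cells to k folds uniformly at random (hypothetical)."""
    idx = list(range(n_cells))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def contiguous_partition(n_cells, k, rng):
    """Stand-in for a geometry-based scheme: k runs of neighboring cells.

    The rng argument is unused; it is kept so both partitioners share
    one signature.
    """
    idx = list(range(n_cells))
    return [idx[i * n_cells // k:(i + 1) * n_cells // k] for i in range(k)]


class KFoldCrossValidator:
    """Yield (train_cells, test_cells) for any partitioner (hypothetical)."""

    def __init__(self, n_cells, k, partitioner, seed=0):
        self.folds = partitioner(n_cells, k, random.Random(seed))

    def __iter__(self):
        for i, test_cells in enumerate(self.folds):
            train_cells = [
                c for j, fold in enumerate(self.folds) if j != i for c in fold
            ]
            yield train_cells, test_cells


for train_cells, test_cells in KFoldCrossValidator(10, 5, random_cell_partition):
    print(len(train_cells), len(test_cells))
```

Swapping `random_cell_partition` for `contiguous_partition` changes the split strategy without touching the validator, which is the main appeal of this option over a Dartboard-specific rename.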

iancze commented 1 year ago

Closing this issue for now, since the main action items (renaming and RandomCell gridding) were implemented by #132. There are still larger discussions to be had about cross validation strategies (e.g., #93) and accuracy (most importantly), but once we progress those discussions a bit further we can open targeted issues for the codebase.

MPoL-dev / MPoL

Cross validation workflow #99