In 02_train_eval_current_model.py we use drop_incomplete_cases() to remove current-model-incomplete cases whilst preserving the DataFrame index. This index is preserved so that SplitterTrainerPredictor (using the Splitter base class, which in turn uses split_into_folds()) can preserve train and test set consistency across the current and novel models, such that for each split:
1) The current-model train cases are a subset of the novel-model train cases
2) The current-model test cases are the same as the novel-model test cases
However, preprocess_current() is called on the DataFrame between using drop_incomplete_cases() and SplitterTrainerPredictor, and resets the DataFrame index, breaking the above.
We need to:
1) Fix this bug
2) Check that an analogous bug isn't affecting the novel model
3) Amend our code to make it more difficult to make this mistake again at a later date
In
02_train_eval_current_model.py
we usedrop_incomplete_cases()
to remove current-model-incomplete cases whilst preserving the DataFrame index. This index is preserved so thatSplitterTrainerPredictor
(using theSplitter
base class, which in turn usessplit_into_folds()
) can preserve train and test set consistency across the current and novel models, such that for each split:1) The current-model train cases are a subset of the novel-model train cases 2) The current-model test cases are the same as the novel-model test cases
However,
preprocess_current()
is called on the DataFrame between usingdrop_incomplete_cases()
andSplitterTrainerPredictor
, and resets the DataFrame index, breaking the above.We need to:
1) Fix this bug 2) Check that an analogous bug isn't affecting the novel model 3) Amend our code to make it more difficult to make this mistake again at a later date