Train-test split for current model using wrong indices?

finncatling commented 4 years ago

In 02_train_eval_current_model.py we use drop_incomplete_cases() to remove current-model-incomplete cases whilst preserving the DataFrame index. This index is preserved so that SplitterTrainerPredictor (using the Splitter base class, which in turn uses split_into_folds()) can preserve train and test set consistency across the current and novel models, such that for each split:

1) The current-model train cases are a subset of the novel-model train cases 2) The current-model test cases are the same as the novel-model test cases

However, preprocess_current() is called on the DataFrame between using drop_incomplete_cases() and SplitterTrainerPredictor, and resets the DataFrame index, breaking the above.

We need to:

1) Fix this bug 2) Check that an analogous bug isn't affecting the novel model 3) Amend our code to make it more difficult to make this mistake again at a later date

finncatling commented 4 years ago

I think I have fixed this on branch lap_risk_34 - all looks good locally but we still need to check it works with the real NELA data in the BDAU

finncatling commented 4 years ago

Fixed by #36

finncatling / lap-risk

Train-test split for current model using wrong indices? #34