dimstudio / SharpFlow

Script for analysing sensor data in real time

Hold-one-participant-out testing to prevent overfitting #8

Open dimstudio opened 4 years ago

dimstudio commented 4 years ago

Introduce new functionality so that the neural network can handle a dataset in which multiple users are recorded. For this purpose, we can use the CPR_experiment dataset, containing 2 sessions x 11 users = 22 sessions in total. For each user there are 2 sessions (two zip folders). The target classes are: `targets = ['classRate', 'classDepth', 'classRelease']`

The approach should be the following:

  1. train a single (or stacked) LSTM on the merged sessions
  2. evaluate it with a 70/30 hold-out split
  3. NEW: hold-one-user-out testing

In this way, we can check how well the model does on the data of one "unseen" user.
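The hold-one-user-out step could be sketched as follows (a minimal pure-Python illustration; the `P1`…`P11` naming and two-sessions-per-user layout follow the dataset description above, while the function and session names are hypothetical):

```python
# Hold-one-user-out: train on 10 users, test on the one held-out user.
# Session identifiers "<user>_session<k>" are assumed for illustration.
users = [f"P{i}" for i in range(1, 12)]  # 11 users
sessions = {u: [f"{u}_session1", f"{u}_session2"] for u in users}

def hold_one_user_out(sessions):
    """Yield (held_out_user, train_sessions, test_sessions) for each fold."""
    for held_out in sessions:
        test = list(sessions[held_out])
        train = [s for u, ss in sessions.items() if u != held_out for s in ss]
        yield held_out, train, test

for user, train, test in hold_one_user_out(sessions):
    # Each of the 11 folds trains on 20 sessions and tests on the 2 unseen ones.
    assert len(train) == 20 and len(test) == 2
```

Each fold gives an estimate of performance on a completely unseen user; averaging over the 11 folds summarises generalisation across users.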

HansBambel commented 4 years ago

CPR_experiment dataset containing 2 sessions x 11 users = 22 sessions in total. For each user, there are 2 sessions (two zip folders)

So basically the data_helper is splitting the users up and the neural network is trained on a single user again?

dimstudio commented 4 years ago

I would suggest training the neural network with data from all 10 users (and leaving 1 out). Therefore the data_helper stays the same (groups the sessions together), but the validation is split by user. Does that make sense?

dimstudio commented 4 years ago

For the dataset check Skype.

HansBambel commented 4 years ago

Aaah. So you don't mean multiple people simultaneously in a scene. Just recordings from different people, right?

Yeah. We can do that.

HansBambel commented 4 years ago

I am working on this today

By the way: I have only one session for P8 in the cpr_experiment folder

dimstudio commented 4 years ago

I sent you a new version complete with 2 sessions from P8. Please note that there will soon be a new CPR dataset with another 10 participants; I am finalising the annotations.

HansBambel commented 4 years ago

Done with #12

The splitting of training and test data will be done only once and can then be reused. If new data is added or a different split is wanted, just delete the train and test folders, or the files you want to replace.

To train on specific participants and test on others, just specify them in the train_model and test_model functions.
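Such a participant-based partition could look roughly like this (a sketch; the `split_by_participant` name and the `P8_session1.zip` naming scheme are hypothetical illustrations, not the repository's actual helpers):

```python
def split_by_participant(zip_names, test_participants):
    """Partition session zip files into train/test lists by participant prefix."""
    train, test = [], []
    for name in zip_names:
        participant = name.split("_")[0]  # e.g. "P8" from "P8_session1.zip"
        (test if participant in test_participants else train).append(name)
    return train, test

# 11 participants x 2 sessions = 22 zip folders, as in the CPR_experiment dataset
zips = [f"P{i}_session{k}.zip" for i in range(1, 12) for k in (1, 2)]
train, test = split_by_participant(zips, test_participants={"P8"})
# Training keeps the other 10 participants (20 sessions); P8's 2 sessions are held out.
```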

dimstudio commented 4 years ago

@HansBambel I can't train the models on the CPR_experiment dataset:

Traceback (most recent call last):
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/model_training_pytorch.py", line 413, in <module>
    train_test_model()
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/model_training_pytorch.py", line 383, in train_test_model
    train_model(save_model_to, f"{dataset}/train", to_exclude, ignore_files, target_classes)
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/model_training_pytorch.py", line 307, in train_model
    fit(epochs, model, loss_func, opt, train_dl, valid_dl, save_every=None, tensorboard=False)
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/model_training_pytorch.py", line 47, in fit
    acc, prec, recall = acc_prec_rec(model, valid_dl)
  File "C:/Users/Daniele-WIN10/Documents/GitHub/SharpFlow/model_training_pytorch.py", line 320, in acc_prec_rec
    total_tp += torch.sum((ypred_thresh == 1) * (ypred_thresh == yb))
RuntimeError: Expected object of scalar type Bool but got scalar type Float for argument #2 'other'

Process finished with exit code 1
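The traceback suggests `ypred_thresh` is a Bool tensor while `yb` is Float, so `ypred_thresh == yb` compares mixed dtypes. A plain-Python sketch of the same true-positive count, keeping both sides in one type; the corresponding PyTorch fix (casting, e.g. with `.float()`, noted in the docstring) is an assumption, not the change actually made in the repo:

```python
def true_positives(ypred, yb, threshold=0.5):
    """Count true positives after thresholding raw predictions.

    In the PyTorch version, the analogous fix would be to keep both sides in
    one dtype, e.g. ypred_thresh = (ypred > threshold).float(), before
    comparing against the float labels yb.
    """
    ypred_thresh = [1.0 if p > threshold else 0.0 for p in ypred]  # float, like yb
    return sum(1 for p, y in zip(ypred_thresh, yb) if p == 1.0 and p == y)

# Example: 4 predictions vs. float labels -> one true positive (first entry)
tp = true_positives([0.9, 0.2, 0.7, 0.4], [1.0, 0.0, 0.0, 1.0])
```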

I also have two comments:

HansBambel commented 4 years ago

That is weird... For me it works... I think there is something off with the yb. Can you check what is in it?

Concerning your comments: I see, there was some leftover code from my earlier refactoring. I cleaned it up. (This gets rid of train_split=0.7.)

The other split ratios divide the data into training and test sets (this should be done only once per dataset), while train_valid_split divides the training set (so, not the test set) into training and validation sets.
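With the 22 sessions mentioned earlier, the two-level split works out like this (the 0.8 and 0.7 ratios here are hypothetical placeholders, not the repository's actual defaults):

```python
n_sessions = 22  # 11 users x 2 sessions

# Level 1: one-time split into training and test sets (done once per dataset).
train_test_split = 0.8  # hypothetical ratio
n_train_full = round(n_sessions * train_test_split)  # 18 sessions
n_test = n_sessions - n_train_full                   # 4 sessions

# Level 2: train_valid_split divides only the training set further.
train_valid_split = 0.7  # hypothetical ratio
n_train = round(n_train_full * train_valid_split)    # 13 sessions
n_valid = n_train_full - n_train                     # 5 sessions
```

The test set stays untouched by the second split, so validation performance never leaks information from held-out test sessions.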

I also moved some training variables outside to the train_test_model function.

Done in PR #15

dimstudio commented 4 years ago

I tested it; the PyTorch implementation works well now 👍