extend prepare_datasets logic to allow different data splitting methods

I have reached the following split:

--- Training Set Statistics ---
Total Samples: 432
Total Speakers: 20
Authorized Samples: 216
Unauthorized Samples: 216
Authorized to Unauthorized Ratio: 216:216
Samples per Speaker: f1: 36, f10: 17, f2: 17, f3: 18, f4: 21, f5: 17, f6: 14, f7: 36, f8: 36, f9: 11, m1: 15, m10: 17, m2: 15, m3: 36, m4: 11, m5: 15, m6: 36, m7: 15, m8: 36, m9: 13

--- Validation Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168
Samples per Speaker: f1: 12, f10: 12, f2: 12, f3: 12, f4: 12, f5: 12, f6: 12, f7: 12, f8: 12, f9: 12, m1: 12, m10: 12, m2: 12, m3: 12, m4: 12, m5: 12, m6: 12, m7: 12, m8: 12, m9: 12

--- Test Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168
Samples per Speaker: f1: 12, f10: 12, f2: 12, f3: 12, f4: 12, f5: 12, f6: 12, f7: 12, f8: 12, f9: 12, m1: 12, m10: 12, m2: 12, m3: 12, m4: 12, m5: 12, m6: 12, m7: 12, m8: 12, m9: 12
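For reference, statistics of this kind could be computed with something like the sketch below. It is only an illustration, not the code in this repository: the `Sample` record and the `split_statistics` and `format_statistics` helpers are hypothetical names, and the real dataset objects may store speaker and authorization labels differently.

```python
from collections import Counter
from typing import NamedTuple


class Sample(NamedTuple):
    # Hypothetical record layout; the project's dataset entries may differ.
    speaker: str
    authorized: bool


def split_statistics(samples: list[Sample]) -> dict:
    """Summarize one split: totals, class counts, and samples per speaker."""
    per_speaker = Counter(s.speaker for s in samples)
    authorized = sum(1 for s in samples if s.authorized)
    return {
        "total_samples": len(samples),
        "total_speakers": len(per_speaker),
        "authorized": authorized,
        "unauthorized": len(samples) - authorized,
        "per_speaker": dict(sorted(per_speaker.items())),
    }


def format_statistics(name: str, samples: list[Sample]) -> str:
    """Render the summary in the same layout as the statistics listed here."""
    stats = split_statistics(samples)
    per_speaker = ", ".join(f"{k}: {v}" for k, v in stats["per_speaker"].items())
    lines = [
        f"--- {name} Set Statistics ---",
        f"Total Samples: {stats['total_samples']}",
        f"Total Speakers: {stats['total_speakers']}",
        f"Authorized Samples: {stats['authorized']}",
        f"Unauthorized Samples: {stats['unauthorized']}",
        f"Authorized to Unauthorized Ratio: "
        f"{stats['authorized']}:{stats['unauthorized']}",
        f"Samples per Speaker: {per_speaker}",
    ]
    return "\n".join(lines)
```

Calling `print(format_statistics("Training", train_samples))` on each split would make it easy to compare different splitting methods side by side.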

I think the ratio can still be improved toward 70/15/15 train/val/test (it is currently 432/240/240 samples, roughly 47/26/26), but the split already has some very important properties:

All speakers are represented in the training set.
Training set has a 1:1 ratio of authorized to unauthorized samples.
No speaker-script combination appears in more than one set, so the model is never evaluated on data it was trained on.
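These three properties can be checked automatically whenever a new splitting method is added. The following is a minimal sketch under stated assumptions, not the repository's implementation: `LabeledSample` and `verify_split` are hypothetical names, and each sample is assumed to carry a speaker ID, a script ID, and an authorized flag.

```python
from itertools import chain, combinations
from typing import NamedTuple, Sequence


class LabeledSample(NamedTuple):
    # Hypothetical record layout; adapt to the project's actual data model.
    speaker: str
    script: str
    authorized: bool


def verify_split(
    train: Sequence[LabeledSample],
    val: Sequence[LabeledSample],
    test: Sequence[LabeledSample],
) -> list[str]:
    """Return descriptions of violated properties (empty list if all hold)."""
    problems = []

    # Property 1: every speaker in the dataset appears in the training set.
    all_speakers = {s.speaker for s in chain(train, val, test)}
    missing = all_speakers - {s.speaker for s in train}
    if missing:
        problems.append(f"speakers missing from training set: {sorted(missing)}")

    # Property 2: training set has a 1:1 authorized to unauthorized ratio.
    n_authorized = sum(1 for s in train if s.authorized)
    n_unauthorized = len(train) - n_authorized
    if n_authorized != n_unauthorized:
        problems.append(
            f"training set is unbalanced: {n_authorized}:{n_unauthorized}"
        )

    # Property 3: no (speaker, script) pair is shared between any two sets.
    keys = {
        "train": {(s.speaker, s.script) for s in train},
        "val": {(s.speaker, s.script) for s in val},
        "test": {(s.speaker, s.script) for s in test},
    }
    for a, b in combinations(keys, 2):
        overlap = keys[a] & keys[b]
        if overlap:
            problems.append(f"{a}/{b} share combinations: {sorted(overlap)}")

    return problems
```

A check such as `assert not verify_split(train, val, test)` could then run in a unit test for every splitting method, so that changing the train/val/test ratio does not silently break any of the properties.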

The results of training models on this split can be seen here:

OriginalSizeCNN: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/6q32y095
TutorialCNN: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/hs0lfkug
TutorialCNN without standardization: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/rr915m74

Willu12 / iml #9