bas0N closed this 1 day ago
I have reached the following split:

```
--- Training Set Statistics ---
Total Samples: 432
Total Speakers: 20
Authorized Samples: 216
Unauthorized Samples: 216
Authorized to Unauthorized Ratio: 216:216
Samples per Speaker:
f1: 36  f10: 17  f2: 17  f3: 18  f4: 21  f5: 17  f6: 14  f7: 36  f8: 36  f9: 11
m1: 15  m10: 17  m2: 15  m3: 36  m4: 11  m5: 15  m6: 36  m7: 15  m8: 36  m9: 13

--- Validation Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168
Samples per Speaker:
f1: 12  f10: 12  f2: 12  f3: 12  f4: 12  f5: 12  f6: 12  f7: 12  f8: 12  f9: 12
m1: 12  m10: 12  m2: 12  m3: 12  m4: 12  m5: 12  m6: 12  m7: 12  m8: 12  m9: 12

--- Test Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168
Samples per Speaker:
f1: 12  f10: 12  f2: 12  f3: 12  f4: 12  f5: 12  f6: 12  f7: 12  f8: 12  f9: 12
m1: 12  m10: 12  m2: 12  m3: 12  m4: 12  m5: 12  m6: 12  m7: 12  m8: 12  m9: 12
```
I think it could be improved to a 70/15/15 train/val/test ratio, yet it already has some important properties:
The results of model training with it can be seen here:
Nice job! I like the separate notebook and how it is designed now.
Nevertheless, I see a significant issue: the train/validation/test split ratio is broken in the resulting split. You specify the training set as 70% of the files, but after some files are excluded it ends up at about 95% (because of the `min(10, len(train))` fallback used when the val/test set would otherwise be empty). Maybe you could invert the logic: first split the dataset into small, locally balanced subsets, and then distribute those subsets according to the train/validation/test ratio.

An additional minor issue is that (if I am not mistaken) we do not check that every speaker is present in the training set, and I think we should.
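A minimal sketch of what I mean by "inverting the logic", assuming samples are `(path, speaker, label)` tuples; the function name, chunk size, and sample structure are hypothetical, not taken from the notebook:

```python
import random
from collections import defaultdict

def split_by_speaker(samples, ratios=(0.7, 0.15, 0.15), chunk_size=4, seed=42):
    """Split (path, speaker, label) samples into train/val/test.

    First groups samples into small per-speaker chunks, then distributes
    the chunks according to `ratios`. Each speaker's first chunk always
    goes to the training set, so every speaker is represented there.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample[1]].append(sample)

    splits = {"train": [], "val": [], "test": []}
    names = list(splits)
    for speaker, items in by_speaker.items():
        rng.shuffle(items)
        chunks = [items[i:i + chunk_size]
                  for i in range(0, len(items), chunk_size)]
        # Guarantee the speaker is present in train via the first chunk.
        splits["train"].extend(chunks[0])
        # Distribute the remaining chunks greedily: always give the next
        # chunk to the split that is furthest below its target ratio,
        # which keeps the global ratio close to 70/15/15.
        for chunk in chunks[1:]:
            total = sum(len(v) for v in splits.values()) or 1
            deficit = {n: ratios[i] - len(splits[n]) / total
                       for i, n in enumerate(names)}
            target = max(deficit, key=deficit.get)
            splits[target].extend(chunk)
    return splits["train"], splits["val"], splits["test"]
```

Because whole chunks (rather than individual files) are assigned to splits, each split stays locally balanced per speaker, and the exclusion of files can no longer silently inflate the training share to 95%.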