Willu12 / iml

MIT License

extend prepare_datasets logic to allow different data splitting methods #9

Closed bas0N closed 1 day ago

mytkom commented 1 week ago

Nice job! I like the separate notebook and how it is designed now.

Nevertheless, I see a significant issue: the train/validation/test ratio of the resulting split is broken. You specify the training dataset as 70% of the files, but after some files are excluded it ends up at 95% (because of the min(10, len(train)) fallback used when the val/test set would otherwise be empty). Maybe you could invert the logic: first split the dataset into small, locally balanced subsets, and then distribute those subsets according to the training/validation/test ratio. An additional minor issue is that (if I am not mistaken) we do not check that every speaker is present in the training dataset, and I think we should.
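A minimal sketch of the inverted logic suggested above, not the actual prepare_datasets code. It assumes each sample is a (file_name, speaker_id, is_authorized) tuple and that "locally balanced" means one authorized sample paired with one unauthorized sample; the function names split_balanced_chunks and missing_speakers are hypothetical. Surplus samples of the larger class are dropped, so the requested ratio holds over the chunks that remain.

```python
import random


def split_balanced_chunks(samples, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split samples into train/val/test lists of balanced chunks.

    samples: list of (file_name, speaker_id, is_authorized) tuples
    (an assumed layout, not taken from the repository).
    """
    rng = random.Random(seed)
    authorized = [s for s in samples if s[2]]
    unauthorized = [s for s in samples if not s[2]]
    rng.shuffle(authorized)
    rng.shuffle(unauthorized)

    # Each chunk is locally balanced: one authorized plus one unauthorized
    # sample. zip() silently drops the surplus of the larger class.
    chunks = list(zip(authorized, unauthorized))

    # Distribute whole chunks according to the requested ratio.
    n_train = round(ratios[0] * len(chunks))
    n_val = round(ratios[1] * len(chunks))
    parts = (
        chunks[:n_train],
        chunks[n_train:n_train + n_val],
        chunks[n_train + n_val:],
    )
    # Flatten the chunks back into per-split sample lists.
    return tuple([s for chunk in part for s in chunk] for part in parts)


def missing_speakers(train, all_samples):
    """Return speakers that occur in the data but not in the training split."""
    return {s[1] for s in all_samples} - {s[1] for s in train}
```

Because whole chunks are distributed, every split keeps a 1:1 authorized-to-unauthorized ratio, and calling missing_speakers(train, samples) after the split gives a quick way to detect speakers that never reach the training set.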

bas0N commented 1 week ago

I have reached the following split:

--- Training Set Statistics ---
Total Samples: 432
Total Speakers: 20
Authorized Samples: 216
Unauthorized Samples: 216
Authorized to Unauthorized Ratio: 216:216

Samples per Speaker:
f1: 36, f10: 17, f2: 17, f3: 18, f4: 21, f5: 17, f6: 14, f7: 36, f8: 36, f9: 11
m1: 15, m10: 17, m2: 15, m3: 36, m4: 11, m5: 15, m6: 36, m7: 15, m8: 36, m9: 13

--- Validation Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168

Samples per Speaker:
f1: 12, f10: 12, f2: 12, f3: 12, f4: 12, f5: 12, f6: 12, f7: 12, f8: 12, f9: 12
m1: 12, m10: 12, m2: 12, m3: 12, m4: 12, m5: 12, m6: 12, m7: 12, m8: 12, m9: 12

--- Test Set Statistics ---
Total Samples: 240
Total Speakers: 20
Authorized Samples: 72
Unauthorized Samples: 168
Authorized to Unauthorized Ratio: 72:168

Samples per Speaker:
f1: 12, f10: 12, f2: 12, f3: 12, f4: 12, f5: 12, f6: 12, f7: 12, f8: 12, f9: 12
m1: 12, m10: 12, m2: 12, m3: 12, m4: 12, m5: 12, m6: 12, m7: 12, m8: 12, m9: 12
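For reference, the effective train/val/test ratio of this split can be computed directly from the totals reported above. This is only a quick arithmetic check; the dictionary name counts is illustrative.

```python
# Total samples per split, copied from the statistics above.
counts = {"train": 432, "val": 240, "test": 240}

total = sum(counts.values())
ratios = {name: n / total for name, n in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

With 912 samples in total, this works out to roughly 47% / 26% / 26%.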

I think it can still be improved towards a 70/15/15 train/val/test ratio, yet it already has some very important features:

The results of model training with it can be seen here:

  1. OriginalSizeCNN: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/6q32y095
  2. TutorialCNN: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/hs0lfkug
  3. TutorialCNN without standardization: https://api.wandb.ai/links/wch-basinski-politechnika-warszawska/rr915m74