gan-police / frequency-forensics

Deepfake detection using wavelet-packets in PyTorch, European Conference on Machine Learning (ECML PKDD) 2022.

Feature: Separate data split #5

Closed felixblanke closed 3 years ago

felixblanke commented 3 years ago

To ensure that each class is fairly represented in the training/validation/test split, we should split each class separately and combine the per-class splits afterwards.

This changes the meaning of the command-line arguments train-size, test-size and val-size: previously they referred to the overall dataset; with this change they describe the split of a single folder. So the total size of, e.g., the training set will be (number of classes) * train-size instead of train-size.
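To make the new meaning of the size arguments concrete, here is a minimal sketch of a per-class split. It is not the code from the repository: the function name `split_per_class`, its parameters, and the dictionary layout (class folder name mapped to a list of file paths) are assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, int]  # (file path, integer class label)


def split_per_class(
    files_by_class: Dict[str, Sequence[str]],  # hypothetical layout: folder -> files
    train_size: int,
    val_size: int,
    test_size: int,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample], List[Sample]]:
    """Split every class separately, then concatenate the per-class splits."""
    rng = random.Random(seed)
    needed = train_size + val_size + test_size
    train: List[Sample] = []
    val: List[Sample] = []
    test: List[Sample] = []
    # Sorting the class names makes the label assignment deterministic.
    for label, class_name in enumerate(sorted(files_by_class)):
        files = sorted(files_by_class[class_name])
        if len(files) < needed:
            raise ValueError(
                f"class {class_name!r} has {len(files)} files, needs {needed}"
            )
        rng.shuffle(files)
        train += [(f, label) for f in files[:train_size]]
        val += [(f, label) for f in files[train_size:train_size + val_size]]
        test += [(f, label) for f in files[train_size + val_size:needed]]
    return train, val, test


# Two classes with ten files each; the sizes are now per class.
demo_files = {
    "real": [f"real/{i}.png" for i in range(10)],
    "fake": [f"fake/{i}.png" for i in range(10)],
}
train, val, test = split_per_class(demo_files, train_size=6, val_size=2, test_size=2)
print(len(train), len(val), len(test))
```

With two classes and train-size 6, the training set holds 2 * 6 = 12 samples, six from each class.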

I also experimented with parallelizing the data loading. However, I found it fiddly to keep the labels in sync with the loaded images. One could exploit the fact that the label is constant within the split of one folder and load the images of each folder separately, but I haven't done that.
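A rough sketch of that idea, which is not implemented in the repository: load each folder on its own and attach one constant label per folder. The names `load_folder` and `load_all` and the `loader` callback are hypothetical. In real use `loader` would read an image; here any function of the path will do. The sketch relies on `Executor.map` from the standard library returning results in input order, so no extra synchronization of labels is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def load_folder(
    paths: Sequence[str],
    label: int,
    loader: Callable[[str], T],  # hypothetical: e.g. a function that reads an image
    workers: int = 4,
) -> Tuple[List[T], List[int]]:
    """Load all files of one folder in parallel; the label is constant."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `paths`.
        items = list(pool.map(loader, paths))
    return items, [label] * len(items)


def load_all(
    folders: Sequence[Sequence[str]],
    loader: Callable[[str], T],
    workers: int = 4,
) -> Tuple[List[T], List[int]]:
    """Load folder by folder, using the folder index as the class label."""
    data: List[T] = []
    labels: List[int] = []
    for label, paths in enumerate(folders):
        items, folder_labels = load_folder(paths, label, loader, workers)
        data.extend(items)
        labels.extend(folder_labels)
    return data, labels


# Stand-in loader for the demo: upper-case the path instead of reading a file.
data, labels = load_all([["a.png", "b.png"], ["c.png"]], loader=str.upper)
print(data, labels)
```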

Nevertheless, while experimenting I parallelized the loading of the file list. This is hardly necessary but yields a minor speed gain (~2 s for large data sets), so I would suggest adding it anyway.
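For illustration, listing the files of several folders in parallel could look roughly like this. It is a sketch under assumptions, not the code proposed for the repository: `list_files` and `list_all` are hypothetical names, and a thread pool is just one possible way to parallelize the listing.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence


def list_files(folder: Path) -> List[Path]:
    """Return the files of one folder in sorted (deterministic) order."""
    return sorted(p for p in folder.iterdir() if p.is_file())


def list_all(folders: Sequence[Path], workers: int = 4) -> List[List[Path]]:
    """List every folder in its own task; results keep the order of `folders`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(list_files, folders))


# Small demo on temporary folders standing in for the class folders.
with tempfile.TemporaryDirectory() as root:
    folders = [Path(root) / "real", Path(root) / "fake"]
    for folder in folders:
        folder.mkdir()
        for i in range(2):
            (folder / f"{i}.png").touch()
    listing = list_all(folders)
    names = [[p.name for p in files] for files in listing]
print(names)
```

Because each folder's list comes back in the position of its folder, the class label of every file is still given by the index of the list it sits in.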