To ensure that each class is fairly represented in the training/validation/test split, we should split each class separately and combine the per-class splits afterwards.
This comes with a change in meaning of the command line args `train-size`, `test-size` and `val-size`: before, they described the size of the overall dataset; with this change, they describe the size of the split of one class folder. So the total size of e.g. the training set will be `#classes * train-size` instead of `train-size`.
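As a minimal sketch of the per-class split described above (function and parameter names are my own, not taken from the codebase; sizes are counts per class):

```python
import os
import random


def split_per_class(data_dir, train_size, val_size, test_size, seed=0):
    """Split each class folder separately, then combine the splits.

    train_size/val_size/test_size are counts *per class*, so the combined
    training set holds (number of classes) * train_size samples.
    Returns three lists of (path, class_name) pairs.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in sorted(os.listdir(data_dir)):
        cls_dir = os.path.join(data_dir, cls)
        if not os.path.isdir(cls_dir):
            continue
        files = sorted(os.listdir(cls_dir))
        rng.shuffle(files)
        n = train_size + val_size + test_size
        picked = [(os.path.join(cls_dir, f), cls) for f in files[:n]]
        # Slice the shuffled per-class list into the three splits.
        train += picked[:train_size]
        val += picked[train_size:train_size + val_size]
        test += picked[train_size + val_size:]
    return train, val, test
```

Because every class contributes exactly `train_size`/`val_size`/`test_size` samples, each split is balanced across classes by construction.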
I also experimented with parallelizing the loading of the data. However, I found it rather fiddly to keep the order of the labels synchronized. One could exploit the fact that the label within one folder's split stays constant and load the images for each folder separately, but I haven't done that.
Nevertheless, while experimenting I did parallelize the loading of the file list. This is hardly necessary but achieves minor speed gains (~2 s for large datasets), so I would suggest adding it anyway.
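The parallel file-list loading could look roughly like this (a sketch, assuming one thread per class folder; names are illustrative, not the actual implementation):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def list_files_parallel(data_dir):
    """Build the (path, class) file list by scanning each class folder
    in a worker thread.

    ThreadPoolExecutor.map returns results in the order the folders were
    submitted, so the resulting label order stays deterministic.
    """
    classes = sorted(
        d for d in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, d))
    )

    def list_one(cls):
        cls_dir = os.path.join(data_dir, cls)
        return [(os.path.join(cls_dir, f), cls)
                for f in sorted(os.listdir(cls_dir))]

    with ThreadPoolExecutor() as pool:
        per_class = pool.map(list_one, classes)  # preserves input order
    return [item for sub in per_class for item in sub]
```

Since `map` preserves the submission order, this sidesteps the label-synchronization issue for the file list, even though the same trick wasn't applied to the image loading itself.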