The weird part is that the IDs in each fold are clearly different and the number of samples also varies between the folds. It's only the class frequencies that somehow end up always the same and unreasonably small.
Now I have also discovered that even when K = 10, the frequencies I get from the validation batch are much smaller than those I find in the actual validation dataset: for example 120 instead of 770.
And this issue exists even with leave-one-out cross-validation.
I identified that the issue is in the dataloader, namely here:
def val_dataloader(self):
    print("\nINSIDE ON THE TRAINING!:\nVal IDs: ", self.val_ids)

    # Expected class counts, computed directly from the labels at the validation indices
    test_y = self.val_dataset.y_dataset[self.val_ids, 2:].astype(int)
    print("\nValidating on ", test_y.shape[0], " samples with ", np.sum(test_y), " ones with values in [",
          np.min(test_y), ":", np.max(test_y), "]")
    print("Bin count: ", np.bincount(test_y[:, 1]))

    val_sampler = torch.utils.data.SequentialSampler(self.val_ids)
    val_batch_size = len(self.val_ids)

    loader = torch.utils.data.DataLoader(
        dataset=self.val_dataset,
        batch_size=val_batch_size,
        num_workers=0,
        # pin_memory=True,
        sampler=val_sampler,
    )

    # Class counts as they actually come out of the DataLoader
    for batch_ndx, data in enumerate(loader):
        print("\nAfter the DATALOADER!:")
        test_y = data["property"][:, 2:].int().numpy()
        print("\nValidating on ", test_y.shape[0], " samples with ", np.sum(test_y), " ones with values in [",
              np.min(test_y), ":", np.max(test_y), "]")
        print("Bin count: ", np.bincount(test_y[:, 1]))

    return loader
This code returns the following output:
INSIDE ON THE TRAINING!:
Val IDs: [ 842 843 844 ... 70119 70120 70121]
Validating on 7083 samples with 428 ones with values in [ 0 : 1 ]
Bin count: [6905 178]
After the DATALOADER!:
Validating on 7083 samples with 146 ones with values in [ 0 : 1 ]
Bin count: [7030 53]
So after the dataset has gone through the DataLoader, many of the values turn into zeros. Crazy!
When I fully detach this code from the rest of the repo, it works properly:
import numpy as np
import torch

test_y = np.array([[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0]])
print("Val dataset is: ", test_y)
print("\nValidating on ", test_y.shape, " samples with ", np.sum(test_y), " ones with values in [",
      np.min(test_y), ":", np.max(test_y), "]")

val_batch_size = 5
loader = torch.utils.data.DataLoader(
    dataset=test_y,
    batch_size=val_batch_size,
    num_workers=0,
)

for batch_ndx, data in enumerate(loader):
    print("\nAfter the DATALOADER!:")
    test_y = data.int().numpy()
    print("\nValidating on ", test_y.shape, " samples with ", np.sum(test_y), " ones with values in [",
          np.min(test_y), ":", np.max(test_y), "]")
    print(test_y)

# return loader  (leftover from the original method; not needed in this standalone test)
So the issue has to be somewhere on our side.
I have further discovered that the issue disappears when I don't use a sampler, and that it is there regardless of whether I use SequentialSampler or RandomSampler.
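One way to see what is going on (just a small sketch; the ids below are a few of the ones from the log above) is to print what these samplers actually yield when given a list of indices:

import torch

val_ids = [842, 843, 844]
# Both samplers ignore the values in val_ids and only look at its length,
# so they yield positions 0..len(val_ids)-1 instead of the ids themselves.
print(list(torch.utils.data.SequentialSampler(val_ids)))   # [0, 1, 2]
print(list(torch.utils.data.RandomSampler(val_ids)))       # a permutation of [0, 1, 2]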
I finally identified the issue!
SequentialSampler, like any other Sampler, expects a dataset as input, not indices:
https://pytorch.org/docs/stable/data.html?highlight=sequentialsampler#torch.utils.data.SequentialSampler
torch.utils.data.SubsetRandomSampler, which is used for the training dataset, on the other hand really does work with indices.
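For reference, here is a minimal sketch (with a toy dataset and made-up ids, not the real objects from the repo) of two ways to build the validation loader so that it actually iterates over the chosen indices:

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler, TensorDataset

# Toy stand-ins for the real dataset and validation indices
full_dataset = TensorDataset(torch.arange(10))
val_ids = [6, 7, 8, 9]

# Option 1: SubsetRandomSampler takes indices, so it can be reused for validation
loader_a = DataLoader(full_dataset, batch_size=len(val_ids),
                      sampler=SubsetRandomSampler(val_ids))

# Option 2: wrap the dataset in a Subset and keep the default sequential order
loader_b = DataLoader(Subset(full_dataset, val_ids), batch_size=len(val_ids))

print(next(iter(loader_a)))   # samples 6..9, in random order
print(next(iter(loader_b)))   # samples 6, 7, 8, 9, in order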
This is super weird: everything works fine when doing cross-validation over 10 folds, but the issue appears when K = 20.
The issue is that every fold now contains exactly the same class frequencies, and they are super low. With K = 10 there are some 100-200 samples in each fold which are not "zero-vectors", while with K = 20 there are only 11 of those, and this number (as well as the exact frequency for each class) is exactly the same for all the folds ...
It does not happen for K = 12 or K = 16. The issue appears at K = 18.