The weird part is that the IDs in each fold are clearly different and the number of samples also varies between the folds. It's only the class frequencies that somehow end up always the same and unreasonably small.
Now I have also discovered that even when K = 10, the frequencies I get from the validation batch are much smaller than those I find in the actual validation dataset: for example 120 instead of 770.
And this issue exists even with leave-one-out cross-validation.
I identified that the issue is in the dataloader, namely here:
def val_dataloader(self):
    print("\nINSIDE ON THE TRAINING!:\nVal IDs: ", self.val_ids)

    # Expected class counts, computed directly from the labels at the validation indices
    test_y = self.val_dataset.y_dataset[self.val_ids, 2:].astype(int)
    print("\nValidating on ", test_y.shape[0], " samples with ", np.sum(test_y), " ones with values in [",
          np.min(test_y), ":", np.max(test_y), "]")
    print("Bin count: ", np.bincount(test_y[:, 1]))

    val_sampler = torch.utils.data.SequentialSampler(self.val_ids)
    val_batch_size = len(self.val_ids)

    loader = torch.utils.data.DataLoader(
        dataset=self.val_dataset,
        batch_size=val_batch_size,
        num_workers=0,
        # pin_memory=True,
        sampler=val_sampler,
    )

    # Class counts as they actually come out of the DataLoader
    for batch_ndx, data in enumerate(loader):
        print("\nAfter the DATALOADER!:")
        test_y = data["property"][:, 2:].int().numpy()
        print("\nValidating on ", test_y.shape[0], " samples with ", np.sum(test_y), " ones with values in [",
              np.min(test_y), ":", np.max(test_y), "]")
        print("Bin count: ", np.bincount(test_y[:, 1]))

    return loader
This code returns the following output:
INSIDE ON THE TRAINING!:
Val IDs: [ 842 843 844 ... 70119 70120 70121]
Validating on 7083 samples with 428 ones with values in [ 0 : 1 ]
Bin count: [6905 178]
After the DATALOADER!:
Validating on 7083 samples with 146 ones with values in [ 0 : 1 ]
Bin count: [7030 53]
So after the dataset has gone through the DataLoader, many of the values turn into zeros. Crazy!
When I fully detach this code from the rest of the repo, it works properly:
import numpy as np
import torch

test_y = np.array([[0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 0]])
print("Val dataset is: ", test_y)
print("\nValidating on ", test_y.shape, " samples with ", np.sum(test_y), " ones with values in [",
      np.min(test_y), ":", np.max(test_y), "]")

val_batch_size = 5
loader = torch.utils.data.DataLoader(
    dataset=test_y,
    batch_size=val_batch_size,
    num_workers=0,
)

for batch_ndx, data in enumerate(loader):
    print("\nAfter the DATALOADER!:")
    test_y = data.int().numpy()
    print("\nValidating on ", test_y.shape, " samples with ", np.sum(test_y), " ones with values in [",
          np.min(test_y), ":", np.max(test_y), "]")
    print(test_y)

# return loader  (leftover from the original method; not needed in this standalone test)
So the issue has to be somewhere on our side.
I have further discovered that the issue disappears when I don't use a sampler, and that it is there regardless of whether I use SequentialSampler or RandomSampler.
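One way to see what is going on (just a small sketch; the ids below are a few of the ones from the log above) is to print what these samplers actually yield when given a list of indices:

import torch

val_ids = [842, 843, 844]
# Both samplers ignore the values in val_ids and only look at its length,
# so they yield positions 0..len(val_ids)-1 instead of the ids themselves.
print(list(torch.utils.data.SequentialSampler(val_ids)))   # [0, 1, 2]
print(list(torch.utils.data.RandomSampler(val_ids)))       # a permutation of [0, 1, 2]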
I finally identified the issue!
SequentialSampler, like any other Sampler, expects a dataset as input, not indices:
https://pytorch.org/docs/stable/data.html?highlight=sequentialsampler#torch.utils.data.SequentialSampler
torch.utils.data.SubsetRandomSampler, which is used for the training dataset, on the other hand really does work with indices.
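For reference, here is a minimal sketch (with a toy dataset and made-up ids, not the real objects from the repo) of two ways to build the validation loader so that it actually iterates over the chosen indices:

import torch
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler, TensorDataset

# Toy stand-ins for the real dataset and validation indices
full_dataset = TensorDataset(torch.arange(10))
val_ids = [6, 7, 8, 9]

# Option 1: SubsetRandomSampler takes indices, so it can be reused for validation
loader_a = DataLoader(full_dataset, batch_size=len(val_ids),
                      sampler=SubsetRandomSampler(val_ids))

# Option 2: wrap the dataset in a Subset and keep the default sequential order
loader_b = DataLoader(Subset(full_dataset, val_ids), batch_size=len(val_ids))

print(next(iter(loader_a)))   # samples 6..9, in random order
print(next(iter(loader_b)))   # samples 6, 7, 8, 9, in order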
This is super weird: everything works fine when doing cross-validation over 10 folds, but the issue appears when K = 20.
The issue is that every fold now contains exactly the same class frequencies, and they are super low. With K = 10 there are some 100-200 samples in each fold which are not "zero-vectors", while with K = 20 there are only 11 of those, and this number (as well as the exact frequency for each class) is exactly the same for all the folds ...
It does not happen for K = 12 or K = 16. The issue appears at K = 18.