TTN-YKK / Clustering_friendly_representation_learning


Question with modifying your work for a custom dataset #1

Open juanmed opened 2 years ago

juanmed commented 2 years ago

Hi,

First of all, thanks for sharing your work. The results and the idea are very good, and I am trying to use your work on my own dataset. In contrast to your paper, my dataset has only 2 classes (0 and 1), and the image size (600x600) is quite different from CIFAR-10 (32x32). I modified the parts of your training code where I determined changes were needed but, after training, all the clustering metrics remain almost constant at low values, and the metrics for the train and test datasets converge to the same value:


4%
88/2000 [1:56:37<42:36:42, 80.23s/it, loss=9.28, loss_fd=4.41, loss_id=4.88]
100%
57/57 [01:19<00:00, 1.29s/it]

Epoch:1 Train set Kmeans ACC, NMI, ARI = 0.5911111111111111, 9.525786354678004e-05, 0.0005876669187708655
Epoch:1 Test set Kmeans ACC, NMI, ARI = 0.5866666666666667, 4.995551521601351e-05, 2.3273675146042313e-05

.... more training epochs .....

Epoch:10 Train set Kmeans ACC, NMI, ARI = 0.52, 0.008587859452158472, -0.007227530774498894
Epoch:10 Test set Kmeans ACC, NMI, ARI = 0.52, 0.008587859452158472, -0.007227530774498894

.... more training epochs .....

Epoch:20 Train set Kmeans ACC, NMI, ARI = 0.5288888888888889, 0.008142569497087198, 0.001965821471942736
Epoch:20 Test set Kmeans ACC, NMI, ARI = 0.5288888888888889, 0.008142569497087198, 0.001965821471942736

.... more training epochs .....

Epoch:30 Train set Kmeans ACC, NMI, ARI = 0.5377777777777778, 0.0004791073409559029, -0.0030825615120523963
Epoch:30 Test set Kmeans ACC, NMI, ARI = 0.5377777777777778, 0.0004791073409559029, -0.0030825615120523963

.... more training epochs .....

Epoch:40 Train set Kmeans ACC, NMI, ARI = 0.5555555555555556, 0.0003241302573988474, 0.0007482397180990874
Epoch:40 Test set Kmeans ACC, NMI, ARI = 0.5555555555555556, 0.0003241302573988474, 0.0007482397180990874

.... more training epochs .....

Epoch:50 Train set Kmeans ACC, NMI, ARI = 0.5466666666666666, 0.00016237588462974682, -5.354551257226367e-05
Epoch:50 Test set Kmeans ACC, NMI, ARI = 0.5466666666666666, 0.00016237588462974682, -5.354551257226367e-05

.... more training epochs .....

Epoch:60 Train set Kmeans ACC, NMI, ARI = 0.5911111111111111, 0.004722642422342852, 0.010586763219130104
Epoch:60 Test set Kmeans ACC, NMI, ARI = 0.5911111111111111, 0.004722642422342852, 0.010586763219130104

.... more training epochs .....

Epoch:70 Train set Kmeans ACC, NMI, ARI = 0.5822222222222222, 0.004002797256940793, 0.00854729254795711
Epoch:70 Test set Kmeans ACC, NMI, ARI = 0.5822222222222222, 0.004002797256940793, 0.00854729254795711

.... more training epochs .....

Epoch:80 Train set Kmeans ACC, NMI, ARI = 0.6044444444444445, 0.0022726849901668618, 0.008573041640785145
Epoch:80 Test set Kmeans ACC, NMI, ARI = 0.6044444444444445, 0.0022726849901668618, 0.008573041640785145

.... more training epochs .....

I get the metrics for test dataset by simply doing:

acc, nmi, ari = check_clustering_metrics(npc, test_loader)

The train dataset has 227 images and the test one 75. The only hyperparameters I modified are:

batch_size = 4
lr=0.1,
Input size = 400 x 400  (I downsize the images from 600 to 400 px)

Finally, when I look at the losses loss_id and loss_fd, I see that while loss_id slowly decreases, loss_fd stays nearly constant at around 4.4.

[image: loss plot]

I would appreciate your comments on which parameters to change to get the learning process working correctly. I am using Google Colab, which provides a GPU with 15 GB of memory.

Thank you!

TTN-YKK commented 2 years ago

Hi, thank you for trying our IDFD.

I get the metrics for test dataset by simply doing: acc, nmi, ari = check_clustering_metrics(npc, test_loader)

check_clustering_metrics is designed for a training dataset since the function uses feature vectors in the memory bank in NPC. Please modify the code to calculate feature vectors of your test dataset using the trained net.

batch_size = 4

Since the FD term depends on batch_size as described in our paper, we recommend a larger batch_size. We think this is the reason why loss_fd is constant.
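The batch-size dependence can be illustrated with a small NumPy sketch. It assumes the FD term is a softmax cross-entropy over the correlation matrix between feature dimensions, where each feature dimension is normalized across the batch. The function names, the temperature tau2 = 2.0, and the random stand-in features are all assumptions made for illustration, not code from the repository.

```python
import numpy as np

def feature_correlations(features):
    """Return the (d, d) correlation matrix between feature dimensions."""
    # Normalize each column (one feature across the batch) to unit length.
    cols = features / np.linalg.norm(features, axis=0, keepdims=True)
    return cols.T @ cols

def fd_loss(features, tau2=2.0):
    """Softmax cross-entropy over feature correlations (tau2 is assumed)."""
    logits = feature_correlations(features) / tau2
    log_norm = np.log(np.exp(logits).sum(axis=1))
    # Each row's target is its own diagonal entry.
    return float(np.mean(log_norm - np.diag(logits)))

def mean_sq_offdiag(features):
    """Mean squared correlation between distinct feature dimensions."""
    corr = feature_correlations(features)
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.mean(corr[mask] ** 2))

rng = np.random.default_rng(0)
# Random features stand in for network outputs (illustration only).
small_batch = rng.standard_normal((4, 128))    # batch_size = 4, 128-d features
large_batch = rng.standard_normal((128, 128))  # batch_size = 128

for name, batch in [("batch_size=4", small_batch), ("batch_size=128", large_batch)]:
    print(name, round(mean_sq_offdiag(batch), 4), round(fd_loss(batch), 4))
```

With only 4 samples, the 128 feature columns live in a 4-dimensional space, so the correlation matrix has rank at most 4. The mean squared off-diagonal correlation then cannot drop below (128/4 - 1)/127, roughly 0.244, whatever the network learns, which would be consistent with a loss_fd that barely moves.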

Input size = 400 x 400

Our training code is designed for small images (e.g., CIFAR-10), and our resnet18 is customized for CIFAR-10. We therefore recommend replacing our customized resnet18 with the standard resnet18 for ImageNet.

juanmed commented 2 years ago

Hi, thanks for your reply back.

I tried the changes you suggested. I created an npc for the test dataset and used the trained net to calculate its feature vectors, as follows:

features = norm(net(inputs))
outputs = test_npc(features, indexes)

where inputs and indexes come from the test_loader in a similar fashion to the train_loader.

I also increased the batch_size to 128, as in the CIFAR10 experiments, used a vanilla resnet18 from PyTorch, and changed the output layer to 128 nodes:

def normalResnet18(low_dim = 128):
    backbone = models.resnet18(pretrained=True)
    n_inputs = backbone.fc.in_features
    linear1 = nn.Linear(n_inputs, low_dim)
    backbone.fc = nn.Sequential(linear1)
    return backbone

I trained for 150 epochs with learning_rate = 0.1, and got the following for the losses:

[image: plot of the training losses]

As you can see, the loss_fd is almost constant. When I check the values for loss_fd, it is actually decreasing very slowly: it starts at 4.4030 at epoch=0 and finishes at 4.3611 at epoch=143.

I am trying with different learning rates, but I was wondering if you had any further comments? It seems that either my learning rate is too small, or the network is rapidly finding a local minimum such that loss does not decrease further which would suggest a learning rate too high.

TTN-YKK commented 2 years ago

I tried the changes you suggested. I created an npc for the test dataset and used the trained net to calculate its feature vectors, as follows: features = norm(net(inputs)) outputs = test_npc(features, indexes) where inputs and indexes come from the test_loader in a similar fashion to the train_loader.

Since the memory bank in npc will be updated at the backpropagation, features = norm(net(inputs)) of test datasets should be used as feature vectors for calculating metrics.

As you can see, the loss_fd is almost constant. When I check the values for loss_fd, it is actually decreasing very slowly: it starts at 4.4030 at epoch=0 and finishes at 4.3611 at epoch=143.

In experiments on CIFAR10, the ID loss decreases slowly and the FD loss decreases even more slowly, as shown in the following figures. [two images: loss curves on CIFAR10]
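A little arithmetic suggests why loss_fd can only move within a narrow band. The sketch below assumes the FD term is a softmax cross-entropy over the correlations between d = 128 feature dimensions with a temperature of 2. Both the form and the temperature are assumptions here, not values stated in this thread, and fd_loss_at is a hypothetical helper.

```python
import math

def fd_loss_at(offdiag_corr, dim=128, tau2=2.0):
    """FD loss when every pair of distinct features has the same correlation.

    dim and tau2 are assumed values used only for this illustration.
    """
    diag = math.exp(1.0 / tau2)                      # a feature with itself
    off = (dim - 1) * math.exp(offdiag_corr / tau2)  # all other features
    return math.log(diag + off) - 1.0 / tau2

decorrelated = fd_loss_at(0.0)  # all features mutually uncorrelated
collapsed = fd_loss_at(1.0)     # all features perfectly correlated
print(f"{decorrelated:.4f} {collapsed:.4f}")
```

Under these assumptions the loss runs from about 4.852 (fully correlated features) down to about 4.357 (fully decorrelated features), so a reported drop from 4.4030 to 4.3611 would already cover most of the available range.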

I am trying with different learning rates, but I was wondering if you had any further comments?

  1. Since good learning rates depend on datasets, there's no other way but trial and error.
  2. As described in our paper, data augmentations are very important and suitable data augmentations depend on datasets and tasks. It is better to try different methods and data augmentation parameters.
  3. We think your dataset is very small. If your task is clustering, we recommend combining train datasets and test datasets to train a network.

juanmed commented 2 years ago

@TTN-YKK Thank you for your reply.

Since the memory bank in npc will be updated at the backpropagation, features = norm(net(inputs)) of test datasets should be used as feature vectors for calculating metrics.

I am not sure I understand this correctly. The way I am evaluating the metrics for the test set is:

            net.eval()
            with torch.no_grad():
              for batch_idx, (inputs, _,
                          indexes) in enumerate(tqdm.tqdm(test_loader)):
                inputs = inputs.to(device, dtype=torch.float32, non_blocking=True)
                indexes = indexes.to(device, non_blocking=True)
                features = norm(net(inputs))
                outputs = test_npc(features, indexes)
                test_loss_id, test_loss_fd = test_loss(outputs, features, indexes)
                tot_test_loss = test_loss_id + test_loss_fd
                # track test loss
                test_trackers["loss"].add(tot_test_loss)
                test_trackers["loss_id"].add(test_loss_id)
                test_trackers["loss_fd"].add(test_loss_fd)
              test_trackers["loss"].save_avg()
              test_trackers["loss_id"].save_avg()
              test_trackers["loss_fd"].save_avg()
              test_acc, test_nmi, test_ari = check_clustering_metrics(test_npc, test_loader)

Your graphs are very helpful. So the losses do decrease very slowly, and the values I am getting might be correct. I will test with various learning rates, augmentations, and dataset splits.

TTN-YKK commented 2 years ago

I am not sure I understand this correctly.

My suggestion is to evaluate the test dataset with the following code.

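import numpy as np
import torch
import tqdm
from sklearn.cluster import KMeans

# net, norm, device, test_loader and metrics come from the training script.
# test_loader must not shuffle, so that features line up with dataset.targets.
net.eval()  # assumption: evaluate with running BatchNorm statistics
test_features = []
with torch.no_grad():
    for inputs, _, _ in tqdm.tqdm(test_loader):
        inputs = inputs.to(device, dtype=torch.float32, non_blocking=True)
        features = norm(net(inputs))
        test_features.append(features.cpu().numpy())

# Cluster the network's test features directly; the memory bank is not used.
test_features = np.concatenate(test_features)
y = np.array(test_loader.dataset.targets)
n_clusters = len(np.unique(y))
kmeans = KMeans(n_clusters=n_clusters)
y_pred = kmeans.fit_predict(test_features)
test_acc, test_nmi, test_ari = metrics.acc(y, y_pred), metrics.nmi(y, y_pred), metrics.ari(y, y_pred)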