BayesWatch / deep-kernel-transfer

Official PyTorch implementation of the paper "Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels" (NeurIPS 2020)
https://arxiv.org/abs/1910.05199

Different Accuracies! #15

Closed ZohrehAdabi closed 2 years ago

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. I'm using the DKT code for classification on CUB [5-way, 1-shot]. I save two models during meta-training, best_model and last_model. In test.py I add some code to test both best_model and last_model, creating two instances of the DKT class:

elif params.method == 'DKT':
        # model           = DKT(model_dict[params.model], **few_shot_params)
        last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

and load the saved files for each of them:

        best, last = True, True
        modelfile = None
        if params.save_iter != -1:
            modelfile   = get_assigned_file(checkpoint_dir,params.save_iter)
        if best:
            best_modelfile   = get_best_file(checkpoint_dir)
            print(f'\n best model {best_modelfile}')
        if last:
            files = os.listdir(checkpoint_dir)
            nums =  [int(f.split('.')[0]) for f in files if 'best' not in f]
            num = max(nums)
            print(f'\nModel at last epoch {num}')
            last_modelfile = os.path.join(checkpoint_dir, '{:d}.tar'.format(num))
            print(f'\nlast model {last_modelfile}\n')

        if best and best_modelfile is not None:
            best_model = best_model.cuda()
            tmp = torch.load(best_modelfile)
            best_model.load_state_dict(tmp['state'])

        if last and last_modelfile is not None:
            last_model = last_model.cuda()
            tmp = torch.load(last_modelfile)
            last_model.load_state_dict(tmp['state'])

When I run test.py for these models:

        if last:
            print('last')
            last_model.eval()
            acc_mean, acc_std = last_model.test_loop( novel_loader, return_std = True)
            print("-----------------------------")
            print('Test Acc last model = %4.2f%% +- %4.2f%%' %(acc_mean, acc_std))
            print("-----------------------------") 

        if best:
            print('Best') 
            best_model.eval()
            acc_mean, acc_std = best_model.test_loop( novel_loader, return_std = True)
            print("-----------------------------")
            print('Test Acc best model = %4.2f%% +- %4.2f%%' %(acc_mean, acc_std))
            print("-----------------------------") 

I have a problem with the accuracies. If I run best_model or last_model alone, I get certain accuracies; if I run both of them (best, last = True, True) I get a different accuracy for whichever model runs second (running last_model after the best_model test changes the accuracy of last_model).

gpytorch 1.6.0, torch 1.10.0

Could you please help me to figure out what the problem is? I really appreciate any help you can provide.

mpatacchiola commented 2 years ago

Hi @ZohrehAdabi

Can it be a problem related to the batch-norm statistics?

Since the backbone uses batch-norm, you should be careful with the calls to model.train() and model.eval() to avoid any unwanted update of the batch-norm statistics. Have a look at this discussion for more details.
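A minimal sketch illustrating this behaviour, assuming a plain nn.BatchNorm2d layer: in train mode every forward pass updates the running statistics, while in eval mode the stored statistics are used and left untouched.

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(3)
    x = torch.randn(8, 3, 16, 16)

    bn.train()
    _ = bn(x)                                      # running_mean / running_var are updated here
    print(bn.running_mean)

    bn.eval()
    before = bn.running_mean.clone()
    _ = bn(x)                                      # stored statistics are used, not updated
    print(torch.equal(before, bn.running_mean))    # True: no change in eval mode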

ZohrehAdabi commented 2 years ago

I've checked that already @mpatacchiola. In the correct() function you used .eval():

        with torch.no_grad(), gpytorch.settings.num_likelihood_samples(32):
            self.model.eval()
            self.likelihood.eval()
            self.feature_extractor.eval()
            z_query = self.feature_extractor.forward(x_query).detach()
            if(self.normalize): z_query = F.normalize(z_query, p=2, dim=1)
            z_query_list = [z_query]*len(y_query)
            predictions = self.likelihood(*self.model(*z_query_list)) #return n_way MultiGaussians
            predictions_list = list()
            for gaussian in predictions:
                predictions_list.append(torch.sigmoid(gaussian.mean).cpu().detach().numpy())
            y_pred = np.vstack(predictions_list).argmax(axis=0) #[model, classes]
            top1_correct = np.sum(y_pred == y_query)
            count_this = len(y_query)
        return float(top1_correct), count_this, avg_loss/float(N+1e-10)

Also, before calling model.test_loop:

        model.eval()
        acc_mean, acc_std = model.test_loop( novel_loader, return_std = True)

Is there anything else I should check? Thanks.

mpatacchiola commented 2 years ago

Hi @ZohrehAdabi

Things I would check are the following:

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola

mpatacchiola commented 2 years ago

@ZohrehAdabi I am not sure where the problem can be.

It can be an issue with the dataloader. You could try to comment out the data manager lines and pass synthetic tasks (e.g. images of random Gaussian noise) that you build in advance. You can just create your own task as a tensor and use it in both phases to see if the output changes. If the accuracy stays the same, this would strongly suggest that the culprit is the data manager.

If the test above gives you the same outcome then something I would try is to use the same code with another model (e.g. ProtoNets) to see if the issue is due to some of the code in DKT.py.

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. Using random tasks like this

        tasks = []
        for i in range(5):
            # one synthetic 5-way task: 5 classes x 16 Gaussian-noise images of size 3x84x84
            data_0  = torch.randn([16, 3, 84, 84])
            data_1  = torch.randn([16, 3, 84, 84])
            data_2  = torch.randn([16, 3, 84, 84])
            data_3  = torch.randn([16, 3, 84, 84])
            data_4  = torch.randn([16, 3, 84, 84])

            data = torch.stack([data_0, data_1, data_2, data_3, data_4])
            tasks.append(data)

and calling model.correct(tasks[i]), the accuracies stay the same. Thank you. I will test each model separately so that both use the same data despite the data_loader randomness and have a safe test. Why do you think the data_manager creates such an issue?
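A minimal sketch of one way to keep such a comparison safe, assuming the data manager samples episodes lazily through the global random/NumPy/PyTorch generators: re-seed those generators immediately before each test_loop call so that both models see the same sequence of episodes.

    import random
    import numpy as np
    import torch

    def reseed(seed=0):
        # reset the global generators the episode sampler may draw from
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    reseed(0)
    acc_best, std_best = best_model.test_loop(novel_loader, return_std=True)
    reseed(0)
    acc_last, std_last = last_model.test_loop(novel_loader, return_std=True)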

ZohrehAdabi commented 2 years ago

There is another problem with the accuracy! I define two models at the same time,

    elif params.method == 'DKT':

        last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

and in the remainder of the code I control which models run with the boolean variables best and last. Using random tasks, there is no change in the accuracies when I change the order of the tests for best_model and last_model. But when I comment out the definition of one model like this:

    elif params.method == 'DKT':

        #last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

I get a higher accuracy for the other model. Why does the definition of the models affect the test? I used the random tasks and this still happens (the models have different ids and use their own state_dicts, as before).
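A minimal sketch of a quick sanity check that the two objects are independent instances with their own parameters, assuming the same DKT constructor call as in the snippet above:

    import torch

    # two separate instances, as in the snippet above
    best_model = DKT(model_dict[params.model], **few_shot_params)
    last_model = DKT(model_dict[params.model], **few_shot_params)

    print(id(best_model) != id(last_model))      # True: distinct objects

    # before any checkpoint is loaded their weights generally differ,
    # since each construction draws its own random initial values
    p_best = torch.cat([p.detach().flatten() for p in best_model.parameters()])
    p_last = torch.cat([p.detach().flatten() for p in last_model.parameters()])
    print(torch.equal(p_best, p_last))           # typically False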

mpatacchiola commented 2 years ago

The data_manager seems to be the issue for the previous problem.

For the second problem with the two models, are you loading the same pretrained models for both objects or did you initialize them from scratch?

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. I initialize them from scratch:

        #best, last = True, True
        best, last = True, False

        if best and best_modelfile is not None:
            best_model = best_model.cuda()
            tmp = torch.load(best_modelfile)
            best_model.load_state_dict(tmp['state'])

        if last and last_modelfile is not None:
            last_model = last_model.cuda()
            tmp = torch.load(last_modelfile)
            last_model.load_state_dict(tmp['state'])

then run test_loop(). When last_model and best_model are both instantiated but only best_model is loaded from its checkpoint and best_model.test_loop() is run, I get some accuracy for best_model. But when, in another run of test.py, the last_model definition is commented out, best_model gives a different (higher) accuracy. Can creating an instance of a model affect the other instances of it? Thanks.