BayesWatch / deep-kernel-transfer

Official PyTorch implementation of the paper "Bayesian Meta-Learning for the Few-Shot Setting via Deep Kernels" (NeurIPS 2020)
https://arxiv.org/abs/1910.05199

Different Accuracies! #15

Closed ZohrehAdabi closed 2 years ago

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. I'm using the DKT code for classification on CUB [5-way, 1-shot]. I save two models during meta-training, best_model and last_model. In test.py I add some code to test both best_model and last_model, creating two instances of the DKT class:

elif params.method == 'DKT':
        # model           = DKT(model_dict[params.model], **few_shot_params)
        last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

and load the saved files for each of them:

        best, last = True, True
        modelfile = None
        if params.save_iter != -1:
            modelfile   = get_assigned_file(checkpoint_dir,params.save_iter)
        if best:
            best_modelfile   = get_best_file(checkpoint_dir)
            print(f'\n best model {best_modelfile}')
        if last:
            files = os.listdir(checkpoint_dir)
            nums =  [int(f.split('.')[0]) for f in files if 'best' not in f]
            num = max(nums)
            print(f'\nModel at last epoch {num}')
            last_modelfile = os.path.join(checkpoint_dir, '{:d}.tar'.format(num))
            print(f'\nlast model {last_modelfile}\n')

        if best and best_modelfile is not None:
            best_model = best_model.cuda()
            tmp = torch.load(best_modelfile)
            best_model.load_state_dict(tmp['state'])

        if last and last_modelfile is not None:
            last_model = last_model.cuda()
            tmp = torch.load(last_modelfile)
            last_model.load_state_dict(tmp['state'])

When I run test.py for these models:

        if last:
            print('last')
            last_model.eval()
            acc_mean, acc_std = last_model.test_loop( novel_loader, return_std = True)
            print("-----------------------------")
            print('Test Acc last model = %4.2f%% +- %4.2f%%' %(acc_mean, acc_std))
            print("-----------------------------") 

        if best:
            print('Best') 
            best_model.eval()
            acc_mean, acc_std = best_model.test_loop( novel_loader, return_std = True)
            print("-----------------------------")
            print('Test Acc best model = %4.2f%% +- %4.2f%%' %(acc_mean, acc_std))
            print("-----------------------------") 

I have a problem with the accuracies. If I run best_model or last_model alone, I get certain accuracies; if I run both of them (best, last = True, True) I get a different accuracy for whichever model runs second (running last_model after the best_model test changes the accuracy of last_model).

gpytorch 1.6.0, torch 1.10.0

Could you please help me to figure out what the problem is? I really appreciate any help you can provide.

mpatacchiola commented 2 years ago

Hi @ZohrehAdabi

Can it be a problem related to the batch-norm statistics?

Since the backbone uses batch-norm, you should be careful with the calls to model.train() and model.eval() to avoid any unwanted update of the batch-norm statistics. Have a look at this discussion for more details.
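A minimal sketch illustrating this behaviour, assuming a plain nn.BatchNorm2d layer: in train mode every forward pass updates the running statistics, while in eval mode the stored statistics are used and left untouched.

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(3)
    x = torch.randn(8, 3, 16, 16)

    bn.train()
    _ = bn(x)                                      # running_mean / running_var are updated here
    print(bn.running_mean)

    bn.eval()
    before = bn.running_mean.clone()
    _ = bn(x)                                      # stored statistics are used, not updated
    print(torch.equal(before, bn.running_mean))    # True: no change in eval mode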

ZohrehAdabi commented 2 years ago

I've checked that already @mpatacchiola. In the correct() function you used .eval():

        with torch.no_grad(), gpytorch.settings.num_likelihood_samples(32):
            self.model.eval()
            self.likelihood.eval()
            self.feature_extractor.eval()
            z_query = self.feature_extractor.forward(x_query).detach()
            if(self.normalize): z_query = F.normalize(z_query, p=2, dim=1)
            z_query_list = [z_query]*len(y_query)
            predictions = self.likelihood(*self.model(*z_query_list)) #return n_way MultiGaussians
            predictions_list = list()
            for gaussian in predictions:
                predictions_list.append(torch.sigmoid(gaussian.mean).cpu().detach().numpy())
            y_pred = np.vstack(predictions_list).argmax(axis=0) #[model, classes]
            top1_correct = np.sum(y_pred == y_query)
            count_this = len(y_query)
        return float(top1_correct), count_this, avg_loss/float(N+1e-10)

Also, before calling model.test_loop:

        model.eval()
        acc_mean, acc_std = model.test_loop( novel_loader, return_std = True)

Is there anything else I should check? Thanks.

mpatacchiola commented 2 years ago

Hi @ZohrehAdabi

Things I would check are the following:

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola

mpatacchiola commented 2 years ago

@ZohrehAdabi I am not sure where the problem can be.

It can be an issue with the dataloader. You could try to comment out the data manager lines and pass synthetic tasks (e.g. images of random Gaussian noise) that you build in advance. You can just create your own task as a tensor and use it in both phases to see if the output changes. If the accuracy stays the same, this would strongly suggest that the culprit is the data manager.

If the test above gives you the same outcome then something I would try is to use the same code with another model (e.g. ProtoNets) to see if the issue is due to some of the code in DKT.py.

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. Using random tasks like this

        tasks = []
        for i in range(5):
            # one synthetic 5-way task: 5 classes x 16 Gaussian-noise images of size 3x84x84
            data_0  = torch.randn([16, 3, 84, 84])
            data_1  = torch.randn([16, 3, 84, 84])
            data_2  = torch.randn([16, 3, 84, 84])
            data_3  = torch.randn([16, 3, 84, 84])
            data_4  = torch.randn([16, 3, 84, 84])

            data = torch.stack([data_0, data_1, data_2, data_3, data_4])
            tasks.append(data)

and calling model.correct(tasks[i]), the accuracies stay the same. Thank you. I will test each model separately so that both use the same data despite the data_loader randomness and have a safe test. Why do you think the data_manager creates such an issue?
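A minimal sketch of one way to keep such a comparison safe, assuming the data manager samples episodes lazily through the global random/NumPy/PyTorch generators: re-seed those generators immediately before each test_loop call so that both models see the same sequence of episodes.

    import random
    import numpy as np
    import torch

    def reseed(seed=0):
        # reset the global generators the episode sampler may draw from
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)

    reseed(0)
    acc_best, std_best = best_model.test_loop(novel_loader, return_std=True)
    reseed(0)
    acc_last, std_last = last_model.test_loop(novel_loader, return_std=True)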

ZohrehAdabi commented 2 years ago

There is another problem with the accuracy! I define two models at the same time,

    elif params.method == 'DKT':

        last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

and in the remainder of the code I control which models run with the boolean variables best and last. Using random tasks, there is no change in the accuracies when I change the order of the tests for best_model and last_model. But when I comment out the definition of one model like this:

    elif params.method == 'DKT':

        #last_model      = DKT(model_dict[params.model], **few_shot_params)
        best_model      = DKT(model_dict[params.model], **few_shot_params)

I get a higher accuracy for the other model. Why does the definition of the models affect the test? I used the random tasks and this still happens (the models have different ids and use their own state_dicts, as before).
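A minimal sketch of a quick sanity check that the two objects are independent instances with their own parameters, assuming the same DKT constructor call as in the snippet above:

    import torch

    # two separate instances, as in the snippet above
    best_model = DKT(model_dict[params.model], **few_shot_params)
    last_model = DKT(model_dict[params.model], **few_shot_params)

    print(id(best_model) != id(last_model))      # True: distinct objects

    # before any checkpoint is loaded their weights generally differ,
    # since each construction draws its own random initial values
    p_best = torch.cat([p.detach().flatten() for p in best_model.parameters()])
    p_last = torch.cat([p.detach().flatten() for p in last_model.parameters()])
    print(torch.equal(p_best, p_last))           # typically False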

mpatacchiola commented 2 years ago

The data_manager seems to be the issue for the previous problem.

For the second problem with the two models, are you loading the same pretrained models for both objects or did you initialize them from scratch?

ZohrehAdabi commented 2 years ago

Hi @mpatacchiola. I initialize them from scratch:

        #best, last = True, True
        best, last = True, False

        if best and best_modelfile is not None:
            best_model = best_model.cuda()
            tmp = torch.load(best_modelfile)
            best_model.load_state_dict(tmp['state'])

        if last and last_modelfile is not None:
            last_model = last_model.cuda()
            tmp = torch.load(last_modelfile)
            last_model.load_state_dict(tmp['state'])

then run test_loop(). When last_model and best_model are both instantiated but only best_model is loaded from its checkpoint and best_model.test_loop() is run, I get some accuracy for best_model. But when, in another run of test.py, the last_model definition is commented out, best_model gives a different (higher) accuracy. Can creating an instance of a model affect the other instances of it? Thanks.