Confusezius / ECCV2020_DiVA_MultiFeature_DML

(ECCV 2020) This repo contains code for "DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning" (https://arxiv.org/abs/2004.13458), which extends vanilla DML with auxiliary and self-supervised features.

worse reproduced results #1

Closed · Dyfine closed this issue 3 years ago

Dyfine commented 3 years ago

Hi, thanks for your great work and the DiVA repo! I'm currently reproducing your paper results with this repo: I ran ECCV2020_DiVA_SampleRuns.sh on CUB (ResNet50), but the best result I get is R@1 = 68.35, which is worse than the 69.2 reported in the paper. I use pytorch1.8.0.dev + faiss_gpu1.4.0 + cuda11 on one 3090 GPU; the detailed results are attached below. Is there any problem with my environment settings, or should I modify the settings in ECCV2020_DiVA_SampleRuns.sh?

[Attached plots: test e_recall and nmi curves for the combined evaltypes with branch weightings 0.5-1-1-1, 0.5-1.5-1.5-1.5, and 0.75-1.25-1.25-1.25.]

Besides, I also tried another environment, pytorch1.5.1 + faiss_gpu1.6.3 + cuda10 on two 2080Ti GPUs (I wrap model and selfsim_model in torch.nn.DataParallel, since they don't fit on one 2080Ti), and the best results there are R@1 = 68.26 (0.75-1.25-1.25-1.25) and NMI = 71.20 (0.5-1-1-1). Thanks.
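For reference, the multi-GPU wrap mentioned above is the standard torch.nn.DataParallel pattern; a minimal sketch (the resnet50 stand-ins are hypothetical - the actual model and selfsim_model come from the DiVA codebase):

import torch
import torchvision

# Hypothetical stand-ins for the repo's two networks.
model = torchvision.models.resnet50()
selfsim_model = torchvision.models.resnet50()

# Split each batch across the two 2080Ti GPUs so the models fit in memory.
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
selfsim_model = torch.nn.DataParallel(selfsim_model, device_ids=[0, 1]).cuda()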

Confusezius commented 3 years ago

Hi there! Thanks for adding all the info when opening this issue - it makes debugging much easier :).

So while the environment you evaluate on differs from the one we trained/tested on, the changes shouldn't be this significant.

The first thing that comes to mind is to check the influence of the single scheduling step, as convergence behaviour may depend on it a bit: adjust the default --tau 55 --gamma 0.2 to something in e.g. --tau [40, ..., 80] and try --gamma [0.1, 0.3], or test two scheduling steps such as --tau 60 90 --gamma 0.3 (although this shouldn't be necessary).
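For reference, a minimal sketch of what this step scheduling amounts to, assuming the repo's --scheduler step / --tau / --gamma options map onto PyTorch's standard MultiStepLR (the bare optimizer below is only a placeholder):

import torch

# Placeholder parameter/optimizer; lr matches the run's default of 1e-5.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=1e-5)

# --tau 60 90 --gamma 0.3: multiply the LR by 0.3 at epochs 60 and 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 90], gamma=0.3)

for epoch in range(200):
    # ... one training epoch ...
    optimizer.step()
    scheduler.step()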

Finally, due to the more complex nature of the training, the final performance depends somewhat more on the seed, so trying different seeds (i.e. setting --seed 0/1/2/3/4) may help to check whether it is just a seed-based deviation.
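For completeness, a sketch of what full seeding typically involves in PyTorch (an assumption for illustration - the repo's --seed flag presumably covers something similar):

import random

import numpy as np
import torch

def set_seed(seed):
    # Seed every RNG that can influence training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuDNN determinism trades some speed for repeatability.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

for seed in (0, 1, 2, 3, 4):
    set_seed(seed)
    # ... run one training/eval cycle per seed and compare the R@1 spread ...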

Just to be sure, I checked again, and with --tau 60 --gamma 0.3 I get the results shown in the attached image.

Also note that on CUB you definitely don't have to train for 350 epochs - I easily get the best performance in under 175/200 epochs :).

[Attached screenshot (2020-12-23): results with --tau 60 --gamma 0.3.]

Dyfine commented 3 years ago

Thanks for your detailed reply! I will have a try and reply to you after my experiments.

Dyfine commented 3 years ago

Hi @Confusezius , I ran some experiments but still can't reach an R@1 of 69.2. This time I again used the first environment (pytorch1.8.0.dev + faiss_gpu1.4.0 + cuda11), and the best R@1 results I get are:

different seed

  1. --tau 60 --gamma 0.3 --n_epochs 200 --seed 0 68.40
  2. --tau 60 --gamma 0.3 --n_epochs 200 --seed 1 67.29
  3. --tau 60 --gamma 0.3 --n_epochs 200 --seed 2 67.52
  4. --tau 60 --gamma 0.3 --n_epochs 200 --seed 3 68.35
  5. --tau 60 --gamma 0.3 --n_epochs 200 --seed 4 67.78

different scheduling

  6. --tau 80 --gamma 0.3 --n_epochs 200 --seed 0 68.20
  7. --tau 90 --gamma 0.3 --n_epochs 200 --seed 0 68.01
  8. --tau 60 90 --gamma 0.3 --n_epochs 200 --seed 0 68.55 (the best I get among all 8 settings)

The R@1 plot of setting 1 is attached below. [Plot: R@1 over epochs, setting 1]

The R@1 plot of setting 8 is also attached below. [Plot: R@1 over epochs, setting 8] I find that the R@1 at epoch 60 is already about 0.65, which is worse than your result at that point, about 0.67.
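(For reference: e_recall@1 here is plain nearest-neighbour Recall@1. A minimal sketch of how it can be computed with faiss, which both environments above include - the repo's exact evaluation code may differ:

import faiss
import numpy as np

def recall_at_1(embeds, labels):
    # Exact L2 nearest-neighbour search over the test embeddings.
    index = faiss.IndexFlatL2(embeds.shape[1])
    index.add(embeds.astype(np.float32))
    # k=2 because each query retrieves itself at rank 1.
    _, nn = index.search(embeds.astype(np.float32), 2)
    return float(np.mean(labels[nn[:, 1]] == labels))

)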

The parameter info of setting 8 is shown below.

dataset: cub200
train_val_split: 1
lr: 1e-05
fc_lr: -1
n_epochs: 200
kernels: 8
bs: 112
seed: 0
scheduler: step
gamma: 0.3
decay: 0.0004
tau: [60, 90]
use_sgd: False
loss: margin
batch_mining: distance
extension: none
embed_dim: 128
arch: resnet50_frozen_normalize
not_pretrained: False
evaluation_metrics: ['e_recall@1', 'e_recall@2', 'e_recall@4', 'nmi', 'f1', 'mAP_c']
evaltypes: ['Combined_discriminative_selfsimilarity_shared_intra-0.75-1.25-1.25-1.25', 'Combined_discriminative_selfsimilarity_shared_intra-0.5-1-1-1', 'Combined_discriminative_selfsimilarity_shared_intra-0.5-1.5-1.5-1.5']
storage_metrics: ['e_recall@1']
realistic_augmentation: False
realistic_main_augmentation: False
gpu: [0]
savename:
source_path: ./datasets/cub200
save_path: /data/dyfine/ECCV2020_DiVA_MultiFeature_DML-master/Training_Results/cub200/CUB200_RESNET50_FROZEN_NORMALIZE_2020-12-24-23-54-53
data_sampler: class_random
samples_per_class: 2
data_batchmatch_bigbs: 512
data_batchmatch_ncomps: 10
data_storage_no_update: False
data_d2_coreset_lambda: 1
data_gc_coreset_lim: 1e-09
data_sampler_lowproj_dim: -1
data_sim_measure: euclidean
data_gc_softened: False
data_idx_full_prec: False
data_mb_mom: -1
data_mb_lr: 1
miner_distance_lower_cutoff: 0.5
miner_distance_upper_cutoff: 1.4
loss_contrastive_pos_margin: 0
loss_contrastive_neg_margin: 1
loss_triplet_margin: 0.2
loss_margin_margin: 0.2
loss_margin_beta_lr: 0.0005
loss_margin_beta: 1.2
loss_margin_nu: 0
loss_margin_beta_constant: False
loss_proxynca_lr: 0.0005
loss_npair_l2: 0.005
loss_angular_alpha: 36
loss_angular_npair_ang_weight: 2
loss_angular_npair_l2: 0.005
loss_multisimilarity_pos_weight: 2
loss_multisimilarity_neg_weight: 40
loss_multisimilarity_margin: 0.1
loss_multisimilarity_thresh: 0.5
loss_lifted_neg_margin: 1
loss_lifted_l2: 0.005
loss_binomial_pos_weight: 2
loss_binomial_neg_weight: 40
loss_binomial_margin: 0.1
loss_binomial_thresh: 0.5
loss_quadruplet_alpha1: 1
loss_quadruplet_alpha2: 0.5
loss_softtriplet_n_centroids: 10
loss_softtriplet_margin_delta: 0.01
loss_softtriplet_gamma: 0.1
loss_softtriplet_lambda: 20
loss_softtriplet_reg_weight: 0.2
loss_softtriplet_lr: 0.0005
loss_softmax_lr: 1e-05
loss_softmax_temperature: 0.05
loss_histogram_nbins: 51
loss_snr_margin: 0.2
loss_snr_reg_lambda: 0.005
loss_snr_beta: 0
loss_snr_beta_lr: 0.0005
loss_arcface_lr: 0.0005
loss_arcface_angular_margin: 0.5
loss_arcface_feature_scale: 64
loss_quadruplet_margin_alpha_1: 0.2
loss_quadruplet_margin_alpha_2: 0.2
log_online: False
wandb_key: <your_api_key_here>
project: DiVA_SampleRuns
group: CUB_DiVA-R50-512
diva_ssl: fast_moco
diva_sharing: random
diva_intra: random
diva_features: ['discriminative', 'selfsimilarity', 'shared', 'intra']
diva_decorrelations: ['selfsimilarity-discriminative', 'shared-discriminative', 'intra-discriminative']
diva_rho_decorrelation: [1500.0, 1500.0, 1500.0]
diva_decorrnet_dim: 512
diva_decorrnet_lr: 1e-05
diva_instdiscr_temperature: 0.1
diva_dc_update_f: 2
diva_dc_ncluster: 300
diva_moco_momentum: 0.9
diva_moco_temperature: 0.01
diva_moco_n_key_batches: 30
diva_moco_lower_cutoff: 0.5
diva_moco_upper_cutoff: 1.4
diva_moco_temp_lr: 0.0005
diva_moco_trainable_temp: False
diva_alpha_ssl: 0.3
diva_alpha_shared: 0.3
diva_alpha_intra: 0.3
pretrained: True
device: cuda
network_feature_dim: 2048
n_classes: 100

I have checked my CUB dataset; it is the same as the one used in https://github.com/Confusezius/Revisiting_Deep_Metric_Learning_PyTorch. The only change I made to the code is in adversarial_seperation.py, required by pytorch1.8 (and also pytorch1.5), as shown below.

import torch

class GradRev(torch.autograd.Function):
    """Gradient reversal layer, rewritten with the static-method
    autograd API that PyTorch 1.5+ requires."""

    @staticmethod
    def forward(ctx, x):
        # Identity in the forward pass.
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient's sign in the backward pass.
        return grad_output * -1.

def grad_reverse(x):
    return GradRev.apply(x)
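As a quick sanity check (not part of the repo's code) that this rewrite still reverses gradients - given the definitions above, the backward pass should yield -1 for every input element:

x = torch.ones(3, requires_grad=True)
grad_reverse(x).sum().backward()
print(x.grad)  # expected: tensor([-1., -1., -1.])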

I still can't figure out what the problem is. Do you have any idea?

Confusezius commented 3 years ago

That is really quite weird - there are some things you can try to generally improve performance: [1] Take the 68.55 run and train for 300 epochs, just for completeness, to see if the performance still improves. [2] Adjust the adversarial weightings --diva_rho_decorrelation to e.g. [1000, 1000, 1000] or [2000, 2000, 2000], or adjust the weight terms --diva_alpha_ssl/shared/intra to e.g. 0.2 or 0.4, to see if the change in convergence can be accounted for by slightly adjusting the levels of regularisation.
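A sketch of how such a sweep could be scripted (main.py is a hypothetical entry-point name - substitute whatever ECCV2020_DiVA_SampleRuns.sh actually invokes; the flags themselves appear in the parameter dump above):

import itertools
import subprocess

for rho, alpha in itertools.product([1000, 1500, 2000], [0.2, 0.3, 0.4]):
    # --diva_rho_decorrelation takes one value per decorrelation pair.
    subprocess.run(
        ["python", "main.py",
         "--dataset", "cub200",
         "--diva_rho_decorrelation", str(rho), str(rho), str(rho),
         "--diva_alpha_ssl", str(alpha),
         "--diva_alpha_shared", str(alpha),
         "--diva_alpha_intra", str(alpha)],
        check=True,
    )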

Once I have the time, I'll also check again with newer PyTorch versions to see if I can replicate the issue! The plot I published was made with a repo that has since been adjusted, so I'll also check with a version closest to this specific one :).

Dyfine commented 3 years ago

Thanks for your reply and suggestions :) I'll keep trying - and could you share the specific environment settings you used (e.g. the versions of pytorch, faiss and cuda)? I may try them if available.

XinyiXuXD commented 3 years ago

Hi, may I ask whether you have reproduced the results on Cars196? I tried but only get about 84 for the Recall@1 metric. BTW, I have set [diva_rho_decorrelation, alpha] = [100, 0.1] as the paper says. @Dyfine

XinyiXuXD commented 3 years ago

Hi, thank you for the great job - DiVA is such interesting work! Could you please provide information about the hyperparameters on the Cars196 and SOP datasets? I came across some issues when trying to reproduce the results in your paper. @Confusezius

Confusezius commented 3 years ago

Hey there, so I was able to reproduce the results on two separate server instances, and the specific parameters used were

rho_decorrelation = 100, alpha = 0.15

for Cars196 (with a slight difference to the results reported in the paper due to some small pipeline changes; also make sure you check the results which reweigh the non-discriminative branches with 1.5 and the discriminative one with 0.5, as that offers the best regularization), and

rho_decorrelation = 150, alpha = 0.2

Let me know if that helps! Since there are quite a lot of moving parts, it can be a bit fickle and setup-dependent.

XinyiXuXD commented 3 years ago

Hi, thanks for your reply! I will try harder on Cars196 based on the details you supplied, and I'll report back here if I get new results. Is the rho_decorrelation = 150, alpha = 0.2 setting for the SOP dataset? @Confusezius

Confusezius commented 3 years ago

Yes it is :)

Dyfine commented 3 years ago

> Hi, may I ask whether you have reproduced the results on Cars196? I tried but only get about 84 for the Recall@1 metric. BTW, I have set [diva_rho_decorrelation, alpha] = [100, 0.1] as the paper says. @Dyfine

Hi @XinyiXuXD , sorry, I haven't conducted experiments on the Cars and SOP datasets. May I ask whether you have reproduced the results on CUB? My experiments on CUB reach a best R@1 of 68.84, which is a little worse than the reported result.

XinyiXuXD commented 3 years ago

Hi @Dyfine, I didn't get the performance reported in the paper either.

Confusezius commented 3 years ago

Hey, so an R@1 of 68.84 is reasonably close on CUB; however, 84 is way too low on Cars196 and shouldn't be happening. Are you using the Inceptionnet backbone? In that case, 84 would be the score range in which you land.

Confusezius commented 3 years ago

@XinyiXuXD Just to make sure, could you list the parameter setting with which you are running on CARS196 (and the other datasets)?

XinyiXuXD commented 3 years ago

Hi @Confusezius, I initially set the weight for each branch to 1 and got around 84 on Cars196. Following your reply, I reweighted the discriminative branch with 0.5 and the non-discriminative ones with 1.5 and got around 86 on Cars196. Based on my experimental results, the branch weights have a big effect.

BTW, how did you get the weights for the branches?

Confusezius commented 3 years ago

The branch weights you can get from simple (cross-)validation; they transfer very well to the test case :). Indeed, the default setting should cover a range of branch weightings that may help. Without validation experiments it's really hard to determine what good branch weights are; it depends on how well the auxiliary features can be estimated on the given dataset. For example, on Cars196 all auxiliary feature types are well defined and can be estimated pretty well, which is why a higher weighting is beneficial.

Either way, treat the weighting as a simple hyperparameter determined via validation experiments :).
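For illustration, the evaltype names in the parameter dump above (e.g. ...-0.5-1.5-1.5-1.5) suggest a weighted concatenation of the four branch embeddings; a minimal sketch of that mechanism, not necessarily the repo's exact evaluation code:

import torch
import torch.nn.functional as F

def combine_branches(branch_embeds, weights=(0.5, 1.5, 1.5, 1.5)):
    # Scale each normalized branch embedding, concatenate, renormalize.
    scaled = [w * F.normalize(e, dim=-1) for w, e in zip(weights, branch_embeds)]
    return F.normalize(torch.cat(scaled, dim=-1), dim=-1)

# Example: four 128-dim branch embeddings for a batch of 8 images -> (8, 512).
embeds = [torch.randn(8, 128) for _ in range(4)]
combined = combine_branches(embeds)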

Confusezius commented 3 years ago

Let me know if there are other issues either by reopening or opening another issue :).