Lucky-Light-Sun closed this issue 4 months ago.
Hi, after modifying the mean and std of data_preprocessor, max_epochs of train_cfg, and weight_decay of optim_wrapper according to the issue "Can you share the training configs?", the best retrieval/R1 of 46.0000 was achieved at epoch 8. However, this is still a little lower than the result reported in the paper.
Could you share the new training config for the latest code?
Best wishes!
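For context on the mean/std change: the commented-out pair in the config, mean=[122.771, 116.746, 104.093] / std=[68.500, 66.632, 70.323], equals CLIP's published image-normalization constants rescaled to the 0-255 pixel range (to within 0.001), while the active pair, mean=[123.675, 116.28, 103.53] / std=[58.395, 57.12, 57.3751], matches the standard ImageNet statistics. The pure-Python sketch below is illustrative only (the constant and helper names are hypothetical, not part of the project): it shows the conversion and the per-channel (x - mean) / std normalization, so you can see that the same pixel produces different model inputs under each choice. It does not establish which pair the paper's results used.

```python
# CLIP's image-normalization constants on the 0-1 scale (as published with OpenAI CLIP).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
# Standard ImageNet constants on the 0-1 scale.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def to_255_scale(values):
    """Rescale 0-1 statistics to the 0-255 pixel range used in the config."""
    return [v * 255.0 for v in values]


def normalize_pixel(rgb, mean, std):
    """Per-channel (x - mean) / std for one RGB pixel in the 0-255 range."""
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]


if __name__ == '__main__':
    clip_mean, clip_std = to_255_scale(CLIP_MEAN), to_255_scale(CLIP_STD)
    inet_mean, inet_std = to_255_scale(IMAGENET_MEAN), to_255_scale(IMAGENET_STD)
    print('CLIP mean/std (0-255):    ', clip_mean, clip_std)
    print('ImageNet mean/std (0-255):', inet_mean, inet_std)
    gray = (128, 128, 128)  # a mid-gray RGB pixel
    print('normalized with CLIP stats:    ', normalize_pixel(gray, clip_mean, clip_std))
    print('normalized with ImageNet stats:', normalize_pixel(gray, inet_mean, inet_std))
```

Since the visual encoder is initialized from pretrained CLIP weights, which statistics the backbone was pretrained with may matter; that is worth checking against the maintainers' reference config rather than assuming either choice.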
model = dict(
    type='CLIPSimilarity_split',
    visual_encoder=dict(type='VITCLIPPretrained_STAN', pretrained_model=pretrained_model, clip_weight=clip_weight),
    text_encoder=dict(type='CLIPTextPretrained', pretrained_model=pretrained_model, clip_weight=clip_weight),
    to_float32=True,
    frozen_layers=False,
    data_preprocessor=dict(
        type='MultiModalDataPreprocessor',
        preprocessors=dict(
            imgs=dict(
                type='ActionDataPreprocessor',
                # mean=[122.771, 116.746, 104.093],
                # std=[68.500, 66.632, 70.323],
                mean=[123.675, 116.28, 103.53],
                std=[58.395, 57.12, 57.3751],
                format_shape='NCHW'),
            text=dict(type='ActionDataPreprocessor', to_float32=False))),
    tau=0.01,
    adapter=None)
train_cfg = dict(
    type='EpochBasedTrainLoop', max_epochs=20, val_begin=1, val_interval=1)
optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW',
        lr=2e-06,
        betas=(0.9, 0.98),
        eps=1e-08,
        # weight_decay=0.05),
        weight_decay=0.02),
    paramwise_cfg=dict(
        norm_decay_mult=0., bias_decay_mult=0.,
        custom_keys={
            'STAN': dict(lr_mult=10.),
        }),
    clip_grad=dict(max_norm=5, norm_type=2)
)
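The `paramwise_cfg` above changes hyperparameters per parameter, so not every weight trains at lr=2e-06 with weight_decay=0.02. The sketch below is a simplified pure-Python approximation, not MMEngine's code, of how these settings would resolve. It assumes, as we understand MMEngine's default optimizer-wrapper constructor, that a `custom_keys` entry matches as a substring of the full parameter name and, when it matches, overrides the norm/bias rules. The parameter names are hypothetical; which parameters actually contain `'STAN'` depends on the module names inside `VITCLIPPretrained_STAN`.

```python
from dataclasses import dataclass

# Values copied from the optim_wrapper config above.
BASE_LR = 2e-06
BASE_WD = 0.02
NORM_DECAY_MULT = 0.0
BIAS_DECAY_MULT = 0.0
CUSTOM_KEYS = {'STAN': {'lr_mult': 10.0}}


@dataclass
class ParamInfo:
    name: str      # full name, as returned by model.named_parameters()
    is_norm: bool  # True if the owning module is a normalization layer


def resolve_group(p: ParamInfo) -> dict:
    """Simplified approximation of paramwise_cfg resolution (not MMEngine code)."""
    # Longer keys are tried first; a substring match overrides the rules below.
    for key in sorted(CUSTOM_KEYS, key=len, reverse=True):
        if key in p.name:
            cfg = CUSTOM_KEYS[key]
            return {'lr': BASE_LR * cfg.get('lr_mult', 1.0),
                    'weight_decay': BASE_WD * cfg.get('decay_mult', 1.0)}
    if p.is_norm:
        wd = BASE_WD * NORM_DECAY_MULT
    elif p.name.endswith('bias'):
        wd = BASE_WD * BIAS_DECAY_MULT
    else:
        wd = BASE_WD
    return {'lr': BASE_LR, 'weight_decay': wd}


if __name__ == '__main__':
    # Hypothetical parameter names, for illustration only.
    for p in [ParamInfo('visual_encoder.STAN.attn.weight', False),
              ParamInfo('visual_encoder.ln_post.weight', True),
              ParamInfo('text_encoder.proj.bias', False),
              ParamInfo('text_encoder.proj.weight', False)]:
        print(p.name, resolve_group(p))
```

Under these assumptions, matched STAN parameters train at 2e-06 × 10 = 2e-05 with weight decay 0.02, normalization and bias parameters get no weight decay, and everything else uses the base settings. Changing `weight_decay` from 0.05 to 0.02 therefore only affects the parameters that still receive weight decay.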
Retrieval results fluctuate relatively widely, and deviations within 1% of the reported results are considered normal. For instance, with this code, our reproduced STAN results are 0.5% lower than reported, while our reproduced Mug-Stan results are 1% higher than reported.
Hi, could you share the new training configs for MSRVTT? I ran the reimplemented code without modifying anything except setting batch=32 on each of 4 machines (a total batch size of 128). The best retrieval/R1 of 45.6000 was achieved at epoch 8, which is a little lower than the result reported in the paper.
I wonder whether the best result reported in the paper can still be achieved with the newly implemented code, or whether something is wrong in my MSRVTT config. Here is the config from my local project. Looking forward to your reply.
Best wishes!