ahaliassos / raven

Official implementation of RAVEn (ICLR 2023) and BRAVEn (ICASSP 2024)
MIT License
55 stars · 4 forks

The configuration file for the large model has incorrect parameters, which is causing the model to fail to load correctly. #5

Open jinchiniao opened 1 year ago

jinchiniao commented 1 year ago

Some parameters in the configuration file are inconsistent with the provided model weights. For example, in conf/model/visual_backbone/resnet_transformer_large.yaml (the audio backbone may have a similar issue), the decoder settings appear to be mismatched.

idim: 512
adim: 1024
aheads: 16
eunits: 4096
elayers: 24
transformer_frontend: conv3d
transformer_input_layer: vanilla_linear
dropout_rate: 0.1
transformer_attn_dropout_rate: 0.1
transformer_encoder_attn_layer_type: rel_mha
macaron_style: False
use_cnn_module: False
cnn_module_kernel: 31
zero_triu: False
a_upsample_ratio: 1
relu_type: swish
ddim: 256 #1024 maybe?
dheads: 4
dunits: 2048 #4096 maybe?
dlayers: 6 #9 maybe?
lsm_weight: 0.1
transformer_length_normalized_loss: False
rel_pos_type: latest
layerscale: True
init_values: 0.1
ff_bn_pre: True
post_norm: False
gamma_zero: False
gamma_init: 0.1
mask_init_type:
ctc_type: warpctc
drop_path: 0.1
mtlalpha: 0.1

This causes errors when running the related scripts:

size mismatch for decoder.decoders.5.self_attn.linear_q.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.self_attn.linear_k.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for decoder.decoders.5.self_attn.linear_k.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.self_attn.linear_v.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for decoder.decoders.5.self_attn.linear_v.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.self_attn.linear_out.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([256, 256]).
        size mismatch for decoder.decoders.5.self_attn.linear_out.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([256]).
....
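The failure mode above can be reproduced in miniature without torch: with ddim=256 the model builds 256-wide decoder projections, which cannot accept the 1024-wide tensors in the checkpoint. A sketch (the parameter names are taken from the error log; the shape rule is an assumption about how ddim is used, not the repo's actual module code):

```python
# Shapes reported in the error log for the large checkpoint.
ckpt_shapes = {
    "decoder.decoders.5.self_attn.linear_q.weight": (1024, 1024),
    "decoder.decoders.5.self_attn.linear_q.bias": (1024,),
}

def expected_shapes(ddim):
    # Assumed rule: the decoder attention projections are ddim x ddim,
    # so ddim=256 yields 256-wide tensors everywhere.
    return {
        "decoder.decoders.5.self_attn.linear_q.weight": (ddim, ddim),
        "decoder.decoders.5.self_attn.linear_q.bias": (ddim,),
    }

def mismatches(ckpt, ddim):
    # Mimics the shape check load_state_dict performs before copying.
    exp = expected_shapes(ddim)
    return [k for k, shape in ckpt.items() if exp.get(k) != shape]

print(mismatches(ckpt_shapes, 256))   # both tensors mismatch, as in the log
print(mismatches(ckpt_shapes, 1024))  # no mismatches once ddim matches
```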
jinchiniao commented 1 year ago

It should also be possible to work around this by overriding the parameters in the bash script, as follows:

 # old (incorrect) config defaults:
 #   model.visual_backbone.ddim=256
 #   model.visual_backbone.dheads=4
 #   model.visual_backbone.dunits=2048
 #   model.visual_backbone.dlayers=6
 python raven/test.py \
    data.modality=video \
    data/dataset=lrs3 \
    experiment_name=vsr_prelrs3vox2_large_ftlrs3vox2_selftrain_lm_test \
    model/visual_backbone=resnet_transformer_large \
    model.visual_backbone.ddim=1024 \
    model.visual_backbone.dheads=4 \
    model.visual_backbone.dunits=4096 \
    model.visual_backbone.dlayers=9 \
    model.pretrained_model_path=ckpts/vsr_prelrs3vox2_large_ftlrs3vox2_selftrain.pth \
    decode.lm_weight=0.2 \
    model.pretrained_lm_path=ckpts/language_model/rnnlm.model.best
ahaliassos commented 1 year ago

Hi, thanks for spotting this!

You are right, the parameters were wrong. In general, the width of the decoder should match that of the encoder, except in the low-resource setting, where the decoder width is smaller to avoid overfitting. I have updated the model configurations as well as the scripts to reflect that.

Please let me know if it still doesn't work.
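The rule described above can be sketched as a small helper (a sketch under the thread's stated assumptions; the 256-wide low-resource decoder and 1024-wide large encoder come from this discussion, not from the repo's actual config logic):

```python
def decoder_width(encoder_width, low_resource=False, low_resource_width=256):
    """Decoder width matches the encoder width, except in the
    low-resource setting, where a narrower decoder avoids overfitting."""
    return low_resource_width if low_resource else encoder_width

# Large model (adim=1024): the decoder should also be 1024 wide.
print(decoder_width(1024))                     # 1024
# Low-resource fine-tuning keeps the decoder narrow.
print(decoder_width(1024, low_resource=True))  # 256
```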

jinchiniao commented 1 year ago

In the README table, the row corresponding to scripts/vsr/lrs3_trainval/large_lrs3vox2.sh has a minor naming error in the uploaded weights: Large | LRS3+Vox2-en | 32.5 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh

The uploaded weights file is named vsr_lrs3vox2_large_lrs3trainval.pth; it should be vsr_prelrs3vox2_large_ftlrs3trainval.pth.

ahaliassos commented 1 year ago

Fixed, thank you.

jinchiniao commented 1 year ago

When I run scripts/testing/vsr/lrs3/base_lrs3vox2.sh, an error occurs:

        size mismatch for decoder.decoders.5.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.norm3.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.decoders.5.norm3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.after_norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.after_norm.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([256]).
        size mismatch for decoder.output_layer.weight: copying a param with shape torch.Size([1049, 512]) from checkpoint, the shape in current model is torch.Size([1049, 256]).

Maybe you uploaded the wrong version of the weights?

When I run scripts/testing/vsr/lrs3/base_lrs3.sh, a different error occurs:

Error executing job with overrides: ['data.modality=video', 'data/dataset=lrs3', 'experiment_name=vsr_prelrs3vox2_base_ftlrs3_test', 'model/visual_backbone=resnet_transformer_base', 'model.pretrained_model_path=ckpts/vsr_prelrs3vox2_base_ftlrs3.pth']
Traceback (most recent call last):
  File "test.py", line 35, in main
    learner = Learner(cfg)
  File "/home/luosongtao/code/raven/finetune_learner.py", line 25, in __init__
    self.model = self.load_model()
  File "/home/luosongtao/code/raven/finetune_learner.py", line 42, in load_model
    ckpt = torch.load(
  File "/home/luosongtao/miniconda3/envs/byolav/lib/python3.8/site-packages/torch/serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/luosongtao/miniconda3/envs/byolav/lib/python3.8/site-packages/torch/serialization.py", line 905, in _legacy_load
    return legacy_load(f)
  File "/home/luosongtao/miniconda3/envs/byolav/lib/python3.8/site-packages/torch/serialization.py", line 802, in legacy_load
    tar.extract('storages', path=tmpdir)
  File "/home/luosongtao/miniconda3/envs/byolav/lib/python3.8/tarfile.py", line 2060, in extract
    tarinfo = self.getmember(member)
  File "/home/luosongtao/miniconda3/envs/byolav/lib/python3.8/tarfile.py", line 1782, in getmember
    raise KeyError("filename %r not found" % name)
KeyError: "filename 'storages' not found"

However, scripts/testing/vsr/lrs3_trainval/base_lrs3.sh and scripts/testing/vsr/lrs3_trainval/base_lrs3vox2.sh run successfully, so perhaps something is wrong with the weights you uploaded?
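One way to pre-check a downloaded file for this kind of corruption (a sketch, assuming the released weights use torch's modern zip-based serialization format; a truncated download fails the zip check, and torch then falls back to the legacy tar path, producing the KeyError above):

```python
import zipfile

def checkpoint_download_ok(path):
    # torch >= 1.6 saves checkpoints as zip archives by default, so a
    # .pth file that is not a valid zip is most likely a truncated or
    # corrupted download and should be fetched again.
    return zipfile.is_zipfile(path)
```

If this returns False for a freshly downloaded checkpoint, re-downloading it is usually the fix.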

ahaliassos commented 1 year ago

The high-resource base LRS3+Vox2 weights file was corrupted, but I have now uploaded it again. I also fixed the config for the visual backbone (which was accidentally not pushed to the repo earlier).

Please let me know whether it works or not now :)

jinchiniao commented 1 year ago

The high-resource base LRS3+Vox2 weights file for ASR (scripts/asr/lrs3/base_lrs3vox2.sh) requires authorization to download. I think the sharing permissions may have been set incorrectly.