Closed: youcaiSUN closed this issue 2 years ago.
Which specific pre-trained model are you looking for? One already fine-tuned on the emotion classification objective, or one trained with only the pre-training objective (say, by running the further pre-training on IEMOCAP)? Currently I have models trained on specific splits, for instance, trained on Sessions 1~4 and tested on Session 5, but I don't have a model trained on the whole of IEMOCAP. What is your use case?
Sorry for the misunderstanding; I mean the one with only the pre-training objective. I pre-trained a model using TAPT, but the average UAR is just ~71% (a ~2% gap), so I want to use your pre-trained models in my downstream tasks. Besides, do I need to pre-train a model for each session and then fine-tune it on that session, or pre-train only once and fine-tune it on all sessions (since pre-training is time-consuming)?
For the purpose of the research experiments, I have to run it for each session separately, to make sure I don't have access to the test set even during the pretraining phase. So the numbers you saw come from running TAPT for each split separately rather than from a single model.
If you have your own dataset, it would be better to just run the continued-pretraining procedure (TAPT) on your own dataset instead of IEMOCAP. Or simply run TAPT on all the data, including IEMOCAP...
The checkpoints I currently have in hand are the pretrained models for each session after TAPT. Are you requesting all of them (5 checkpoints)?
Great, I see!
Yes, I'd like all of them. And could you please upload all your pre-trained models (TAPT and P-TAPT) to an online drive (maybe Google Drive or Baidu) so that the community can use them in their own research?
Thanks very much!
Here you go. https://cmu.box.com/s/yitaa74udy23vqubm9ox2pyf2kb3sg5n
But I didn't double-check the checkpoints for you. (The experiments were done a while ago.)
OK! I'll try it.
thanks!!!
Dear Liwei @b04901014,
I fine-tuned your pre-trained TAPT and P-TAPT models on IEMOCAP. Since my GPU has only 32GB of memory, which cannot accommodate a batch size of 64, I slightly modified the trainer settings in `run_downstream_custom_multiple_fold.py` (i.e., I accumulate gradients every 2 batches with a batch size of 32), as shown below:
```python
trainer = Trainer(
    precision=args.precision,
    amp_backend='native',
    callbacks=[checkpoint_callback] if hasattr(model, 'valid_met') else None,
    checkpoint_callback=hasattr(model, 'valid_met'),
    resume_from_checkpoint=None,
    check_val_every_n_epoch=1,
    # keep the effective batch size at 64, e.g. 2 accumulation steps for batch_size=32
    accumulate_grad_batches=64 // args.batch_size,
    max_epochs=hparams.max_epochs,
    num_sanity_val_steps=2 if hasattr(model, 'valid_met') else 0,
    gpus=1,
    logger=False
)
```
Besides, I modified the checkpoint-loading part in `downstream/Custom/trainer.py` to make sure that in every fold the model is initialized from the corresponding checkpoint:
```python
if self.hp.pretrained_path is not None:
    import glob
    # if pretrained_path is a directory holding one checkpoint per fold,
    # pick the checkpoint that matches the current fold
    ckpt_files = sorted(glob.glob(os.path.join(self.hp.pretrained_path, '*.ckpt')))
    if len(ckpt_files) == self.hp.nfolds:
        pretrained_path = ckpt_files[self.hp.ifold]
    else:
        pretrained_path = self.hp.pretrained_path
    assert pretrained_path.endswith('.ckpt')
    print(f"==> Fold {self.hp.ifold + 1} use pre-trained model '{pretrained_path}'")
    self.model = PretrainedRNNHead.load_from_checkpoint(pretrained_path, strict=False,
                                                        n_classes=self.dataset.nemos,
                                                        backend=self.hp.model_type)
else:
    self.model = PretrainedRNNHead(n_classes=self.dataset.nemos,
                                   backend=self.hp.model_type)
```
The average results (five experiments for each session) are as follows:
Method | Stat Func | UAR | WAR | macroF1 | microF1 |
---|---|---|---|---|---|
PTAPT | mean | 71.6207 | 70.5469 | 70.9655 | 72.0825 |
PTAPT | std | 0.8163 | 0.9130 | 0.9821 | 0.6793 |
TAPT | mean | 71.4809 | 70.0057 | 70.3757 | 71.8254 |
TAPT | std | 0.5025 | 0.4757 | 0.3672 | 0.1165 |
TAPT* | mean | 71.4191 | 70.1220 | 70.4931 | 71.7437 |
TAPT* | std | 1.1267 | 1.1062 | 1.2942 | 0.8730 |
*: using my own pre-trained model with the default TAPT setting in README.md, except that `training_step=16000` with a batch size of 32 (about 1000 epochs on the IEMOCAP Session 1 training set); in the fine-tuning stage it is used for all sessions.
From the above table, the three kinds of models achieve similar performance. Since these results show a gap relative to those reported in the paper, could you help me figure out what's wrong with my fine-tuning process?
Best, Licai
Hmm, I'm a little confused. Are you running n folds with a different session as the test set each time? Or is the performance you report on a specific split (say, all tested on Session 1)?
I mean, if you are reporting only the Session 1 performance (running Session 1 five times), you should be using only the Session 1 checkpoint. If you did one run over the 5 splits (testing on Sessions 1~5), the std should be much larger, since the performance differs a lot across splits.
Sorry, the results I reported are the mean/std over the five sessions (and I ran each session five times). I use the following command (taking TAPT as an example) for fine-tuning:
```bash
CUDA_VISIBLE_DEVICES=0 python run_downstream_custom_multiple_fold.py --precision 16 \
    --batch_size 32 \
    --num_exps 5 \
    --saving_path ./saved/downstream/TAPT/author \
    --pretrained_path ./saved/pretrain/TAPT/author
```
Besides, the std in the above table is calculated as `np.std(np.mean(metrics, 2), 1)` instead of the default fold std `np.mean(np.std(metrics, 1))`. The results with the fold std are below (much larger than the std, as you said; a small numpy sketch after the table illustrates the difference between the two statistics):

Method | Stat Func | UAR | WAR | macroF1 | microF1 |
---|---|---|---|---|---|
PTAPT | mean | 71.6207 | 70.5469 | 70.9655 | 72.0825 |
PTAPT | std | 0.8163 | 0.9130 | 0.9821 | 0.6793 |
PTAPT | fold std | 3.0007 | 2.2742 | 2.1910 | 2.2853 |
TAPT | mean | 71.4809 | 70.0057 | 70.3757 | 71.8254 |
TAPT | std | 0.5025 | 0.4757 | 0.3672 | 0.1165 |
TAPT | fold std | 3.9209 | 3.7091 | 3.7024 | 3.2703 |
TAPT* | mean | 71.4191 | 70.1220 | 70.4931 | 71.7437 |
TAPT* | std | 1.1267 | 1.1062 | 1.2942 | 0.8730 |
TAPT* | fold std | 3.6798 | 3.2672 | 3.2352 | 2.9478 |
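For clarity, here is a minimal numpy sketch of the difference between the two statistics. The scores are random placeholders, and the (metric × experiment × fold) array layout is assumed for illustration only, so the axis indices may differ from the exact snippets quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder scores: metrics[m, e, f] = metric m of experiment e on fold f
# (4 metrics x 5 experiments x 5 folds); not real results.
metrics = 70 + 3 * rng.standard_normal((4, 5, 5))

# "std": average each experiment over its 5 folds first, then take the std
# across the 5 experiments -- run-to-run variation of the 5-fold mean.
std = np.std(np.mean(metrics, 2), 1)

# "fold std": std across the 5 folds, averaged over the experiments --
# split-to-split variation, which is much larger on IEMOCAP because the
# sessions differ a lot.
fold_std = np.mean(np.std(metrics, 2), 1)

print(std)       # smaller: the 5-fold mean is fairly stable across runs
print(fold_std)  # larger: individual folds spread widely
```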
Could you check whether the five label files in your `labeldir` correspond to the 5 splits? It looks like you modified the code to hardcode the `labeldir`; originally it is passed as a command-line argument.
Also, if you use `accumulate_grad_batches` to accumulate gradients every two steps, then you should double `max_epochs`, otherwise you are using only half the training steps.
BTW, you can free up more memory if you set https://github.com/b04901014/FT-w2v2-ser/blob/main/modules/FeatureFuser.py#L103 to `True`.
I don't think the performance should differ this much, though.
The std is the std, across the five experiments, of the mean over the five sessions.
The five label files in `labeldir` are generated using your preprocessing code and can be downloaded from labels_sess.zip. I checked the label files: the test split for each fold only includes the samples in the corresponding session.
The original `pretrained_path` argument in downstream fine-tuning is the path to a specific checkpoint file. As I said above, to make sure that each session uses the corresponding pre-trained checkpoint, which has not 'seen' the test set of that session, I modified the checkpoint-loading part.
Hmm. Then could you set the `gradient_checkpointing` in https://github.com/b04901014/FT-w2v2-ser/blob/main/modules/FeatureFuser.py#L103 to `True` and re-run with `batch_size=64` without accumulating gradients?
With gradients accumulated every 2 steps, you are essentially running only 7.5 epochs instead of 15. Otherwise, I cannot find anything odd about your experiment at the moment.
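For context, gradient checkpointing trades compute for memory by recomputing activations during the backward pass. Below is a minimal standalone sketch using the Hugging Face `transformers` API (`gradient_checkpointing_enable()` is the method in recent releases); the repo itself toggles this via the FeatureFuser.py line linked above, so this snippet is illustrative rather than the repo's code.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
# Recompute encoder activations in backward instead of storing them,
# trading some extra compute for a much smaller memory footprint.
model.gradient_checkpointing_enable()
model.train()

# Dummy batch of eight 1-second clips at 16 kHz; with checkpointing on,
# larger batches (e.g. batch_size=64 downstream) fit in the same memory.
wav = torch.randn(8, 16000)
hidden = model(wav).last_hidden_state  # (8, num_frames, 768)
hidden.mean().backward()
```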
Thanks very much!
I believe that training with a batch size of 32 and `accumulate_grad_batches=2` is equivalent to training with a batch size of 64. Nevertheless, I will rerun the code with a batch size of 64 using distributed parallel training to examine the influence of different batch sizes.
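A minimal, self-contained PyTorch sketch (not code from this repo) of why the two settings should match: scaling each mini-batch loss by 1/accumulation_steps makes two backward passes at batch size 32 produce the same gradient as one pass at batch size 64. PyTorch Lightning's `accumulate_grad_batches` applies this scaling internally.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
data, target = torch.randn(64, 10), torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()

# One optimizer step at batch size 64.
model.zero_grad()
loss_fn(model(data), target).backward()
g_full = model.weight.grad.clone()

# Same step at batch size 32 with 2 accumulation steps: divide each
# mini-batch loss by the number of accumulation steps before backward.
model.zero_grad()
for chunk, tgt in zip(data.split(32), target.split(32)):
    (loss_fn(model(chunk), tgt) / 2).backward()
g_accum = model.weight.grad.clone()

print(torch.allclose(g_full, g_accum, atol=1e-5))  # True
```

The equivalence holds for per-sample losses averaged over the batch; batch-dependent layers (e.g. batch norm) or dataloader shuffling can still introduce small differences.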
NVM, I think you are right; they should be equivalent. Then I actually don't know where the issue could be. Do you mind uploading your whole code as a zip file, without the checkpoints? Maybe I can look into it.
My pleasure! The code is here: FT-w2v2-ser.zip. I use the shell scripts in the `scripts` folder to pretrain and fine-tune models on IEMOCAP. The results are saved in the `saved` folder (detailed fine-tuning results are also included in the subfolders). Any questions are welcome.
I have seen several numbers below 76% for Session 2 (Fold 2) in individual runs of TAPT and PTAPT, which I have almost never seen throughout my experiments. There is definitely something wrong; I will look into it.
Thanks very much! I'm looking forward to your message.
I just ran your code on Session 2 only (one fold); here is what I got:
There is a bad 76 run there, but most of them are around 78~81%. Compared to what you got in your run:
I think the difference is too large for the two sets of results to be drawn from the same distribution. The only code I changed is the path in the script to:
```bash
CUDA_VISIBLE_DEVICES=3 python run_downstream_custom_multiple_fold.py --precision 16 \
    --batch_size 32 \
    --num_exps 5 \
    --saving_path /tmp \
    --pretrained_path /usr1/liweiche/exps_new/IEMOCAP/checkpoints/session2.ckpt \
    --datadir ../FT-w2v2-ser/Dataset/IEMOCAP/Audio_16k/ \
    --labeldir /usr1/liweiche/exps_new/IEMOCAP/session2/labels/
```
And I changed "/share/home/sunlicai/transformers/wav2vec2-base" to "facebook/wav2vec2-base". I'm not sure how you got your local pretrained model there, but as long as its config.json is the same as https://huggingface.co/facebook/wav2vec2-base/blob/main/config.json it should be fine. (The dropout ratios and some other hyper-parameters depend on it.)
I did not change anything else, including the gradient accumulation and the seed. If the pretrained model config is not the problem, then the only things I can think of are the environment (are you using the same package versions as in the Dockerfile, especially the huggingface version?) or a bug during the transition between folds...
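A quick way to check the config difference, sketched below with the `transformers` API; the local path is the machine-specific one from this thread.

```python
from transformers import Wav2Vec2Config

# Compare the local config snapshot against the current hub version.
local = Wav2Vec2Config.from_pretrained(
    "/share/home/sunlicai/transformers/wav2vec2-base").to_dict()
hub = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base").to_dict()

# Print every field that differs (e.g. dropout ratios, `architectures`).
for key in sorted(set(local) | set(hub)):
    if local.get(key) != hub.get(key):
        print(f"{key}: local={local.get(key)!r} hub={hub.get(key)!r}")
```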
Many thanks for your helpful debugging. I re-ran the code with `batch_size=64`, following your suggestion to set `gradient_checkpointing` to `True`.
The results of PTAPT are as follows:

Exp | Fold | UAR | WAR | macroF1 | microF1 |
---|---|---|---|---|---|
1 | 1 | 69.8167 | 64.3318 | 64.7215 | 69.2565 |
1 | 2 | 77.6739 | 72.9228 | 74.4119 | 76.8706 |
1 | 3 | 69.3841 | 68.5491 | 69.1264 | 69.2956 |
1 | 4 | 73.4489 | 73.9088 | 73.5648 | 73.8406 |
1 | 5 | 75.2707 | 73.1668 | 73.5307 | 73.9309 |
2 | 1 | 73.7616 | 70.4147 | 71.3115 | 72.6207 |
2 | 2 | 80.7606 | 78.0059 | 78.9644 | 79.5461 |
2 | 3 | 69.4304 | 69.9392 | 70.2692 | 71.0953 |
2 | 4 | 69.5929 | 69.7381 | 69.5926 | 71.0474 |
2 | 5 | 73.5148 | 71.8775 | 72.7360 | 73.4933 |
3 | 1 | 72.2615 | 69.3088 | 70.2676 | 71.2854 |
3 | 2 | 74.9120 | 73.0205 | 74.0394 | 75.0084 |
3 | 3 | 70.9914 | 70.4605 | 70.7650 | 70.9058 |
3 | 4 | 73.8345 | 75.3637 | 74.4753 | 74.9603 |
3 | 5 | 72.3662 | 71.9581 | 72.1026 | 72.1454 |
4 | 1 | 70.7526 | 66.4516 | 67.6090 | 70.4535 |
4 | 2 | 75.6213 | 72.8250 | 74.1808 | 76.2389 |
4 | 3 | 67.9802 | 67.8540 | 68.6463 | 69.5036 |
4 | 4 | 72.2317 | 71.2900 | 71.3094 | 72.1324 |
4 | 5 | 72.0067 | 70.5077 | 70.3071 | 71.1857 |
5 | 1 | 71.8092 | 70.2304 | 71.0657 | 71.4374 |
5 | 2 | 79.5007 | 76.8328 | 77.2310 | 77.9573 |
5 | 3 | 71.6541 | 71.5899 | 71.9562 | 72.1773 |
5 | 4 | 74.7210 | 75.1697 | 75.1347 | 75.2602 |
5 | 5 | 73.9711 | 73.3280 | 74.1058 | 74.2059 |
Stat Func | mean | 73.0907 | 71.5618 | 72.0570 | 73.0342 |
Stat Func | std | 0.8454 | 1.2663 | 1.2180 | 0.7900 |
Stat Func | fold_std | 2.7925 | 2.7066 | 2.6447 | 2.5047 |
The results of TAPT are as follows:

Exp | Fold | UAR | WAR | macroF1 | microF1 |
---|---|---|---|---|---|
1 | 1 | 72.4140 | 69.2166 | 69.9198 | 70.8831 |
1 | 2 | 78.8307 | 77.1261 | 77.6667 | 78.5128 |
1 | 3 | 67.4721 | 67.0721 | 67.1881 | 67.2617 |
1 | 4 | 76.4685 | 76.3337 | 76.1041 | 76.2746 |
1 | 5 | 73.2559 | 71.3940 | 71.6633 | 72.2239 |
2 | 1 | 70.3762 | 64.3318 | 65.6699 | 70.2334 |
2 | 2 | 77.3431 | 75.8553 | 77.0228 | 77.0488 |
2 | 3 | 65.5209 | 64.5526 | 64.9570 | 65.2109 |
2 | 4 | 71.1131 | 70.9990 | 71.3480 | 72.9595 |
2 | 5 | 74.7123 | 72.2804 | 73.2582 | 74.5627 |
3 | 1 | 68.2955 | 62.0277 | 62.8583 | 68.3240 |
3 | 2 | 76.2815 | 74.6823 | 75.9064 | 76.6184 |
3 | 3 | 64.3476 | 64.3788 | 64.0973 | 65.9882 |
3 | 4 | 73.1064 | 71.8720 | 71.5928 | 72.7076 |
3 | 5 | 68.1130 | 69.1378 | 69.4952 | 70.0967 |
4 | 1 | 72.1599 | 68.6636 | 69.4446 | 70.8491 |
4 | 2 | 74.9441 | 73.4115 | 74.3779 | 76.2800 |
4 | 3 | 68.5871 | 68.2016 | 68.4900 | 68.6162 |
4 | 4 | 76.4899 | 76.6246 | 76.5327 | 76.5754 |
4 | 5 | 72.2048 | 69.7824 | 70.3533 | 71.4640 |
5 | 1 | 70.0527 | 64.8848 | 65.8348 | 69.6342 |
5 | 2 | 76.0400 | 71.0655 | 71.6846 | 74.8651 |
5 | 3 | 67.6256 | 67.6803 | 67.8072 | 67.9602 |
5 | 4 | 72.0541 | 73.9088 | 73.3376 | 74.2588 |
5 | 5 | 73.3085 | 72.2804 | 72.3810 | 72.6330 |
Stat Func | mean | 72.0447 | 70.3105 | 70.7597 | 72.0817 |
Stat Func | std | 1.2300 | 1.3362 | 1.3040 | 0.7987 |
Stat Func | fold_std | 3.5281 | 3.9150 | 3.8503 | 3.5022 |
The results of PTAPT and TAPT are better than the previous ones, especially PTAPT. It seems that using smaller batches with gradient accumulation here is not fully equivalent to using larger batches without it. However, there is still a ~1.5% performance gap.
I manually installed most packages (except fairseq, which is not currently used) with the versions specified in your Dockerfile; you can check env.txt.
My local pre-trained wav2vec 2.0 base model was downloaded from the official website about seven months ago. Checking my local config.json, I found it is not the same as the latest one (updated six months ago). One noticeable difference is the `architectures` field (old 'Wav2Vec2Model' vs. new 'Wav2Vec2ForPreTraining'). I also noticed the warning 'Found keys that are not in the model state dict but in the checkpoint' when loading the pretrained checkpoint:
```
==> Fold 1 use pre-trained model './saved/pretrain/TAPT/author/session1.ckpt'
Some weights of Wav2Vec2ForPreTraining were not initialized from the model checkpoint at /share/home/sunlicai/transformers/wav2vec2-base and are newly initialized: ['project_hid.bias', 'project_q.weight', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'quantizer.weight_proj.bias', 'project_hid.weight', 'project_q.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/share/home/sunlicai/anaconda3/envs/pytorch1.9/lib/python3.8/site-packages/pytorch_lightning/core/saving.py:205: UserWarning: Found keys that are in the model state dict but not in the checkpoint: ['rnn_head.weight_ih_l0', 'rnn_head.weight_hh_l0', 'rnn_head.bias_ih_l0', 'rnn_head.bias_hh_l0', 'rnn_head.weight_ih_l0_reverse', 'rnn_head.weight_hh_l0_reverse', 'rnn_head.bias_ih_l0_reverse', 'rnn_head.bias_hh_l0_reverse', 'linear_head.1.weight', 'linear_head.1.bias']
rank_zero_warn(
/share/home/sunlicai/anaconda3/envs/pytorch1.9/lib/python3.8/site-packages/pytorch_lightning/core/saving.py:209: UserWarning: Found keys that are not in the model state dict but in the checkpoint: ['wav2vec2.wav2vec2PT.wav2vec2.masked_spec_embed', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.0.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.0.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.0.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.1.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.2.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.3.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.4.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.5.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_extractor.conv_layers.6.conv.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_projection.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_projection.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.feature_projection.projection.weight', 'wav2vec2.wav2vec2PT.wav2vec2.feature_projection.projection.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.pos_conv_embed.conv.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.0.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.feed_forward.intermediate_dense.weight', 
'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.1.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.2.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.3.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.layer_norm.bias', 
'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.4.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.5.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.6.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.layer_norm.weight', 
'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.7.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.8.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.9.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.attention.out_proj.bias', 
'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.10.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.k_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.k_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.v_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.v_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.q_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.q_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.out_proj.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.attention.out_proj.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.layer_norm.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.feed_forward.intermediate_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.feed_forward.intermediate_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.feed_forward.output_dense.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.feed_forward.output_dense.bias', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.final_layer_norm.weight', 'wav2vec2.wav2vec2PT.wav2vec2.encoder.layers.11.final_layer_norm.bias', 'wav2vec2.wav2vec2PT.quantizer.codevectors', 'wav2vec2.wav2vec2PT.quantizer.weight_proj.weight', 'wav2vec2.wav2vec2PT.quantizer.weight_proj.bias', 'wav2vec2.wav2vec2PT.project_q.weight', 'wav2vec2.wav2vec2PT.project_q.bias', 'wav2vec2.wav2vec2PT.project_hid.weight', 'wav2vec2.wav2vec2PT.project_hid.bias']
rank_zero_warn(
Weigh losses by prior distribution of each class: tensor([0.2906, 0.2043, 0.2069, 0.2983]).
Using native 16bit precision.
```
Maybe that's the culprit. I'll fix the config issue by using the default `facebook/wav2vec2-base` instead of my local one and re-run the code. Hope it helps!
Glad that you are getting better results in your reproduction, though I'm not sure why there is a difference between `batch_size=64` and the gradient accumulation setup you ran earlier... On my machine they gave about the same results. Really weird!
I re-ran the code using `facebook/wav2vec2-base` with `batch_size=64`. However, the results are even worse than those with my local config. I am totally lost!!!
Method | Stat Func | UAR | WAR | macroF1 | microF1 |
---|---|---|---|---|---|
PTAPT | mean | 72.2857 | 70.5583 | 71.0144 | 72.3176 |
PTAPT | std | 0.6220 | 0.4712 | 0.4265 | 0.4557 |
PTAPT | fold_std | 2.6346 | 2.2990 | 2.3078 | 2.1766 |
TAPT | mean | 71.6164 | 69.7006 | 70.0368 | 71.5589 |
TAPT | std | 0.8966 | 0.7048 | 0.7180 | 0.3754 |
TAPT | fold_std | 4.2064 | 3.7524 | 3.9618 | 3.4980 |
Hmm, it's really hard for me to debug this... I think you fixed your seed, and I don't believe there is a real difference between these three models. One could attribute some variance to averaging over only 5 runs, but not this much. That also would not explain why the steady gap between PTAPT and TAPT persists in every run, and I did not observe such high variance in my local runs; all of this suggests that some factor did change across your three experiments. Additionally, since I didn't fix the seed, getting above 73% may be reasonable enough for PTAPT.
Thank you for bringing this up, but I'm afraid I can't help more with this at the moment. For one thing, I don't really know what's happening here; for another, I'm occupied with another research project recently.
This thread is getting too lengthy, so I'll close the issue. Feel free to open a new one if you have questions.
Thanks very much!