XinhaoMei / DCASE2021_task6_v2

Code for CVSSP submission to DCASE 2021 Task 6

Difference with ACT #5

Closed Ynjxsjmh closed 2 years ago

Ynjxsjmh commented 2 years ago

Hi,

In my opinion, the model structure in this repo is the same as your ACT model: both use a full Transformer network based on an encoder-decoder architecture.

This paper further uses transfer learning and reinforcement learning to fine-tune the parameters, so why are its results worse than ACT's?

Take the following two models as an example:

| Model | BLEU1 | BLEU2 | BLEU3 | BLEU4 | ROUGE-L | METEOR | CIDEr | SPICE | SPIDEr |
|---|---|---|---|---|---|---|---|---|---|
| ACT_m_DeiT_AudioSet | 0.653 | 0.495 | 0.363 | 0.259 | 0.471 | 0.222 | 0.663 | 0.163 | 0.413 |
| B+PANNs+AC+RL | 0.634 | 0.423 | 0.288 | 0.185 | 0.410 | 0.187 | 0.476 | 0.134 | 0.305 |

XinhaoMei commented 2 years ago

Hi, thanks for your interest. First, the DCASE model is based on a CNN-Transformer architecture, not a full Transformer model like ACT. Second, they are evaluated on different datasets, so the scores are not directly comparable: ACT is evaluated on AudioCaps, and the DCASE model is evaluated on Clotho.

Ynjxsjmh commented 2 years ago

Thanks for your reply, I understand now. Some DCASE models are pretrained on AudioCaps, and all are evaluated on Clotho.
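
For anyone comparing the two rows quoted in this thread, here is a minimal sketch of the two points made above. It assumes the usual definition of SPIDEr as the arithmetic mean of CIDEr and SPICE, and it uses the evaluation datasets named in the reply (AudioCaps for ACT, Clotho for the DCASE model). The names `ROWS`, `spider`, and `comparable` are hypothetical helpers written for this illustration; they are not part of this repo or of any evaluation toolkit.

```python
# Metric values copied from the two rows quoted in this thread.
# The "dataset" field follows the maintainer's reply; it is not in the table itself.
ROWS = {
    "ACT_m_DeiT_AudioSet": {
        "dataset": "AudioCaps", "CIDEr": 0.663, "SPICE": 0.163, "SPIDEr": 0.413,
    },
    "B+PANNs+AC+RL": {
        "dataset": "Clotho", "CIDEr": 0.476, "SPICE": 0.134, "SPIDEr": 0.305,
    },
}


def spider(cider: float, spice: float) -> float:
    """Assumed definition: SPIDEr is the mean of CIDEr and SPICE."""
    return (cider + spice) / 2


def comparable(a: str, b: str) -> bool:
    """Two rows are only directly comparable if evaluated on the same dataset."""
    return ROWS[a]["dataset"] == ROWS[b]["dataset"]


# Check that each reported SPIDEr matches the mean, up to rounding in the table.
consistent = {
    name: abs(spider(r["CIDEr"], r["SPICE"]) - r["SPIDEr"]) < 1e-3
    for name, r in ROWS.items()
}
```

With these values, `consistent` marks both rows as internally consistent, while `comparable("ACT_m_DeiT_AudioSet", "B+PANNs+AC+RL")` is false, which is why the gap between the two rows says little about the architectures themselves.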