OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

"AssertionError: lambda_coverage != 0.0 requires coverage attention" when coverage_attn is already set in the parameter #1908

Closed · charlenellll closed this issue 1 year ago

charlenellll commented 3 years ago

Hi, I am using OpenNMT-py==2.0.0rc1 and tried to use the coverage mechanism together with the copy mechanism. I tried adding coverage_attn: 'true' and lambda_coverage: 0.2 to my config YAML file, and also appending -coverage_attn -lambda_coverage 0.2 to the training command, but either way leads to this error:

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/utils/distributed.py", line 209, in consumer
    batch_queue=batch_queue, semaphore=semaphore)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/train_single.py", line 107, in main
    valid_steps=opt.valid_steps)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/trainer.py", line 244, in train
    report_stats)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/trainer.py", line 379, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/utils/loss.py", line 160, in __call__
    shard_state = self._make_shard_state(batch, output, trunc_range, attns)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/modules/copy_generator.py", line 196, in _make_shard_state
    batch, output, range_, attns)
  File "/home/admin/.local/lib/python3.6/site-packages/onmt/utils/loss.py", line 246, in _make_shard_state
    assert coverage is not None, "lambda_coverage != 0.0 requires " \
AssertionError: lambda_coverage != 0.0 requires coverage attention
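
For reference, here is a minimal paraphrase of the guard that raises this error, based on the assertion in onmt/utils/loss.py shown in the traceback; the surrounding names and structure are illustrative, not the exact OpenNMT-py source:

# Illustrative sketch, not the actual OpenNMT-py code: when lambda_coverage != 0.0
# the loss expects the decoder's attention dict to contain a "coverage" entry,
# but the Transformer decoder never returns one, so the assertion fires even
# though coverage_attn is set in the config.
def _make_shard_state(batch, output, range_, attns, lambda_coverage=0.2):
    shard_state = {"output": output}
    if lambda_coverage != 0.0:
        coverage = attns.get("coverage")  # None for Transformer decoders
        assert coverage is not None, \
            "lambda_coverage != 0.0 requires coverage attention"
        shard_state["coverage_attn"] = coverage
    return shard_state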

My full config is as follows; I would be grateful if you could tell me where the problem is.

# Data config
data: 
    corpus:
        path_src: dataset/v5/train_feature.txt
        path_tgt: dataset/v5/train_label.txt
    valid:
        path_src: dataset/v5/valid_feature.txt
        path_tgt: dataset/v5/valid_label.txt

src_vocab: dataset/v5/extend_vocab2.txt
tgt_vocab: dataset/v5/extend_vocab2.txt

save_model: model/2-trm-copy-cov-lambda/model
tensorboard_log_dir: model/2-trm-copy-cov-lambda/

# General opts
save_checkpoint_steps: 1000
keep_checkpoint: 25
seed: 777
train_steps: 40000
valid_steps: 1000
warmup_steps: 8000
report_every: 100
early_stopping: 15
tensorboard: 'true'

# Optimization
accum_count: 2
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
param_init: 0.0
param_init_glorot: 'true'

# Batching
batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1
max_generator_batches: 2

# Model
decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8
position_encoding: 'true'

# Copy mechanism
copy_attn: 'true'
global_attention: mlp
reuse_copy_attn: 'true'
bridge: 'true'
coverage_attn: 'true'
lambda_coverage: 0.2

# Train on multiple gpus
world_size: 2
gpu_ranks:
- 0
- 1
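
For reference, both attempts amount to roughly the following invocations (assuming the config above is saved as config.yaml; onmt_train is the OpenNMT-py 2.x entry point, and command-line flags override the YAML values):

# train from the YAML config above
onmt_train -config config.yaml
# or override the coverage options on the command line
onmt_train -config config.yaml -coverage_attn -lambda_coverage 0.2
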
charlenellll commented 3 years ago

I think this may be related to the initialization of coverage attention. I also tried adding train_from: model/1-trm-copy-quickstart/model_step_25000.pt to the config file to train from an existing checkpoint, and the problem still exists.

Or maybe there is something wrong with my configuration. In that case, I hope there could be some guidance on how to configure the model to use the coverage mechanism correctly.

Please let me know if you have any idea about this, thanks!

francoishernandez commented 3 years ago

@pltrdy any idea on this one?

pltrdy commented 3 years ago

Actually, the coverage mechanism isn't implemented for Transformer decoders. Coverage comes from See et al. (2017), which is based on RNNs (an LSTM, actually) and therefore a single attention head.

It's not clear to me how such a coverage mechanism should behave in a multi-head setting like the Transformer decoder (i.e. compute coverage over all heads? over a single head? I'm not sure what to expect from that).

We should at least raise a clearer error in this case.

@charlenellll tell me if you have a precise idea on how the coverage should behave in this case.
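
For context, the single-head coverage from See et al. (2017) accumulates past attention and penalizes attending again to already-covered source positions; here is a minimal PyTorch sketch of that loss, independent of the OpenNMT-py internals:

import torch

# Coverage as in See et al. (2017), defined for a single attention head.
# attn: (tgt_len, batch, src_len) attention distributions a^1 .. a^T.
def coverage_loss(attn, lambda_coverage=0.2):
    # c^t = sum of attention distributions over all previous decoding steps
    coverage = torch.cumsum(attn, dim=0) - attn
    # covloss_t = sum_i min(a_i^t, c_i^t); summed over steps, averaged over batch
    covloss = torch.min(attn, coverage).sum(dim=-1)
    return lambda_coverage * covloss.sum(dim=0).mean()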

charlenellll commented 3 years ago

@pltrdy I found a paper about using coverage with multi-head attention, in Section 3.4: https://iopscience.iop.org/article/10.1088/1742-6596/1453/1/012004/pdf But I'm not sure about its effect. Looking forward to your opinion.

pltrdy commented 3 years ago

Well, at least it gives some guidelines for implementing coverage in the Transformer. Feel free to implement this paper and open a PR; we would review it.

Results show some small improvements in terms of copy behavior and overall quality. However, they compare against seq2seq and a vanilla Transformer on Gigaword, which are not super strong baselines, to be honest.
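
One possible reading, for illustration only and not necessarily what the linked paper does: collapse the cross-attention heads into a single distribution and reuse the single-head penalty above.

import torch

# Hypothetical multi-head variant (an assumption, not the paper's exact method):
# reduce the per-head cross-attention to one distribution by averaging over heads,
# then apply the same cumulative-minimum coverage penalty as in the single-head case.
def multihead_coverage_loss(attn_heads, lambda_coverage=0.2):
    # attn_heads: (tgt_len, batch, heads, src_len) cross-attention weights
    attn = attn_heads.mean(dim=2)                    # (tgt_len, batch, src_len)
    coverage = torch.cumsum(attn, dim=0) - attn      # c^t = sum_{t'<t} a^{t'}
    covloss = torch.min(attn, coverage).sum(dim=-1)  # penalty per decoding step
    return lambda_coverage * covloss.sum(dim=0).mean()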

XiaoqingNLP commented 2 years ago

Hi @pltrdy, your consideration is the same as what I was thinking: can the replication results on the Transformer achieve positive gains without a coverage mechanism?

pltrdy commented 2 years ago

@Qnlp Absolutely. And better results as well. Transformers have many heads and have encoder self-attention, decoder self-attention AND cross-attention (instead of a single cross-attention layer as in RNNs), so they may generalize the concept of coverage without trouble.

No explicit coverage mechanism has been developed for Transformers largely because they do not really need it.