Closed — da03 closed this issue 4 years ago
Hi Yuntian,
For our own tensor2tensor implementation, we use the same config as the standard Transformer; we find that the standard Transformer hyperparameters just work well for this model.
For the fairseq implementation, I guess you can find the guides here:
https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation
Thank you for your reply! My questions are more about the optimization side, such as batch size and the learning-rate schedule (the fairseq instructions cover only WMT with 4 GPUs, not IWSLT, which I assume uses a single GPU). Also, unfortunately, the fairseq instructions cannot reproduce the WMT results reported in the paper (I need to double-check whether they implemented CRF or DCRF, but it seems to be a few BLEU points off even assuming it is CRF).
Hi Yuntian,
The original source code of structured-nat can be found at this link. However, it was written years ago in TensorFlow and has not been cleaned up, so I currently recommend using fairseq instead.
As you can see from the transformer_nat_crf.py file in this codebase, we use the following configs for NAT-CRF:
hparams.add_hparam("transition_factor_size", 32)
hparams.add_hparam("crf_beam_size", 64)
hparams.add_hparam("test_crf_beam_size", 32)
hparams.add_hparam("loss_multiplier", 0.5)
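To illustrate what two of these NAT-CRF hyper-parameters plausibly control: "transition_factor_size" (32) is presumably the rank of a low-rank factorization of the |V| x |V| CRF transition matrix (storing the full matrix for a real vocabulary would be prohibitively large), and "crf_beam_size" (64) restricts the CRF forward pass to the top-scoring candidate tokens per position. The sketch below is illustrative only — the function and variable names are my own assumptions, not the authors' actual code:

```python
import numpy as np

# Hypothetical sketch: low-rank CRF transitions plus per-position beam pruning.
# All names here are illustrative assumptions, not the structured-nat codebase.

def low_rank_transitions(vocab_size, transition_factor_size=32, seed=0):
    """Build a full transition matrix from two rank-limited factors."""
    rng = np.random.default_rng(seed)
    e1 = rng.standard_normal((vocab_size, transition_factor_size))
    e2 = rng.standard_normal((vocab_size, transition_factor_size))
    # Rank of the product is at most transition_factor_size.
    return e1 @ e2.T  # shape: (vocab_size, vocab_size)

def beam_restricted_scores(emissions, crf_beam_size=64):
    """Keep only the top-`crf_beam_size` tokens per position, so the CRF
    dynamic program runs over a small beam instead of the full vocabulary."""
    # emissions: (seq_len, vocab_size) unary scores from the decoder.
    beam = np.argsort(-emissions, axis=-1)[:, :crf_beam_size]
    beam_emissions = np.take_along_axis(emissions, beam, axis=-1)
    return beam, beam_emissions
```

With vocab size V and beam size k, the per-step transition lookup then only touches a k x k slice of the factored matrix rather than V x V, which is what makes CRF training tractable at translation vocabulary sizes.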
The other hyper-params are the same as the original Transformer, so they should be:
hparams.norm_type = "layer"
hparams.hidden_size = 512
hparams.batch_size = 4096
hparams.max_length = 256
hparams.clip_grad_norm = 0. # i.e. no gradient clipping
hparams.optimizer_adam_epsilon = 1e-9
hparams.learning_rate_decay_scheme = "noam"
hparams.learning_rate = 0.1
hparams.learning_rate_warmup_steps = 4000
hparams.initializer_gain = 1.0
hparams.num_hidden_layers = 6
hparams.initializer = "uniform_unit_scaling"
hparams.weight_decay = 0.0
hparams.optimizer_adam_beta1 = 0.9
hparams.optimizer_adam_beta2 = 0.98
hparams.num_sampled_classes = 0
hparams.label_smoothing = 0.1
hparams.shared_embedding_and_softmax_weights = True
# Add new ones like this.
hparams.add_hparam("filter_size", 2048)
# Layer-related flags. If zero, these fall back on hparams.num_hidden_layers.
hparams.add_hparam("num_encoder_layers", 0)
hparams.add_hparam("num_decoder_layers", 0)
# Attention-related flags.
hparams.add_hparam("num_heads", 8)
hparams.add_hparam("attention_key_channels", 0)
hparams.add_hparam("attention_value_channels", 0)
hparams.add_hparam("ffn_layer", "conv_hidden_relu")
hparams.add_hparam("parameter_attention_key_channels", 0)
hparams.add_hparam("parameter_attention_value_channels", 0)
# All hyperparameters ending in "dropout" are automatically set to 0.0
# when not in training mode.
hparams.add_hparam("attention_dropout", 0.0)
hparams.add_hparam("relu_dropout", 0.0)
hparams.add_hparam("pos", "timing") # timing, none
hparams.add_hparam("nbr_decoder_problems", 1)
hparams.add_hparam("proximity_bias", False)
hparams.add_hparam("use_pad_remover", True)
hparams.add_hparam("self_attention_type", "dot_product")
hparams.add_hparam("max_relative_position", 0)
hparams.layer_preprocess_sequence = "n"
hparams.layer_postprocess_sequence = "da"
hparams.layer_prepostprocess_dropout = 0.1
hparams.attention_dropout = 0.1
hparams.relu_dropout = 0.1
hparams.learning_rate_warmup_steps = 8000
hparams.learning_rate = 0.2
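For reference, the "noam" decay scheme named above is the warmup-then-inverse-square-root schedule from the original Transformer. A minimal sketch of its shape, assuming the formula from the Transformer paper — note that tensor2tensor additionally multiplies by hparams.learning_rate and an implementation-specific constant, so treat this as illustrative rather than a drop-in reimplementation:

```python
# Sketch of the "noam" learning-rate schedule: linear warmup for
# `warmup_steps` steps, then inverse-square-root decay, scaled by
# hidden_size ** -0.5. The peak occurs exactly at step == warmup_steps.

def noam_lr(step, hidden_size=512, warmup_steps=8000):
    step = max(step, 1)  # avoid division by zero at step 0
    return hidden_size ** -0.5 * min(step ** -0.5,
                                     step * warmup_steps ** -1.5)
```

So with the settings above (hidden_size=512, warmup_steps=8000), the rate rises linearly until step 8000 and decays as 1/sqrt(step) afterward; hparams.learning_rate acts as an overall multiplier on top of this curve.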
That's really helpful! I'll study the source code to make sure I get all hyper-parameters right. Thanks!
Dear authors,
May I ask for the hyper-parameters used in your paper for IWSLT (the smaller model) and WMT (the full model), such as the learning rate, warmup steps, batch size, and peak learning rate?
Thanks in advance!
Best, Yuntian