Edward-Sun / structured-nart


Optimization hyper-parameters used in the paper #3

Closed · da03 closed 4 years ago

da03 commented 4 years ago

Dear authors,

May I ask for the hyper-parameters used in your paper for IWSLT (the smaller model) and WMT (the full model), such as the learning rate, warmup steps, batch size, and maximum learning rate?

Thanks in advance!

Best, Yuntian

Edward-Sun commented 4 years ago

Hi Yuntian,

For our own tensor2tensor implementation, we use the same config as the standard Transformer.

For this model, we find that simply reusing the standard Transformer hyperparameters works well.
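
Assuming this refers to tensor2tensor's transformer_base hparams set (the smaller IWSLT model would presumably use a reduced variant), the optimization values can be inspected directly, e.g.:

# Sketch: inspect the optimization defaults of tensor2tensor's standard
# Transformer config. Assumes the model inherits from transformer_base.
from tensor2tensor.models import transformer

hparams = transformer.transformer_base()
print(hparams.learning_rate)               # multiplier applied to the schedule
print(hparams.learning_rate_warmup_steps)  # warmup steps
print(hparams.batch_size)                  # counted in tokens, not sentences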

For the fairseq implementation, I guess you can find the guides here:

https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation

da03 commented 4 years ago

Thank you for your reply! My questions are more about the optimization side, such as batch size and the learning-rate schedule (the fairseq instructions only cover WMT with 4 GPUs, not IWSLT, which I assume is trained on a single GPU). Also, unfortunately the fairseq instructions do not reproduce the WMT results reported in the paper (I need to double-check whether they implemented CRF or DCRF, but the scores seem to be a few BLEU points off even assuming it is CRF).

Edward-Sun commented 4 years ago

Hi Yuntian,

The original source code of structured-nat can be found at this link. However, it was written years ago in TensorFlow and has not been cleaned up, so I currently recommend using the fairseq implementation.

As you can see from the transformer_nat_crf.py file in this codebase, we use the following configs for NAT-CRF:

hparams.add_hparam("transition_factor_size", 32) hparams.add_hparam("crf_beam_size", 64) hparams.add_hparam("test_crf_beam_size", 32) hparams.add_hparam("loss_multiplier", 0.5)

The other hyper-params are the same as the original Transformer, so they should be:

hparams.norm_type = "layer"
hparams.hidden_size = 512
hparams.batch_size = 4096
hparams.max_length = 256
hparams.clip_grad_norm = 0.  # i.e. no gradient clipping
hparams.optimizer_adam_epsilon = 1e-9
hparams.learning_rate_decay_scheme = "noam"
hparams.learning_rate = 0.1
hparams.learning_rate_warmup_steps = 4000
hparams.initializer_gain = 1.0
hparams.num_hidden_layers = 6
hparams.initializer = "uniform_unit_scaling"
hparams.weight_decay = 0.0
hparams.optimizer_adam_beta1 = 0.9
hparams.optimizer_adam_beta2 = 0.98
hparams.num_sampled_classes = 0
hparams.label_smoothing = 0.1
hparams.shared_embedding_and_softmax_weights = True
# Add new ones like this.
hparams.add_hparam("filter_size", 2048)
# Layer-related flags. If zero, these fall back on hparams.num_hidden_layers.
hparams.add_hparam("num_encoder_layers", 0)
hparams.add_hparam("num_decoder_layers", 0)
# Attention-related flags.
hparams.add_hparam("num_heads", 8)
hparams.add_hparam("attention_key_channels", 0)
hparams.add_hparam("attention_value_channels", 0)
hparams.add_hparam("ffn_layer", "conv_hidden_relu")
hparams.add_hparam("parameter_attention_key_channels", 0)
hparams.add_hparam("parameter_attention_value_channels", 0)
# All hyperparameters ending in "dropout" are automatically set to 0.0
# when not in training mode.
hparams.add_hparam("attention_dropout", 0.0)
hparams.add_hparam("relu_dropout", 0.0)
hparams.add_hparam("pos", "timing")  # timing, none
hparams.add_hparam("nbr_decoder_problems", 1)
hparams.add_hparam("proximity_bias", False)
hparams.add_hparam("use_pad_remover", True)
hparams.add_hparam("self_attention_type", "dot_product")
hparams.add_hparam("max_relative_position", 0)
hparams.layer_preprocess_sequence = "n"
hparams.layer_postprocess_sequence = "da"
hparams.layer_prepostprocess_dropout = 0.1
hparams.attention_dropout = 0.1
hparams.relu_dropout = 0.1
hparams.learning_rate_warmup_steps = 8000
hparams.learning_rate = 0.2
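
Note that the later assignments override the earlier ones, so the effective values are warmup_steps = 8000 and learning_rate = 0.2. With the "noam" scheme, the learning rate warms up linearly for learning_rate_warmup_steps and then decays as the inverse square root of the step, with hparams.learning_rate acting as a multiplier on that shape. A rough sketch (tensor2tensor applies an additional internal scaling constant, so treat the absolute numbers as illustrative):

def noam_lr(step, hidden_size=512, warmup_steps=8000, multiplier=0.2):
  # Shape of the "noam" schedule: linear warmup, then inverse-sqrt decay.
  # tensor2tensor scales the same shape by hparams.learning_rate and an
  # internal constant, so the values returned here are illustrative, not exact.
  step = max(step, 1)
  return multiplier * hidden_size ** -0.5 * min(step ** -0.5,
                                                step * warmup_steps ** -1.5)

# The peak is reached at step == warmup_steps:
for s in (100, 4000, 8000, 16000, 64000):
  print(s, noam_lr(s))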

da03 commented 4 years ago

That's really helpful! I'll study the source code to make sure I get all hyper-parameters right. Thanks!