The ELECTRA-base 300k steps checkpoint can be found at https://szha-nlp.s3.amazonaws.com/output_electra_base/0300000.params. This should help reproduce the parameter loading issue in SQuAD fine-tuning.
@sxjscience I know that we hypothesized that the error in loading the pre-trained model in SQuAD fine-tuning is due to parameter deduplication during saving, but it doesn't seem immediately obvious which parameter the missing `encoder.all_encoder_layers.0.attn_qkv.weight` should be sharing its weight with. I see the following three parameters with an exact substring match: `discriminator.backbone_model.encoder.all_encoder_layers.0.attn_qkv.weight`, `disc_backbone.encoder.all_encoder_layers.0.attn_qkv.weight`, and `generator.backbone_model.encoder.all_encoder_layers.0.attn_qkv.weight`. Is it one of them?
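To narrow this down, a quick inspection script along the lines below should show which of those names the checkpoint actually stores. This is just a sketch: it assumes an MXNet 2.x environment where `mx.npx.load` returns the saved checkpoint as a plain dict of arrays (on MXNet 1.x, `mx.nd.load` plays the same role).

```python
# Sketch: list which attn_qkv weights the 0300000.params checkpoint actually
# contains, to see where the missing encoder.all_encoder_layers.0.attn_qkv.weight
# should come from. Assumes MXNet 2.x (mx.npx.load); use mx.nd.load on 1.x.
import mxnet as mx

params = mx.npx.load('output/0300000.params')

candidates = [
    'discriminator.backbone_model.encoder.all_encoder_layers.0.attn_qkv.weight',
    'disc_backbone.encoder.all_encoder_layers.0.attn_qkv.weight',
    'generator.backbone_model.encoder.all_encoder_layers.0.attn_qkv.weight',
]
for name in candidates:
    print(name, '->', 'present' if name in params else 'missing')

# Also dump every stored key that mentions attn_qkv, in case deduplication
# saved the shared weight under yet another name.
for key in sorted(params):
    if 'attn_qkv' in key:
        print(key, tuple(params[key].shape))
```

If only one of the three survives in the saved file, that is presumably the weight the fine-tuning script needs to re-tie to.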
Description
As part of #1413, I was running the ELECTRA-base model and found several issues along the way:

- dataloader `KeyError` error message
- SQuAD parameter loading error message (log below)
```
% python3 scripts/question_answering/run_squad.py \
    --model_name google_electra_base \
    --data_dir squad \
    --backbone_path output/0300000.params \
    --output_dir output_finetune \
    --version 1.1 \
    --do_eval \
    --do_train \
    --batch_size 32 \
    --num_accumulated 1 \
    --gpus 0 \
    --epochs 2 \
    --lr 3e-4 \
    --layerwise_decay 0.8 \
    --warmup_ratio 0.1 \
    --max_saved_ckpt 6 \
    --all_evaluate \
    --wd 0 \
    --max_seq_length 128 \
    --max_grad_norm 0.1
All Logs will be saved to output_finetune/finetune_squad1.1.log
2021-01-26 16:14:25,942 - root - INFO - Namespace(adam_betas='(0.9, 0.999)', adam_epsilon=1e-06, all_evaluate=True, backbone_path='output/0300000.params', batch_size=32, classifier_dropout=0.1, comm_backend='device', data_dir='squad', do_eval=True, do_train=True, doc_stride=128, dtype='float32', end_top_n=5, epochs=2.0, eval_batch_size=16, eval_log_interval=10, gpus='0', layerwise_decay=0.8, log_interval=50, lr=0.0003, max_answer_length=30, max_grad_norm=0.1, max_query_length=64, max_saved_ckpt=6, max_seq_length=128, model_name='google_electra_base', n_best_size=20, num_accumulated=1, num_train_steps=None, optimizer='adamw', output_dir='output_finetune', overwrite_cache=False, param_checkpoint=None, pre_shuffle_seed=100, round_to=None, save_interval=None, seed=100, start_top_n=5, untunable_depth=-1, version='1.1', warmup_ratio=0.1, warmup_steps=None, wd=0.0)
[16:14:26] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
Traceback (most recent call last):
  File "../question_answering/run_squad.py", line 1007, in
```
To Reproduce
Follow the steps in https://github.com/dmlc/gluon-nlp/blob/09f343564e4f735df52e212df87ca073a824e829/scripts/pretraining/README.md. See below for the exact commands I used.
Steps to reproduce
Environment
I ran both scripts on a p4dn.24xlarge instance with an environment bootstrapped by this CloudFormation template. Details on some important dependencies:
Horovod 0.21.1, installed with:
```
HOROVOD_WITH_MXNET=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_GLOO=1 python3 -m pip install --no-cache-dir horovod
```
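For completeness, a short check like the following can confirm that the resulting Horovod build actually picked up MXNet and NCCL support before launching pretraining. This is a sketch, and it assumes `horovod.mxnet` re-exports the usual `mpi_built()` / `nccl_built()` helpers in 0.21.1; `horovodrun --check-build` prints a similar summary.

```python
# Sketch: verify the Horovod 0.21.1 build has the MXNet extension and NCCL
# support enabled. Assumes horovod.mxnet exposes mpi_built()/nccl_built();
# running `horovodrun --check-build` is another way to see the same info.
import horovod.mxnet as hvd

hvd.init()
print('Horovod rank {} of {}'.format(hvd.rank(), hvd.size()))
print('MPI built: ', hvd.mpi_built())
print('NCCL built:', hvd.nccl_built())
```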