ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

Problems with parameters for large and medium models #73

Closed. denismashukov closed this issue 3 years ago.

denismashukov commented 3 years ago

Hi, please help me figure out the parameters for large and medium models 🤔

Thanks,

ubuntu@ip-172-31-25-179:~$ python3 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --logging-dir="log" \
  --save model \
  --load-huggingface sberbank-ai/rugpt3large_based_on_gpt2 \
  --save-interval 1000 \
  --model-parallel-size 1 \
  --num-layers 24 \
  --hidden-size 1536 \
  --num-attention-heads 16 \
  --batch-size 1 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --vocab-size 50257 \
  --train-iters 200000 \
  --resume-dataloader \
  --distributed-backend nccl \
  --lr 0.00015 \
  --lr-decay-style cosine \
  --weight-decay 1e-2 \
  --warmup .01 \
  --log-interval 100 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed

using world size: 1 and model-parallel size: 1
using dynamic loss scaling
initializing model parallel with size 1
Pretrain GPT3 model
arguments:
  attention_dropout ............ 0.1
  num_attention_heads .......... 16
  hidden_size .................. 1536
  intermediate_size ............ None
  num_layers ................... 24
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 2048
  vocab_size ................... 50257
  deep_init .................... False
  make_vocab_size_divisible_by . 8
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  sparse_mode .................. all
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  batch_size ................... 1
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  True
  clip_grad .................... 1.0
  train_iters .................. 200000
  log_interval ................. 100
  logging_dir .................. log
  exit_interval ................ None
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... None
  lr_decay_style ............... cosine
  lr ........................... 0.00015
  min_lr ....................... 1e-06
  warmup ....................... 0.01
  save ......................... model
  save_interval ................ 1000
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... None
  no_load_optim ................ False
  log_memory ................... False
  no_load_rng .................. False
  load_huggingface ............. sberbank-ai/rugpt3large_based_on_gpt2
  export_huggingface ........... None
  huggingface_double_pos_embeddings  False
  load_tag .....................
  cache_prefix .................
  finetune ..................... False
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... None
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  tg_token_name ................ token.txt
  model_parallel_size .......... 1
  shuffle ...................... False
  train_data ................... None
  use_npy_data_loader .......... False
  train_data_path .............. train.list
  val_data_path ................
  test_data_path ............... valid.list
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 1000,1,1
  test_data .................... None
  overwrite_cache .............. False
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_path ............... None
  cache_dir .................... None
  use_tfrecords ................ False
  seq_length ................... 2048
  max_files_per_process ........ 100
  max_preds_per_seq ............ None
  cuda ......................... True
  rank ......................... 0
  world_size ................... 1
  dynamic_loss_scale ........... True
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
Load tokenizer from sberbank-ai/rugpt3large_based_on_gpt2
Load RuGPT3 Dataset from train.list, 100 files per process
R0/1: Loading dataset train.list
R0/1: Check filelist train.list with root dir
R0/1: Shard [0, 1]
R0/1: Loaded 0/1 files
R0/1: Loaded 9 examples, 18432 tokens
Load RuGPT3 Dataset from valid.list, 100 files per process
R0/1: Loading dataset valid.list
R0/1: Check filelist valid.list with root dir
R0/1: Shard [0, 1]
0%|          | 0/1 [00:00<?, ?it/s]R0/1: Loaded 0/1 files
100%|██████████| 1/1 [00:00<00:00, 52.79it/s]
R0/1: Loaded 103 examples, 210944 tokens
padded vocab (size: 50257) with 7 dummy tokens (new size: 50264)
end-of-document token: 0
building GPT3 model ...
Load huggingface model from sberbank-ai/rugpt3large_based_on_gpt2
Traceback (most recent call last):
  File "ru-gpts/pretrain_gpt3.py", line 830, in <module>
    main()
  File "ru-gpts/pretrain_gpt3.py", line 786, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "ru-gpts/pretrain_gpt3.py", line 177, in setup_model_and_optimizer
    model = get_model(args)
  File "ru-gpts/pretrain_gpt3.py", line 78, in get_model
    model = load_huggingface_model(model, args.load_huggingface, args.huggingface_double_pos_embeddings)
  File "/home/ubuntu/ru-gpts/src/utils.py", line 485, in load_huggingface_model
    move_weights(model2fill, h_model, double_pos_embeddings)
  File "/home/ubuntu/ru-gpts/src/utils.py", line 465, in move_weights
    load_weights(transformer_model.wte, our.word_embeddings, dst2src)
  File "/home/ubuntu/ru-gpts/src/utils.py", line 432, in load_weights
    load.copy_(data)
RuntimeError: The size of tensor a (50264) must match the size of tensor b (50257) at non-singleton dimension 0
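
The sizes in the error match the padding message above: with make_vocab_size_divisible_by set to 8, the 50257-token vocab is rounded up to the next multiple of 8 (50264, i.e. 7 dummy tokens), while the word embedding in the sberbank-ai/rugpt3large_based_on_gpt2 checkpoint has exactly 50257 rows. A minimal sketch of that arithmetic, assuming the padding simply rounds up to the nearest multiple (an illustration, not the repository's actual helper):

# Sketch only: rounds the vocab size up to the nearest multiple of
# make_vocab_size_divisible_by * model_parallel_size (assumed behaviour).
def padded_vocab_size(vocab_size: int, divisible_by: int, model_parallel_size: int = 1) -> int:
    multiple = divisible_by * model_parallel_size
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(padded_vocab_size(50257, 8))  # 50264 -> 7 dummy tokens, mismatches the 50257-row checkpoint
print(padded_vocab_size(50257, 1))  # 50257 -> matches the checkpoint embedding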

king-menin commented 3 years ago

Set --make-vocab-size-divisible-by=1 in your bash script.
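
For example, in the launch command from the issue the flag goes alongside the other model arguments (only the surrounding flags are shown here; the rest of the command stays unchanged):

python3 ru-gpts/pretrain_gpt3.py \
  ... \
  --vocab-size 50257 \
  --make-vocab-size-divisible-by=1 \
  --load-huggingface sberbank-ai/rugpt3large_based_on_gpt2 \
  ...

With this setting the vocab is not padded (50257 is already divisible by 1), so it should stay at 50257 and the size mismatch from the traceback should go away.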

denismashukov commented 3 years ago

Thanks