Tencent / PatrickStar

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP and democratizes AI for everyone.
BSD 3-Clause "New" or "Revised" License

Support TencentPretrain #57

Open feifeibear opened 3 years ago

feifeibear commented 3 years ago

TencentPretrain is a repo from the TEG Data Security Center; we can reuse its model structures and data (https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61). TencentPretrain also has a community open-source counterpart: https://github.com/dbiir/UER-py

feifeibear commented 3 years ago

I ran 500 steps with TencentPretrain's run_patrickstar.sh on a GeForce RTX 2060 and compared the logs.

PatrickStar:

Worker is training ...
| 100/ 500 steps| 6164.26 tokens/s| loss 7.15| acc: 0.045
| 200/ 500 steps| 6226.79 tokens/s| loss 6.30| acc: 0.060
| 300/ 500 steps| 6208.92 tokens/s| loss 6.17| acc: 0.077
| 400/ 500 steps| 6232.11 tokens/s| loss 5.97| acc: 0.097

PyTorch:

| 100/ 500 steps| 24822.88 tokens/s| loss 7.11| acc: 0.043
| 200/ 500 steps| 24331.83 tokens/s| loss 6.25| acc: 0.063
| 300/ 500 steps| 24246.47 tokens/s| loss 6.10| acc: 0.080
| 400/ 500 steps| 24210.41 tokens/s| loss 5.92| acc: 0.094
| 500/ 500 steps| 23966.24 tokens/s| loss 5.87| acc: 0.105

The accuracy looks very similar; the throughput is lower, but that is likely because the model is so small that PatrickStar's overhead dominates. PatrickStar can raise the batch size to 128, reaching a throughput of 39072.26 tokens/s.

feifeibear commented 3 years ago

The CPU embedding implementation on the current develop branch is broken: on TPT it fails to converge. With "use_cpu_embedding": True it converges correctly; with False the result is:

| 100/ 500 steps| 65949.17 tokens/s| loss 7.15| acc: 0.053
| 200/ 500 steps| 67712.70 tokens/s| loss 6.40| acc: 0.043
| 300/ 500 steps| 67740.57 tokens/s| loss 6.35| acc: 0.044
| 400/ 500 steps| 67014.54 tokens/s| loss 6.39| acc: 0.043
| 500/ 500 steps| 66395.44 tokens/s| loss 6.37| acc: 0.044

The non-converging run is twice as fast as the correct one, which suggests Adam may not be performing updates at all.

At c177176, "use_cpu_embedding": True converges correctly (log below); False behaves the same as above:

| 100/ 500 steps| 31161.26 tokens/s| loss 7.13| acc: 0.056
| 200/ 500 steps| 31332.15 tokens/s| loss 6.02| acc: 0.103
| 300/ 500 steps| 31321.69 tokens/s| loss 5.60| acc: 0.140
| 400/ 500 steps| 31348.33 tokens/s| loss 5.35| acc: 0.161
| 500/ 500 steps| 31268.56 tokens/s| loss 5.17| acc: 0.174

PyTorch convergence for comparison:

| 100/ 500 steps| 53254.33 tokens/s| loss 6.87| acc: 0.054
| 200/ 500 steps| 53606.05 tokens/s| loss 5.84| acc: 0.106
| 300/ 500 steps| 53455.99 tokens/s| loss 5.39| acc: 0.150
| 400/ 500 steps| 53409.81 tokens/s| loss 5.20| acc: 0.170
| 500/ 500 steps| 52638.91 tokens/s| loss 5.11| acc: 0.176
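For context, the flag being toggled is part of the client config handed to PatrickStar. Below is a minimal sketch, assuming the initialize_engine API from the PatrickStar README and that use_cpu_embedding is a top-level key of that config; the optimizer/fp16 values and build_model are illustrative placeholders, not the exact TPT settings.

import torch.nn as nn
from patrickstar.runtime import initialize_engine

def build_model():
    # Placeholder; the actual runs build the TencentPretrain GPT-2 model.
    return nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 100))

config = {
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-4, "betas": (0.9, 0.999), "eps": 1e-6,
                   "weight_decay": 0, "use_hybrid_adam": True},
    },
    "fp16": {"enabled": True, "loss_scale": 0, "initial_scale_power": 10,
             "loss_scale_window": 1000},
    "default_chunk_size": 32 * 1024 * 1024,
    # The switch under test: True keeps the embedding as a plain torch
    # parameter computed on the CPU (converges); False hands it to chunk
    # management (diverges on develop in the runs above).
    "use_cpu_embedding": True,
}

model, optimizer = initialize_engine(
    model_func=build_model, local_rank=0, config=config
)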

feifeibear commented 3 years ago

"tie_weights": true不支持 如果用use_cpu_embedding会报错 image 如果不用则存在一个参数被复用的情况,触发已知的异常 File "/home/jiaruifang/codes/HybridPS/patrickstar/core/hook.py", line 179, in pre_sub_module_backward_function assert param.ps_attr.bwd_cnt == 0, f"Backward Propagation updates the gradient of a parameter twice. This is not allowed when using chunk reusing."

feifeibear commented 3 years ago

An awkward problem: people may well write code like this, and PatrickStar cannot tell when a weight tensor is shared by two params. https://git.woa.com/TencentNLP/TencentPretrain/blob/master/tencentpretrain/models/model.py#L21
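To make this concrete, here is a hypothetical sketch of the tying pattern in question (names are illustrative, not TencentPretrain's actual code). After the assignment, two submodules own the very same nn.Parameter, which is exactly the case PatrickStar currently cannot distinguish from two independent weights:

import torch.nn as nn

class TiedLM(nn.Module):
    """Toy LM whose output projection is tied to the input embedding."""

    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.output = nn.Linear(hidden, vocab_size, bias=False)
        # Weight tying: both submodules now hold the very same nn.Parameter.
        self.output.weight = self.embedding.weight

model = TiedLM()
assert model.output.weight is model.embedding.weight

# model.parameters() deduplicates the shared weight ...
print(len(list(model.parameters())))                      # 1
# ... but a per-submodule walk (what chunk bookkeeping does) sees it twice.
print(sum(1 for m in model.modules()
          for _ in m.named_parameters(recurse=False)))    # 2

This is also why the assertion in the previous comment fires: each owning submodule's backward hook touches the same parameter once, so its bwd_cnt reaches 2.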

For tied weights, i.e. the first-layer embedding weight sharing its parameter with the last-layer linear weight, the current problems are:

  1. use_cpu_embedding conflicts with tied weights: in the first layer the embedding weight is treated as a plain torch param and nn.Embedding runs on the CPU, but the last layer needs the same weight on the GPU, and pre_forward_hook cannot handle this correctly yet.
  2. When PreprocessCtx builds the model, the chunk-tensor-index contains a useless tensor (the one that should have been removed after sharing); see the dedup sketch after this list.
  3. With use_cpu_embedding=False, convergence is wrong. I am not sure the backward pass for shared parameters is implemented correctly. Bad-case reproduction: https://git.woa.com/jiaruifang/TencentPretrain/merge_requests/1
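On point 2, here is a minimal sketch (not PatrickStar's actual implementation) of the kind of dedup pass that would keep the chunk-tensor-index free of the duplicate entry: group parameters by underlying storage and register only the first owner.

import torch.nn as nn

def collect_unique_params(model: nn.Module):
    """Group parameter names by storage pointer so a tied weight is
    registered only once; a sketch of the bookkeeping needed here."""
    owners = {}  # data_ptr -> names of all params backed by that storage
    for mod_name, module in model.named_modules():
        for p_name, param in module.named_parameters(recurse=False):
            full_name = f"{mod_name}.{p_name}" if mod_name else p_name
            owners.setdefault(param.data_ptr(), []).append(full_name)
    unique = [names[0] for names in owners.values()]                   # register
    duplicates = [n for names in owners.values() for n in names[1:]]   # skip
    return unique, duplicates

On the TiedLM toy above, duplicates would be ['output.weight'], i.e. the tensor that point 2 says should be dropped from the chunk-tensor-index.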
zhuzilin commented 3 years ago

Environment

1xV100

Commands

python preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --target lm

python -m torch.distributed.launch --nproc_per_node=1 pretrain.py \
                    --dataset_path dataset.pt --vocab_path models/google_zh_vocab.txt \
                    --output_model_path models/output_model.bin \
                    --config_path models/gpt2/config_patrickstar_v2.json --learning_rate 1e-4 \
                    --world_size 1 --gpu_ranks 0 \
                    --embedding word_pos --remove_embedding_layernorm \
                    --encoder transformer --mask causal --layernorm_positioning pre \
                    --target lm \
                    --total_steps 500 --batch_size 64 \
                    --fp16 --report_steps 100 \
                    --use_patrickstar

Configuration

{
  "emb_size": 768,
  "feedforward_size": 3072,
  "hidden_size": 768,
  "hidden_act": "gelu_fast",
  "heads_num": 4,
  "layers_num": 3,
  "max_seq_length": 1024,
  "dropout": 0.1,
  "embedding": "word_pos",
  "remove_embedding_layernorm": true,
  "encoder": "transformer",
  "mask": "causal",
  "layernorm_positioning": "pre",
  "target": "lm"
}
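
For a rough sense of scale (relevant to the earlier remark that the model is small enough for PatrickStar's overhead to dominate), here is a back-of-the-envelope weight count for this config, assuming models/google_zh_vocab.txt has the usual 21128 entries:

# Weight-only estimate for the 3-layer config above; vocab size is assumed.
vocab_size, seq_len = 21128, 1024
hidden, ffn, layers = 768, 3072, 3

embeddings = vocab_size * hidden + seq_len * hidden    # word + position
per_layer = 4 * hidden * hidden + 2 * hidden * ffn     # attention + feed-forward
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")                # about 38M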

Results: