guijuzhejiang closed this issue 2 years ago
@sserdoubleh Could you please point me in the right direction?
Add use_sharding="True" to the training script, and set sharding_degree in train_args to the number of GPUs.
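Expressed as a config snippet, the suggestion looks roughly like the following. The variable names use_sharding and sharding_degree come from the advice above; the surrounding structure is an assumption about a Knover-style conf file and is not verified against any particular checkout:

```shell
# Sketch of the suggested change to a Knover training config (assumed layout).
# Enable sharded data parallelism:
use_sharding="True"

# Set sharding_degree to the number of GPUs (8 here):
train_args="${train_args} --sharding_degree 8"
```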
Thanks a lot. I just tried it, but it still won't run on 8 V100s.
Did you also enable recompute at the same time?
Thanks for the guidance. After setting use_sharding="True" and use_recompute="True", the model now loads onto the GPUs, but it crashes immediately afterwards. It looks like something went wrong at the framework level — could you take a look at what the problem is?
gen_comm_id_helper.cc:190] Server listening on: 127.0.0.1:52483 successful.
total params: 1401821184
Training is start.
/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/reader.py:136: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if arr.dtype == np.object:
Traceback (most recent call last):
  File "./knover/scripts/train.py", line 307, in <module>
    train(args)
  File "./knover/scripts/train.py", line 97, in train
    model = models.create_model(args, place)
  File "/home/swordwu/Knover_20211227/knover/models/__init__.py", line 46, in create_model
    return MODEL_REGISTRY[args.model](args, place)
  File "/home/swordwu/Knover_20211227/knover/models/unified_transformer.py", line 101, in __init__
    super(UnifiedTransformer, self).__init__(args, place)
  File "/home/swordwu/Knover_20211227/knover/core/model.py", line 146, in __init__
    self._build_programs()
  File "/home/swordwu/Knover_20211227/knover/core/model.py", line 233, in _build_programs
    scheduled_lr = self.optimize(metrics)
  File "/home/swordwu/Knover_20211227/knover/core/model.py", line 467, in optimize
    optimizer.minimize(metrics["loss"])
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/base/fleet_base.py", line 1500, in minimize
    optimize_ops, params_grads = meta_optimizer.minimize(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/meta_optimizer_base.py", line 94, in minimize
    optimize_ops, params_grads = self.minimize_impl(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py", line 511, in minimize_impl
    optimize_ops, params_grads = self._inner_opt_minimize(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py", line 252, in _inner_opt_minimize
    optimize_ops, params_grads = self.inner_opt.minimize(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/meta_optimizer_base.py", line 94, in minimize
    optimize_ops, params_grads = self.minimize_impl(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/amp_optimizer.py", line 116, in minimize_impl
    self.wrapped_opt.minimize(loss, startup_program,
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/contrib/mixed_precision/decorator.py", line 456, in minimize
    optimize_ops = self.apply_optimize(loss, startup_program,
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/contrib/mixed_precision/decorator.py", line 420, in apply_optimize
    optimize_ops = self.apply_gradients(params_grads)
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/contrib/mixed_precision/decorator.py", line 414, in apply_gradients
    optimize_ops = self._optimizer.apply_gradients(params_grads)
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/distributed/fleet/meta_optimizers/recompute_optimizer.py", line 83, in apply_gradients
    return self.wrapped_opt.apply_gradients(params_grads=params_grads)
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/optimizer.py", line 6090, in apply_gradients
    return self._optimizer.apply_gradients(params_grads=params_grads)
  File "/home/swordwu/Knover_20211227/knover/optim/adamw.py", line 47, in apply_gradients
    self._apply_weight_decay(params_grads)
  File "/home/swordwu/Knover_20211227/knover/optim/adamw.py", line 40, in _apply_weight_decay
    layers.assign(p * (1. - self.wd * self._learning_rate), p)
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/layers/math_op_patch.py", line 342, in __impl__
    current_block(self).append_op(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/framework.py", line 3178, in append_op
    op = Operator(
  File "/home/swordwu/miniconda3/envs/paddlenlp_py38/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2224, in __init__
    for frame in traceback.extract_stack():
NotFoundError: The variable X is not found when promote complex types.
  [Hint: var should not be null.] (at /paddle/paddle/fluid/framework/operator.cc:1653)
  [operator < elementwise_mul > error]
INFO 2022-03-26 16:21:55,868 launch_utils.py:320] terminate process group gid:4045055
INFO 2022-03-26 16:21:55,868 launch_utils.py:320] terminate process group gid:4045065
INFO 2022-03-26 16:21:55,868 launch_utils.py:320] terminate process group gid:4045070
INFO 2022-03-26 16:21:55,868 launch_utils.py:320] terminate process group gid:4045075
INFO 2022-03-26 16:21:59,873 launch_utils.py:341] terminate all the procs
ERROR 2022-03-26 16:21:59,873 launch_utils.py:602] ABORT!!! Out of all 8 trainers, the trainer process with rank=[0, 1, 3, 7] was aborted. Please check its log.
INFO 2022-03-26 16:22:03,877 launch_utils.py:341] terminate all the procs
INFO 2022-03-26 16:22:03,877 launch.py:311] Local processes completed.
Previously, training this way required modifying Paddle's source code, which was cumbersome. The simpler workaround now is to set weight_decay to 0, which lets training run directly. The static-graph optimizer code has also been reworked, and the updated code will be released later.
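For context on why that workaround helps: the traceback ends in Knover's decoupled weight-decay step, which scales each parameter by (1 - wd * lr) via an extra elementwise_mul before the Adam update, and it is that op which fails under sharding. With wd = 0 the multiplier is exactly 1, so the step is a mathematical no-op. A minimal plain-Python illustration of the arithmetic (not Paddle code, just the formula from knover/optim/adamw.py):

```python
def apply_weight_decay(params, wd, lr):
    """Decoupled weight decay: scale every parameter by (1 - wd * lr),
    mirroring `layers.assign(p * (1. - self.wd * self._learning_rate), p)`."""
    scale = 1.0 - wd * lr
    return [p * scale for p in params]

p = [1.0, -2.0, 0.5]

# Nonzero weight decay shrinks the parameters slightly toward zero...
decayed = apply_weight_decay(p, wd=0.01, lr=0.1)

# ...while wd = 0 leaves them untouched, which is why setting
# weight_decay to 0 sidesteps the failing elementwise_mul entirely.
unchanged = apply_weight_decay(p, wd=0.0, lr=0.1)
```

The trade-off, of course, is that training proceeds without any weight-decay regularization until the fixed optimizer code lands.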
Setting weight_decay to 0 does indeed make it run. Thanks a lot!
Thank you very much for open-sourcing the XL model. I am trying to train my own XL model on 8 A100s (40 GB each), but the parameter count is so large that memory is still insufficient. The PLATO-XL paper says: "Given the limited memory of each device, vanilla data parallelism cannot support the training of such a model with up to 11 billion parameters. As such, we adopt the sharded data parallelism (Rajbhandari et al., 2020) to eliminate memory redundancies, by partitioning the optimizer states, gradients and parameters across multiple devices." How can this kind of training, with model parameters partitioned across multiple GPUs, be implemented?
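To see why the partitioning in that quote matters, here is a back-of-the-envelope memory estimate. The byte counts follow the usual ZeRO-style accounting for mixed-precision Adam (roughly 2 B of fp16 weights, 2 B of fp16 gradients, and 12 B of fp32 optimizer state per parameter); the numbers are illustrative only and ignore activations and buffers, so real usage is higher:

```python
def per_gpu_gb(n_params, n_gpus,
               shard_optimizer=True, shard_grads=True, shard_params=True):
    """Rough per-GPU model-state memory (GB) for mixed-precision Adam:
    2 B fp16 weights + 2 B fp16 grads + 12 B fp32 optimizer state per param.
    Each component is divided by n_gpus only if it is sharded."""
    GB = 1024 ** 3
    params_b = 2 * n_params / (n_gpus if shard_params else 1)
    grads_b = 2 * n_params / (n_gpus if shard_grads else 1)
    optim_b = 12 * n_params / (n_gpus if shard_optimizer else 1)
    return (params_b + grads_b + optim_b) / GB

N = 11_000_000_000  # ~11B parameters (PLATO-XL scale)

# Vanilla data parallelism: every GPU replicates all model state,
# so ~16 B/param lands on each card -- far beyond a 40 GB A100.
full = per_gpu_gb(N, 8, shard_optimizer=False,
                  shard_grads=False, shard_params=False)

# Sharding optimizer states, gradients and parameters across 8 GPUs
# divides the model-state footprint by 8, bringing it under 40 GB
# (activations then need recompute to fit as well).
sharded = per_gpu_gb(N, 8)
```

This is exactly the motivation for combining use_sharding with use_recompute discussed earlier in this thread: sharding shrinks the model-state footprint, while recompute shrinks the activation footprint.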