CoinCheung / gdGPT

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Apache License 2.0

Multi-node model training #30

Open pipijiev12 opened 6 months ago

pipijiev12 commented 6 months ago

Is the multi-machine training here suitable for training large models across multiple nodes? Secondly, can the model be split into blocks that are assigned to different nodes for training? For example: training the ChatGLM3 model on a single node requires four GPUs with 48 GB of memory each. Could I instead use multi-node training to split the model across two nodes, each with four 24 GB GPUs?

CoinCheung commented 6 months ago

Yes, you can. You need to configure the layout of the model, and then write a hostfile to launch the training. You also need to configure your nodes so that they can connect to each other via ssh.
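
For reference, a minimal sketch of that setup (the host names, slot counts, and train_ds.py arguments below are placeholders, not the project's exact command; see the README for the real launch instructions):

    # hostfile: one line per node, where "slots" is the number of GPUs on that node
    node1 slots=4
    node2 slots=4

    # launched from node1; passwordless ssh from node1 to node2 must already work
    deepspeed --hostfile ./hostfile train_ds.py <arguments from the README>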

CoinCheung commented 6 months ago

I remember I wrote something about multi-node training in the README. Please take a look.

pipijiev12 commented 6 months ago

baichuan2_13b.txt

Based on the baichuan2-7b model file you provided, I used the baichuan2-13b model as a test and rewrote it into pipeline model layers. However, after the rewrite was completed, the following error was reported when running:

Traceback (most recent call last):
  File "./gdGPT_v1/train_ds.py", line 98, in
    loss = model_engine.train_batch()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./gdGPT_v1/deepspeed/runtime/pipe/engine.py", line 363, in train_batch
    self._exec_schedule(sched)
  File "./gdGPT_v1/deepspeed/runtime/pipe/engine.py", line 1346, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "./gdGPT_v1/deepspeed/runtime/pipe/engine.py", line 764, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "./miniconda3/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
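
For anyone hitting the same RuntimeError: in DeepSpeed's pipeline engine, the floating-point tensors a stage returns are passed to torch.autograd.backward during _exec_backward_pass, so a float output that never went through a parameter (for example an attention mask carried between stages) has no grad_fn and can raise exactly this error. The toy stage below is only an illustration of that failure mode under those assumptions, not gdGPT's actual layer code, and the workaround in the comment is one common pattern rather than a confirmed fix for baichuan2-13b:

    # Minimal illustration (assumed, not gdGPT's code) of the failure mode behind
    # "element 1 of tensors does not require grad and does not have a grad_fn".
    import torch
    import torch.nn as nn

    class ToyPipeStage(nn.Module):
        # A pipeline stage that forwards (hidden_states, attention_mask) to the next stage.
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(16, 16)

        def forward(self, inputs):
            hidden_states, attention_mask = inputs
            hidden_states = self.proj(hidden_states)
            # attention_mask is a float tensor that never touches a parameter, so it
            # has no grad_fn. When the pipeline engine runs backward on the stage
            # outputs, element 1 (the mask) triggers the RuntimeError above.
            # One common workaround is to make such inter-stage tensors require grad, e.g.:
            #   attention_mask = attention_mask.to(hidden_states.dtype).requires_grad_()
            return hidden_states, attention_mask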