artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Strange error while launching the code #204

Open zepmck opened 1 year ago

zepmck commented 1 year ago

Hi all, every time I try to launch the fine-tuning code on a DGX A100 system (8 GPUs), whether I run it serially or in parallel, I get the following error. Any suggestions on how to fix it?

```
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons:
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 288 with name base_model.model.gpt_neox.layers.35.mlp.dense_4h_to_h.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
```
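The message itself points at `_set_static_graph()` as a possible workaround. Below is a minimal, self-contained sketch of where that call would go on the DDP-wrapped model; it uses a toy `torch.nn.Linear` in a single-process `gloo` group rather than the actual qlora training loop, and whether the workaround is appropriate here depends on the model's graph really being static across iterations.

```python
# Minimal sketch of the _set_static_graph() workaround mentioned in the error.
# Toy single-process setup for illustration only, not the qlora training code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)
ddp_model = DDP(model)

# Tell DDP that the set of parameters used per iteration will not change,
# which is the workaround the error message suggests when autograd hooks
# fire more than once per parameter (e.g. with gradient checkpointing).
ddp_model._set_static_graph()

loss = ddp_model(torch.randn(4, 16)).sum()
loss.backward()
dist.destroy_process_group()
```

In a real launch this call would be made on whatever DDP wrapper the launcher (e.g. torchrun or accelerate) creates around the model.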
zepmck commented 1 year ago

Has anyone managed to launch the finetuning code? I am using the guanaco 7B finetuning sample script but still get the error above. Any help would be really appreciated. Thanks.

artidoro commented 12 months ago

I haven't seen that error before. I would suggest using one GPU for debugging, as this might be related to DDP. A single A100 should easily fit the 7B model. https://discuss.pytorch.org/t/ddp-and-gradient-checkpointing/132244
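If it does turn out to be the DDP plus gradient-checkpointing interaction discussed in that thread, the usual remedy is to switch to non-reentrant checkpointing. A minimal sketch with a toy module follows; the `use_reentrant` keyword is a standard `torch.utils.checkpoint.checkpoint` argument, but whether it can be threaded through this particular training script is an assumption.

```python
# Sketch of non-reentrant gradient checkpointing, which avoids the nested
# (reentrant) backward pass that reason 2) in the error message describes.
# Toy module for illustration only, not the qlora model.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
x = torch.randn(4, 16, requires_grad=True)

# With use_reentrant=False, gradients come from the ordinary autograd pass,
# so DDP's reducer sees each parameter marked ready only once per iteration.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```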