BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be trained directly like a GPT (parallelizable), combining the best of RNNs and transformers: great performance, fast inference, low VRAM usage, fast training, "infinite" ctx_len, and free sentence embeddings.

TypeError during loss.backward(): backward() takes 2 positional arguments but 3 were given #139

Closed Alchemy5 closed 1 year ago

Alchemy5 commented 1 year ago

When attempting to fine-tune various RWKV models (e.g., "RWKV/rwkv-raven-1b5"), I keep running into "TypeError: backward() takes 2 positional arguments but 3 were given" during the backward pass.

[screenshot of the error]
seahrh commented 1 year ago

I get the same error when trying to fine-tune RWKV/rwkv-4-7b-pile. The error seems to come from DeepSpeed's fp16 loss scaling. What are the right fp16 loss-scaling settings for RWKV-4?

File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
TypeError: backward() takes 2 positional arguments but 3 were given
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 278) of binary: /opt/conda/bin/python3.8
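
For reference, this particular TypeError is what autograd raises when a custom torch.autograd.Function's backward() accepts fewer gradient arguments than its forward() has outputs. A minimal sketch that reproduces the same message (illustrative only; this is not the RWKV WKV kernel):

import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Two outputs -> autograd passes TWO gradients to backward().
        return x * 2.0, x * 3.0

    @staticmethod
    def backward(ctx, grad_a):
        # Accepts only ONE gradient, so the dispatch fails with:
        # TypeError: backward() takes 2 positional arguments but 3 were given
        return 2.0 * grad_a

x = torch.ones(4, requires_grad=True)
a, b = TwoOutputs.apply(x)
(a.sum() + b.sum()).backward()

A version mismatch between deepspeed, pytorch-lightning, and the training code can surface the same way, which is why pinning versions (see the answer below) matters.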

My DeepSpeed config + PyTorch Lightning Trainer:

import torch
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger
from pytorch_lightning.strategies import DeepSpeedStrategy

# ZeRO stage 3 with optimizer and parameter offload to CPU
strategy = DeepSpeedStrategy(
    stage=3,
    offload_optimizer=True,
    offload_parameters=True,
)
trainer = Trainer(
    default_root_dir=self.conf["job_dir"],
    accelerator="gpu" if torch.cuda.is_available() else None,
    devices=devices,
    strategy=strategy,
    # strategy is always set above, so this always resolves to "bf16"
    precision=32 if strategy is None else "bf16",
    max_epochs=self.conf.getint("epochs"),
    callbacks=training_callbacks(patience=self.conf.getint("patience")),
    deterministic=False,
    logger=CSVLogger(save_dir=self.conf["job_dir"]),
)
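
One inconsistency worth noting: the Trainer above requests "bf16" precision (strategy is never None, so the 32 branch is dead code), while the DeepSpeed JSON below enables fp16. A hedged sketch of keeping the two aligned by passing the config directly to the strategy (the keys shown are my assumption of the intended setup, not a confirmed fix for this error):

from pytorch_lightning.strategies import DeepSpeedStrategy

# Illustrative: enable bf16 in DeepSpeed to match precision="bf16" in the Trainer.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
    "bf16": {"enabled": True},
}
strategy = DeepSpeedStrategy(config=ds_config)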

fp16 training options: what should I change here?

"fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "consecutive_hysteresis": false,
    "min_loss_scale": 1
}
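
For what it's worth, "loss_scale": 0 selects dynamic loss scaling: the scale starts at 2**initial_scale_power (65536 here), grows after loss_scale_window clean steps, and shrinks after the number of overflows allowed by hysteresis. A non-zero loss_scale pins a static scale instead; a sketch of that variant (the value 128 is an arbitrary illustration, not a known fix):

"fp16": {
    "enabled": true,
    "loss_scale": 128
}

That said, if the Trainer really runs in bf16, loss scaling should not be needed at all (bf16 has the same exponent range as fp32), so the cleaner change may be enabling "bf16" rather than "fp16" in this config.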
BlinkDL commented 1 year ago

Use the RWKV-v4neo code with deepspeed==0.7.0, pytorch-lightning==1.9.2, and torch==1.13.1+cu117.
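
For example, a pip install matching those pins (the extra index URL is the standard PyTorch cu117 wheel index; verify against your environment):

pip install deepspeed==0.7.0 pytorch-lightning==1.9.2
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117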