Hi, when following these instructions to run RWKV-v4neo with DDP, https://github.com/BlinkDL/RWKV-LM/blob/39a4d461a5102defd2a47f12b64b38466bf8ec4c/RWKV-v4neo/train.py#L23-L30, I got this error:
After digging into the code a little, I found that in the customized CUDA kernel, `u` is supposed to be a bf16 tensor: https://github.com/BlinkDL/RWKV-LM/blob/39a4d461a5102defd2a47f12b64b38466bf8ec4c/RWKV-v4neo/cuda/wkv_op_bf16.cpp#L5

But here `u` is a float: https://github.com/BlinkDL/RWKV-LM/blob/39a4d461a5102defd2a47f12b64b38466bf8ec4c/RWKV-v4neo/src/model.py#L60

A simple workaround is changing this line to `u = u.contiguous().bfloat16()`, and that works for me: https://github.com/BlinkDL/RWKV-LM/blob/39a4d461a5102defd2a47f12b64b38466bf8ec4c/RWKV-v4neo/src/model.py#L56
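For reference, here is a minimal sketch of the dtype mismatch and the cast. This is an illustrative stand-in, not the actual RWKV model code: `u` here is just a plain float32 parameter, and the assertion stands in for the check the bf16 kernel performs on its inputs.

```python
import torch

# Illustrative stand-in for the float32 parameter created in model.py
# (not the actual RWKV code).
u = torch.nn.Parameter(torch.zeros(8, dtype=torch.float32))

# The bf16 WKV kernel expects its input tensors to be bfloat16 and
# contiguous; a float32 tensor at this point is what triggers the error.
assert u.dtype == torch.float32

# The workaround: make the tensor contiguous and cast it to bfloat16
# before handing it to the custom CUDA op.
u_bf16 = u.contiguous().bfloat16()
assert u_bf16.dtype == torch.bfloat16
assert u_bf16.is_contiguous()
```

Note that `.bfloat16()` returns a new tensor rather than casting in place, so the result has to be assigned back (as in the `u = u.contiguous().bfloat16()` workaround above).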