eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

bfloat16 support, and an attempt at homogenizing model_dtype & precision #54

Closed: francoishernandez closed this 1 week ago

francoishernandez commented 2 weeks ago

[Charts: bfloat16 training curves, xent (X = steps) and speed (X = relative time)]
It seems to work relatively plug-n-play, but we might need to adapt a few things optimizer-wise:

We might investigate some bf16-specific implementations, e.g. https://github.com/arogozhnikov/adamw_bfloat16
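For context on why the optimizer may need adapting, here is a quick standalone illustration (not eole code) of bfloat16's coarse mantissa: small weight updates can be rounded away entirely when parameters and updates stay in bf16, which is what bf16-aware optimizers work around (e.g. by keeping fp32 master copies or using stochastic rounding).

```python
import torch

# bfloat16 keeps fp32's exponent range but only ~8 bits of mantissa,
# so a typical Adam-sized update near a weight of 1.0 can vanish:
w = torch.tensor(1.0, dtype=torch.bfloat16)
update = torch.tensor(1e-4, dtype=torch.bfloat16)
print(w - update)        # tensor(1., dtype=torch.bfloat16) -- update lost to rounding
print(w.float() - 1e-4)  # tensor(0.9999) -- preserved in fp32
```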

precision // model_dtype homogenization

Previously, model_dtype was used for training, with a "precision" deduced and applied depending on other settings (e.g. the optimizer), while precision was set separately in PredictConfig for inference. This PR factorizes precision at the common RunningConfig level; dtype (the actual dtype the model is cast to for training) is then deduced under the same conditions as before.
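A minimal sketch of the factorization described above, assuming a pydantic-style config; the class and field names and the deduction conditions are illustrative, not the exact eole code:

```python
from typing import Literal, Optional
import torch
from pydantic import BaseModel

class RunningConfig(BaseModel):
    # shared by training and inference, instead of a training-only
    # model_dtype plus a predict-only precision
    precision: Literal["fp32", "fp16", "bf16"] = "fp16"

def training_dtype(cfg: RunningConfig, optim: Optional[str] = None) -> torch.dtype:
    # the dtype the model is actually cast to for training is deduced
    # from precision plus optimizer-dependent conditions, as before
    if cfg.precision == "bf16":
        return torch.bfloat16
    if cfg.precision == "fp16" and optim == "fusedadam":
        return torch.float16
    return torch.float32
```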

TODOs:

francoishernandez commented 2 weeks ago

93158fe enables AMP for the bfloat16 case, which seems to work fine.
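For reference, a minimal standalone example (not the eole trainer) of what AMP with bfloat16 amounts to; unlike fp16, bf16 autocast typically needs no GradScaler because it keeps the fp32 exponent range:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(4, 16, device=device)

# forward pass runs in bfloat16 under autocast, master weights stay fp32;
# gradients rarely underflow in bf16, so no loss scaling is required
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()
optim.step()
```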

[Two screenshots attached (2024-07-04)]
francoishernandez commented 2 weeks ago

TODO

vince62s commented 2 weeks ago

For xlm-roberta-xl(xxl), which are natively fp32, I added this here: https://github.com/eole-nlp/eole/blob/166a18b272fb927334d109c3aa8f6e4aedf39f72/eole/bin/convert/convert_HF.py#L861 to convert them to fp16. Since we can convert any kind of model (and more and more are in bf16), I think that by default we could keep the original dtype, but add a flag to force storage in another dtype.
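A sketch of the suggested default-plus-override behaviour; the helper and flag values are hypothetical, not the actual convert_HF.py interface:

```python
from typing import Optional
import torch

_DTYPES = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}

def storage_dtype(original: torch.dtype, forced: Optional[str] = None) -> torch.dtype:
    """Keep the checkpoint's native dtype unless a target dtype is forced."""
    return _DTYPES[forced] if forced is not None else original

# xlm-roberta-xl ships in fp32; forcing fp16 halves the converted checkpoint:
print(storage_dtype(torch.float32, "fp16"))   # torch.float16
print(storage_dtype(torch.bfloat16))          # torch.bfloat16 (default: keep original)
```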