ai-forever / ru-gpts

Russian GPT3 models.
Apache License 2.0

The XL Model and the latest DeepSpeed #111

Open mgrankin opened 1 year ago

mgrankin commented 1 year ago

Time flies swiftly in the world of ML. Sparse models have lost their popularity, and the code for them is no longer maintained. The older version of Triton isn't compatible with modern hardware, and DeepSpeed's sparse attention functionality doesn't work with the newer Triton versions.

However, there's good news: a workaround exists. Sparse attention isn't truly necessary for the XL model to function. Instead, the model can be converted into a dense one. Simply remove the `sparse_attention` block from the deepspeed_config file, and voila - all sparse layers are instantly treated as dense layers. The dense model uses exactly the same weights, so no retraining is needed. This is logical: sparsity essentially acts as a mask that zeroes out attention to most tokens, but thanks to the softmax, most attention weights are already close to zero anyway, with only the most important positions carrying significant weight.
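A minimal sketch of the conversion, assuming the standard DeepSpeed JSON config layout; the file paths below are hypothetical placeholders, so point them at whichever sparse config you actually trained with:

```python
import json

# Hypothetical paths: the source is the sparse config you trained the XL model
# with, the destination is wherever you want the dense variant written.
SRC_CONFIG = "deepspeed_config_sparse.json"
DST_CONFIG = "deepspeed_config_dense.json"

with open(SRC_CONFIG) as f:
    cfg = json.load(f)

# Dropping the "sparse_attention" block makes DeepSpeed build ordinary dense
# attention layers; every other setting in the config stays untouched.
cfg.pop("sparse_attention", None)

with open(DST_CONFIG, "w") as f:
    json.dump(cfg, f, indent=2)

print(f"Wrote dense config to {DST_CONFIG}")
```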

It's worthwhile to measure the perplexity of the resulting dense model; its quality may even improve. The principle is similar to dropout: turning it off at inference time often gives better quality than keeping it on as during training.
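A rough sketch of such a perplexity check, assuming you have already loaded the converted dense checkpoint; `model` and `tokenizer` are placeholders here (a causal LM that maps token ids `[1, T]` to logits `[1, T, vocab]`, and a tokenizer with `encode`), not the repo's actual loader API:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, tokenizer, text, max_len=2048, device="cuda"):
    """Perplexity of a causal LM on a held-out text.

    Assumption: model(ids) returns logits of shape [1, T, vocab_size] and
    tokenizer.encode(text) returns a list of token ids.
    """
    ids = torch.tensor([tokenizer.encode(text)[:max_len]], device=device)
    logits = model(ids)                      # [1, T, vocab]
    # Next-token objective: position t predicts token t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = ids[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
    return math.exp(nll.item())

# Example: compare sparse vs. dense on the same validation text.
# print(perplexity(dense_model, tokenizer, open("valid.txt").read()))
```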

Most importantly, you don't need a custom-built DeepSpeed to run inference with the resulting model. Enjoy!
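For illustration, plain PyTorch greedy decoding with the dense model, no DeepSpeed custom ops involved; same placeholder assumptions as above (`model` returns logits `[1, T, vocab]`, `tokenizer` has `encode`/`decode`):

```python
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, prompt, steps=50, device="cuda"):
    """Greedy decoding in vanilla PyTorch -- no sparse kernels required."""
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(steps):
        logits = model(ids)                                  # [1, T, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```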