ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

[feat] Support triton cross-entropy for larger vocabularies #52

Open · jlamypoirier opened 2 days ago

jlamypoirier commented 2 days ago

🧐 Problem Description

Vocab size is limited to 64K (I think?) because of Triton's limit on the block size: the current kernel processes an entire row of logits in a single block, so the block must cover the full vocabulary (see the sketch below).
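
For illustration, here is a minimal sketch of the single-block pattern (hypothetical names, assuming contiguous logits; not the actual Fast-LLM kernel). One program handles one row, and `BLOCK_SIZE` must be at least the vocab size, which is where the block-size cap bites:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _cross_entropy_fwd(logits_ptr, labels_ptr, losses_ptr, n_cols,
                       BLOCK_SIZE: tl.constexpr):
    # One program per row: the entire vocab row is loaded as a single block,
    # so BLOCK_SIZE >= n_cols is required. Triton caps the block size
    # (~64K elements, per the issue), which in turn caps the vocab size.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    logits = tl.load(logits_ptr + row * n_cols + cols, mask=mask,
                     other=-float("inf")).to(tl.float32)
    label = tl.load(labels_ptr + row)
    # loss = logsumexp(logits) - logits[label], with the usual max
    # subtraction for numerical stability.
    max_logit = tl.max(logits, 0)
    lse = tl.log(tl.sum(tl.exp(logits - max_logit), 0)) + max_logit
    label_logit = tl.sum(tl.where(cols == label, logits, 0.0), 0)
    tl.store(losses_ptr + row, lse - label_logit)


def cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = logits.shape
    losses = torch.empty(n_rows, device=logits.device, dtype=torch.float32)
    # The single block must cover the whole vocab row.
    _cross_entropy_fwd[(n_rows,)](logits, labels, losses, n_cols,
                                  BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return losses
```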

💡 Proposed Solution

The standard way is to loop over blocks, as done for example in https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/cross_entropy.py (which served as the basis for the Fast-LLM implementation).

Fast-LLM removed the looping to simplify the code and to optimize the smaller-vocab case, so we'll need to bring it back. Some care will be needed to keep the current performance when looping is unnecessary (looping means multiple reads of the logits, etc.); a sketch of the looped variant follows.
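
As a rough illustration of the looped variant, in the online-softmax style of the flash-attention kernel linked above (hypothetical names and a contiguous-logits assumption; not the actual implementation): per-lane running max/sum accumulators are rescaled and updated one vocab block at a time, so the block size no longer bounds the vocab size.

```python
import triton
import triton.language as tl


@triton.jit
def _cross_entropy_fwd_looped(logits_ptr, labels_ptr, losses_ptr, n_cols,
                              BLOCK_SIZE: tl.constexpr):
    # One program per row, but the row is processed in BLOCK_SIZE chunks,
    # so the vocab size is no longer bounded by Triton's block-size limit.
    row = tl.program_id(0)
    label = tl.load(labels_ptr + row)
    row_ptr = logits_ptr + row * n_cols  # assumes contiguous logits

    # Online-softmax accumulators, one slot per lane: a running max and a
    # running sum of exponentials rescaled to that max. A large finite
    # sentinel (instead of -inf) keeps fully-masked lanes NaN-free.
    m = tl.full((BLOCK_SIZE,), -1e30, tl.float32)
    s = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    label_acc = tl.zeros((BLOCK_SIZE,), dtype=tl.float32)
    for start in range(0, n_cols, BLOCK_SIZE):
        cols = start + tl.arange(0, BLOCK_SIZE)
        logits = tl.load(row_ptr + cols, mask=cols < n_cols,
                         other=-1e30).to(tl.float32)
        new_m = tl.maximum(m, logits)
        # Rescale the old sums to the new per-lane max, then add this block.
        s = s * tl.exp(m - new_m) + tl.exp(logits - new_m)
        m = new_m
        # The label column matches in exactly one lane of one iteration.
        label_acc += tl.where(cols == label, logits, 0.0)

    # Combine the per-lane statistics into the row's logsumexp.
    row_max = tl.max(m, 0)
    lse = tl.log(tl.sum(s * tl.exp(m - row_max), 0)) + row_max
    # loss = logsumexp(logits) - logits[label]
    tl.store(losses_ptr + row, lse - tl.sum(label_acc, 0))
```

When `n_cols <= BLOCK_SIZE` the loop body runs exactly once, so one way to preserve the current fast path would be to dispatch between the single-block and looped kernels based on vocab size (or to specialize the number of blocks as a `constexpr`).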

🔄 Alternatives Considered

We have other implementations, but they are much slower.

📈 Potential Benefits

Faster training and lower memory usage with large vocab sizes.

📝 Additional Context