LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Add flash-attention patch for falcon-7b #3580

Closed (andreaskoepf closed this 1 year ago)

andreaskoepf commented 1 year ago

Enable the use_flash_attention configuration flag for Falcon models. When use_flash_attention is set to true, the FalconAttention.forward() method is replaced with a variant that uses Tri Dao's flash-attention instead of PyTorch's scaled_dot_product_attention function.
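
To illustrate the core of the swap, here is a minimal, self-contained sketch (not the actual patch in this PR, which replaces FalconAttention.forward() wholesale and has to handle alibi, the KV cache, attention masks, etc.). It assumes flash-attn 2.x and a CUDA device; the use_flash_attention argument here just mirrors the configuration flag:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # Tri Dao's flash-attention kernel


def attention_core(query, key, value, use_flash_attention: bool, dropout_p: float = 0.0):
    # query/key/value: (batch, seq_len, num_heads, head_dim), fp16/bf16 on CUDA.
    if use_flash_attention:
        # flash_attn_func consumes the (batch, seq_len, num_heads, head_dim) layout directly.
        return flash_attn_func(query, key, value, dropout_p=dropout_p, causal=True)
    # scaled_dot_product_attention expects (batch, num_heads, seq_len, head_dim).
    q, k, v = (t.transpose(1, 2) for t in (query, key, value))
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)
    return out.transpose(1, 2)


if __name__ == "__main__":
    # falcon-7b uses 71 attention heads with head_dim 64 (hidden_size 4544).
    q = torch.randn(1, 16, 71, 64, dtype=torch.float16, device="cuda")
    k, v = torch.randn_like(q), torch.randn_like(q)
    print(attention_core(q, k, v, use_flash_attention=True).shape)
```

Both code paths use the default 1/sqrt(head_dim) softmax scaling and causal masking, so the two branches are numerically interchangeable up to kernel precision.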

At the moment the patch only works for falcon-7b, but technically it would also work for falcon-40b with the right configuration. The Falcon model situation is currently a bit messy: the Falcon model was recently added to Hugging Face transformers (see PR transformers#24523), but the Falcon models on the Hugging Face hub still use the code that is shipped together with the weights (a PR to change this was reverted). Falcon-7b and falcon-40b each use slightly different code, which was unified in the HF transformers implementation and can be controlled there via a configuration member called new_decoder_architecture (see configuration_falcon.py#L65-L67). The HF Falcon implementation also uses different names in the configuration class, e.g. compare the new configuration_falcon.py with the old configuration_RW.py.
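
As a rough illustration of that naming difference (not code from this PR): the new in-library config exposes new_decoder_architecture and num_attention_heads, while the legacy code-with-weights config uses names like n_head. The exact legacy attribute names below are assumptions and may differ between checkpoint revisions:

```python
from transformers import AutoConfig


def describe_falcon_config(model_name_or_path: str) -> str:
    cfg = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
    if hasattr(cfg, "new_decoder_architecture"):
        # New in-library config (configuration_falcon.py): 40b-style vs 7b-style decoder.
        arch = "40b-style" if cfg.new_decoder_architecture else "7b-style"
        return f"HF Falcon config, {arch}, heads={cfg.num_attention_heads}"
    # Legacy code-with-weights config (configuration_RW.py); field names differ.
    return f"legacy RW config, heads={getattr(cfg, 'n_head', '?')}"


print(describe_falcon_config("tiiuae/falcon-7b"))
```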

Model configurations compatible with the HF Falcon implementation can be found here: 7B: config.json, 40B: config.json
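
A hedged sketch of how one of these configs could be used to run a checkpoint against the in-library implementation ("falcon-7b-config.json" is a placeholder for a local copy of the linked 7B file, and weight-key compatibility with the in-library FalconForCausalLM is assumed):

```python
from transformers import AutoModelForCausalLM, FalconConfig

# Load the HF-compatible config from a local JSON file instead of the hub config.
config = FalconConfig.from_json_file("falcon-7b-config.json")

# Passing config= makes AutoModelForCausalLM pick the in-library Falcon class,
# so no trust_remote_code is needed.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", config=config, torch_dtype="auto"
)
```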