Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Use pytorch2 optimized native attention #39

Open · attesaarela opened 1 year ago

attesaarela commented 1 year ago

Hi, here is a pull request for a small speedup: attention is computed with the PyTorch 2 function "torch.nn.functional.scaled_dot_product_attention" when it is available.

In a bit of testing I did, this makes training run about 10% faster.

This optimization was essentially copied from a recent version of nanoGPT.
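
For reference, a minimal sketch of the nanoGPT-style pattern this PR describes: check whether PyTorch 2's fused kernel exists and use it, otherwise fall back to manual attention with an explicit causal mask. This is not the exact diff from the PR; the class and parameter names (CausalSelfAttention, n_embd, n_head, block_size) are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Sketch of the attention change described above (not the exact PR diff)."""

    def __init__(self, n_embd, n_head, block_size, dropout=0.0):
        super().__init__()
        assert n_embd % n_head == 0
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)  # q, k, v projections
        self.c_proj = nn.Linear(n_embd, n_embd, bias=False)      # output projection
        self.n_head = n_head
        self.n_embd = n_embd
        self.dropout = dropout
        # Use PyTorch 2's fused kernel when available, as in recent nanoGPT.
        self.flash = hasattr(F, 'scaled_dot_product_attention')
        if not self.flash:
            # Fallback path: precompute a causal mask for manual attention.
            self.register_buffer(
                "bias",
                torch.tril(torch.ones(block_size, block_size))
                     .view(1, 1, block_size, block_size),
            )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape to (B, n_head, T, head_dim)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        if self.flash:
            # Fused attention (uses flash / memory-efficient kernels where supported).
            y = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True,
            )
        else:
            # Manual attention with an explicit causal mask.
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```

The hasattr check keeps the module usable on PyTorch 1.x while letting PyTorch 2 installs pick up the fused kernel automatically, which is where the reported speedup comes from.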