dvgodoy / PyTorchStepByStep

Official repository of my book: "Deep Learning with PyTorch Step-by-Step: A Beginner's Guide"
https://pytorchstepbystep.com
MIT License

will there be newer technologies in upcoming series? #55

Open jdgh000 opened 2 days ago

jdgh000 commented 2 days ago

I see SBS has 3 volumes, which I came to know by searching through Amazon. So far I like the coverage of attention and transformers, but we are increasingly working with even newer techniques, e.g.:

- FlashAttention: increasingly used to speed up inference/training by reducing memory traffic and integrating/fusing various GPU tasks (https://github.com/Dao-AILab/flash-attention). By that measure, plain attention is already getting outdated, as FlashAttention is much more performant under the same load (see the sketch at the end of this comment).
- splitK: my understanding of this one is more limited, but I believe it partitions the keys across GPU units to parallelize the computation further.

While those projects provide some demo code to showcase the benefits, it would be nice if the SBS examples also adopted these newer trends, for continuity, if there are newer volumes planned.
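For reference, a minimal sketch (my assumption of a typical call, going by the repo's README) of how the fused kernel is invoked through the `flash_attn` package; it requires a supported CUDA GPU and half-precision inputs:

```python
import torch
from flash_attn import flash_attn_func  # from the Dao-AILab/flash-attention repo

# (batch, seq_len, n_heads, head_dim) tensors in fp16/bf16 on the GPU
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Fused, tiled attention: the full (seq_len x seq_len) score matrix is never
# materialized in GPU memory, which is where the speedup comes from
out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```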

dvgodoy commented 2 days ago

Hi @jdgh000

I see your point: this field evolves pretty fast, and new tricks and tweaks are constantly being released.

I wouldn't go as far as saying that attention is outdated, though. The concept of attention remains as relevant as ever; what FlashAttention brings to the table is "just" doing the same thing in a more memory-efficient way. At its core, FlashAttention works by implementing an online (or tiled) softmax (the bottleneck of regular attention) and allocating and populating a single variable/structure in memory.

Apart from that, there are other things such as KV-caching and Multi-Query Attention (MQA) that also decrease memory usage by using fewer and pre-computed keys and values, so the bulk of attention itself relies mostly on the queries. There are also sliding-window attention and sparse attention, which reduce the number of tokens over which attention is computed, thus allowing for longer contexts.

All these things, however, are just tweaks to how attention is computed or to which tokens it is applied; the fundamental concept remains unchanged.
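To make that concrete, here's a minimal sketch (my own illustration, not code from the book) comparing "textbook" attention with PyTorch 2.x's `torch.nn.functional.scaled_dot_product_attention`, which computes the exact same function but can dispatch to a fused, FlashAttention-style kernel on supported GPUs:

```python
import math
import torch
import torch.nn.functional as F

batch, n_heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, n_heads, seq_len, head_dim)
k = torch.randn(batch, n_heads, seq_len, head_dim)
v = torch.randn(batch, n_heads, seq_len, head_dim)

# "Textbook" attention: materializes the full (seq_len x seq_len) score matrix,
# which is what dominates memory traffic for long sequences
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
out_plain = torch.softmax(scores, dim=-1) @ v

# Same computation through the fused entry point: when a FlashAttention-style
# backend is available, the softmax is done in tiles and the score matrix is
# never fully materialized
out_fused = F.scaled_dot_product_attention(q, k, v)

# Identical results up to floating-point noise
print((out_plain - out_fused).abs().max())
```

KV-caching, in the same spirit, isn't a different kind of attention either: during generation you simply keep the `k` and `v` tensors already computed for previous tokens and only compute the query (plus one new key/value pair) for the latest token.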

I haven't heard of splitK before but, from what I saw, it seems pretty low-level.

Right now, I'm working on a new, short book focused on the engineering topics one must know in order to fine-tune LLMs: quantization, low-rank adapters, and, of course, FlashAttention. You can learn more about this upcoming book here: https://leanpub.com/finetuning

I believe you'll like its TOC, as it goes along the lines of your suggestion :-)

You can also sign up there to get notified when it's published (and get a coupon too!)

Best, Daniel