allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

why customized optimizers? #566

Closed · joellliu closed this 2 months ago

joellliu commented 2 months ago

❓ The question

Hi OLMo team, thanks for the great work! As I browsed through the codebase, I noticed that you have implemented your own optimizers instead of using the vanilla optimizers from PyTorch. I wonder what the difference is between the PyTorch implementation and yours, and whether you observe better performance with your implementation. Thank you!

dumitrac commented 2 months ago

@joellliu , OLMo's AdamW optimizer is simply a code-organization choice: it groups PyTorch's AdamW optimizer, gradient clipping, and metrics collection into a single Python module. Note that the gradient clipping we've been using is functionally the same as FSDP.clip_grad_norm_(), but we also experimented with other forms of clipping, which had mixed results.

To summarize, our optimizer is not custom, in the sense that it's still PyTorch's AdamW; it simply groups additional functionality in one place, which lets us experiment more easily.
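In case it helps anyone reading this later, here is a minimal sketch of that kind of grouping; the class name, metric key, and hyperparameters are made up for illustration, and this is not OLMo's actual module:

```python
# Minimal sketch (not OLMo's code): keep PyTorch's AdamW, gradient clipping,
# and gradient metrics together in one place.
import torch
from torch.optim import AdamW


class AdamWWithMetrics(AdamW):  # hypothetical name, for illustration only
    def clip_grads_and_collect_metrics(self, max_grad_norm: float) -> dict:
        # Gather all parameters this optimizer updates.
        params = [p for group in self.param_groups for p in group["params"]]
        # Plain PyTorch clipping; FSDP.clip_grad_norm_() is the sharded equivalent.
        total_norm = torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
        # Collect whatever metrics you want to log alongside the step.
        return {"grad/total_norm": total_norm.item()}


model = torch.nn.Linear(16, 16)
optimizer = AdamWWithMetrics(model.parameters(), lr=1e-4, weight_decay=0.1)

loss = model(torch.randn(4, 16)).sum()
loss.backward()
metrics = optimizer.clip_grads_and_collect_metrics(max_grad_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

The point is just that clipping and the metrics you want to log live next to the optimizer, so the training loop stays clean.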

Does this answer your question?

joellliu commented 2 months ago

@dumitrac Thank you so much for the reply! That answers my question! I am also curious whether you have tried or implemented any techniques to increase throughput in multi-node distributed training. We tried both OLMo and TinyLlama on our cluster, and per-GPU throughput for TinyLlama drops a lot as we increase the number of nodes, while OLMo's stays relatively the same. So I am curious whether you have done any optimization for multi-node training. Thanks!

dumitrac commented 2 months ago

@joellliu - we mainly rely on FSDP for this. We have done some work with the profiler to avoid host-device syncs and other stalls, making sure the GPUs stay busy at all times.
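In case a concrete starting point helps, below is a generic torch.profiler sketch of the kind of thing one might run to look for host-device syncs and gaps in the GPU timeline; the toy model, sizes, and step count are placeholders, this is not OLMo's actual profiling setup, and it assumes a CUDA device is available:

```python
# Generic profiling sketch (not OLMo's setup): run a few training steps under
# torch.profiler and look for synchronization calls and idle GPU time.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024, device="cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Long cudaDeviceSynchronize / cudaStreamSynchronize entries, or gaps in the
# CUDA timeline, are the usual signs of host-device syncs and other stalls.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```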

joellliu commented 2 months ago

@dumitrac Thanks for the reply! Can you elaborate more on how you avoid host-device syncs and other stalls? Thank you!

dumitrac commented 2 months ago

@joellliu - the goal is to keep the GPU busy at all times (or as much as possible). I'm quoting @dirkgr's description below:

Basically, the way training works on GPUs is that the “host”, i.e., the Python process, issues a bunch of instructions to the GPU (the “device”): multiply this by that, calculate a mean, run some activation function, multiply the result by some other result, and so on, one after the other. These instructions go into a queue, and the GPU works through its queue doing all these tasks. Meanwhile, the host can keep running and do other stuff: issue more instructions, write log messages, read files, whatever. So the device can always be working, as long as the host can keep up issuing instructions.

But sometimes the host needs to make a decision based on the results of what the GPU is calculating. For example, when it wants to log what the latest loss is, when it wants to check whether we’re still on track, or when it wants to decide whether it’s time to write a checkpoint. At those times, the host has to wait until the device has worked its queue down to empty. Then the host does something with the result, and only then does it start filling up the queue again. This takes a long time, and it can often be subtle to see from the code when it happens.

You can work around this by either a) not doing this: just don’t do the action that causes a host-device sync, or do it less frequently; or b) exploiting the fact that GPUs actually have multiple work queues, so you can try to make sure at least one of them is always full.
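To make option a) concrete, here is a minimal sketch of the pattern (assuming a CUDA GPU; the toy model, loss, and `log_every` interval are placeholders, not OLMo's code): calling `.item()` on the loss every step forces the host to wait for the device, so accumulate the loss on the GPU and only sync once in a while.

```python
# Sketch of avoiding per-step host-device syncs (not OLMo's code): keep the
# running loss on the device and call .item() only every `log_every` steps.
import torch

model = torch.nn.Linear(1024, 1024, device="cuda")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

log_every = 100                                  # hypothetical interval
running_loss = torch.zeros((), device="cuda")    # stays on the GPU

for step in range(1, 1001):
    x = torch.randn(64, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    running_loss += loss.detach()                # device-side add, no sync

    if step % log_every == 0:
        # The only host-device sync happens here, once every `log_every` steps.
        print(f"step {step}: mean loss {(running_loss / log_every).item():.4f}")
        running_loss.zero_()
```

Option b) corresponds to issuing work on multiple CUDA streams (e.g. with `torch.cuda.Stream`) so that one queue can stay full while another drains; that is more involved and not shown here.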

joellliu commented 2 months ago

Thank you so much!