I can try to take this, but I want to make sure that I have the specs correct before I jump in. Is this the part of the code that requires modification?
That is, we want to replace the manually computed softmax with adaptive softmax, correct? Thanks!
I was assuming that the whole layer is on one GPU, but it doesn't seem to be the case. At the same time, if the logits are split across multiple GPUs, then it shouldn't OOM. Or maybe there's an option to split the logits but it isn't used? @TevenLeScao, any idea?
This is the place! `self.parallel_output` seems to be the option to split the logits; not sure at the moment if we're using it or not. I still feel like there's a chance it could OOM with tensor parallelism if it's on top of a tight memory allocation, but it does make it less likely.
I'll look back at the OOM to see if I can pinpoint the cause to this.
```
Traceback (most recent call last):
  File "/gpfsscratch/rech/six/uhk85as/Megatron-DeepSpeed/pretrain_gpt.py", line 230, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 149, in pretrain
    iteration = train(forward_step_func,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 692, in train
    train_step(forward_step_func,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 389, in train_step
    loss = model[0].train_batch(data_iter=data_iterator)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 291, in train_batch
    self._exec_schedule(sched)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 1237, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 587, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/engine.py", line 1164, in forward
    loss = self.module(*inputs, **kwargs)
  File "/gpfswork/rech/six/commun/conda/hf-prod/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 352, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 325, in exec_func
    inputs = layer(inputs)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 146, in float16_to_fp32
    return conversion_helper(val, float_conversion)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 118, in conversion_helper
    return conversion(val)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 144, in float_conversion
    val = val.float()
RuntimeError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 2; 15.78 GiB total capacity; 7.34 GiB already allocated; 1.65 GiB free; 12.04 GiB reserved in total by PyTorch)
```
This is the traceback; not sure where it happens, as the only hint is that it's in an fp16-to-fp32 conversion.
Thanks for the update @TevenLeScao! I suspect that this is where the conversion takes place.
So I assume `mpu.is_pipeline_last_stage()` returns true and it attempts the conversion. Could we simply make it skip that branch, or would this be the wrong approach? (I realize we've somewhat digressed from the original solution of adaptive softmax.)
@jaketae I don't think you're looking at the right place. I think it hits this line: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/gpt_model.py#L267 , then you can follow the stack trace. So essentially the issue is that we upcast the logits to fp32 to compute cross entropy; it probably doesn't even hit the cross-entropy layer yet.
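As a rough back-of-the-envelope (my own arithmetic, not from the traceback; the sequence length below is an assumption), the fp32 copy of the per-rank logits alone is on the order of the allocation that fails:

```python
# Rough back-of-the-envelope for the fp32 logits tensor (illustrative only).
# Assumptions: micro-batch size 8, sequence length 2048, ~250k multilingual
# vocab sharded over TP=4 ranks. The sequence length is my assumption here.
micro_batch = 8
seq_len = 2048
vocab_per_rank = 250_000 // 4          # ~62.5k logit columns per tensor-parallel rank

elements = micro_batch * seq_len * vocab_per_rank
fp32_bytes = elements * 4              # float_conversion materializes a full fp32 copy

print(f"{fp32_bytes / 2**30:.2f} GiB")  # ~3.81 GiB, the same order as the failed 3.82 GiB
```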
@jaketae, let me know if you need help sorting it out
Thank you @thomasw21 @stas00!
@thomasw21 I looked at the line you referenced, but I couldn't see where/how `self.specs` is used just by reading the code. I'll run some tests today and see how it's called.
But I guess the higher-level question is: if we do know where the upcasting is being called, is it okay to just turn that off? Or should we try to fundamentally tackle the problem with adaptive softmax?
Gentlemen, I have a silly question - I'm missing some context - it appears that you propose to switch to `AdaptiveLogSoftmaxWithLoss` to use less memory? Where are you hitting the OOM in the first place?
I ask since I don't see anywhere in the paper or the PyTorch docs a claim that it reduces memory usage; they discuss speed but not memory. I suppose by the nature of the approach it probably should be more memory-efficient as well: since it runs much faster, it presumably allocates less?
Thank you!
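For reference, per the PyTorch docs the module is used roughly like this (the shapes and cutoffs below are made up for illustration, not taken from our configs):

```python
import torch
import torch.nn as nn

# Minimal sketch of torch.nn.AdaptiveLogSoftmaxWithLoss per the PyTorch docs;
# hidden size, vocab size and cutoffs are illustrative only.
hidden_size, vocab_size = 1024, 250_000
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[20_000, 100_000],  # frequent tokens get the full-size head, rarer ones smaller tails
    div_value=4.0,
)

hidden = torch.randn(32, hidden_size)          # flattened (batch * seq, hidden) activations
targets = torch.randint(0, vocab_size, (32,))
out = adaptive(hidden, targets)                # named tuple: (output, loss)
print(out.output.shape, out.loss.item())       # log-probs of the target tokens, mean NLL
```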
@jaketae so `self.specs` is how we build a PP version of GPT-2, i.e. DeepSpeed supports lists of blocks. Concerning the upcasting, I'd say it is necessary for the softmax computation over the vocabulary (which is ~50k for English and ~250k for multilingual). So I don't think removing the upcasting would be possible. TBD.
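In other words, the usual pattern is to run the model in fp16 and upcast only at the loss, roughly like this sketch (not our exact code path, which additionally goes through the vocab-parallel cross entropy):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits_fp16: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # The model runs in fp16, but the softmax / cross entropy over a
    # 50k-250k vocabulary is accumulated in fp32: the log-sum-exp loses
    # too much precision (or overflows) in half precision.
    logits_fp32 = logits_fp16.float()  # this is the upcast that OOMs above
    return F.cross_entropy(
        logits_fp32.view(-1, logits_fp32.size(-1)),
        labels.view(-1),
    )
```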
Just to update here: what I'm trying at the moment is to make PP partitioning work not only at the transformer layers but also at the embed layers, but ZeRO-1 is getting in the way (it works w/o Z1). If we can split embed|transformer and not have the first and the last transformer layers bundled with the embed, then the problem will be solved.
You can read the debug log here: https://huggingface.slack.com/archives/C01NHER1JLS/p1635816476000400
It's really neat though: you just change from `type:transformer` to `type:embed|transformer` and it now considers embed layers equal to transformer layers when it decides how to partition the model. Just need to figure out how to make it work ;)
FYI, the OOM issue we're having with multilingual might be solved if we use MBS=1 instead of MBS=8 (TBD). If it does, I'd be in favor of dropping adaptive softmax. cc @TevenLeScao
We haven't added it yet, as I'm trying a different solution first as explained in my previous comment.
But you can also try other MBS between 8 and 1 :)
Update:
So I'm still waiting to hear from Shaden about the ZeRO-1 issue, but you can already use the following solution, except you have to turn ZeRO-1 off in the config file (just set it to 0):
```diff
diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py
index f0bb858..8af832e 100644
--- a/megatron/model/gpt_model.py
+++ b/megatron/model/gpt_model.py
@@ -280,4 +280,4 @@ class GPTModelPipe(PipelineModule,MegatronModule):
                          loss_fn=CrossEntropy,
                          topology=topo,
                          activation_checkpoint_interval=interval,
-                         partition_method='type:transformer')
\ No newline at end of file
+                         partition_method='type:embed|transformer')
```
This makes the embed layer on par with a transformer layer when it performs the partitioning, so the embed layer can now have a whole pipeline stage to itself and thus can be much larger.
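Conceptually, the `type:<regex>` partition method just marks every layer whose class name matches the regex and then splits those marked layers into roughly equal contiguous groups, one per pipeline stage. A simplified sketch of the idea (not DeepSpeed's actual implementation):

```python
import re

def partition_by_type(layer_class_names, pattern, num_stages):
    """Simplified illustration of 'type:<regex>' partitioning (not DeepSpeed's
    real code): only layers whose class name matches the regex count towards
    the balance; each stage gets a contiguous, near-equal share of them."""
    matching = [i for i, name in enumerate(layer_class_names)
                if re.search(pattern, name, re.IGNORECASE)]
    bounds = [0]
    for stage in range(1, num_stages):
        last_owned = matching[len(matching) * stage // num_stages - 1]
        bounds.append(last_owned + 1)
    bounds.append(len(layer_class_names))
    return bounds  # stage s spans layers [bounds[s], bounds[s+1])

layers = ["EmbeddingPipe"] + ["ParallelTransformerLayerPipe"] * 6 + ["EmbeddingPipe"]
# With 'transformer' only the 6 transformer layers count and the embeddings ride along;
# with 'embed|transformer' the two EmbeddingPipe layers count too and get their own share.
print(partition_by_type(layers, r"transformer", 4))        # [0, 2, 4, 5, 8]
print(partition_by_type(layers, r"embed|transformer", 4))  # [0, 2, 4, 6, 8]
```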
You can follow the bug report here: https://github.com/microsoft/DeepSpeed/issues/1522
Somehow I doubt this will get solved any time soon, as this is an incompatibility between the pipe and zero.
Please let me know if you'd still like me to look into the adaptive softmax.
p.s. additionally, if we figure out how to use BNB we save quite a nice chunk of memory there - 3/4 of the optimizer states (the jury is still out on how well it works - I'm actively working on that one with Tim). But if it works we can definitely turn ZeRO-1 off: with only 1/4 of the optimizer-state memory we don't really need to shard the optimizer states (which is what ZeRO-1 does) - unsharded BNB takes exactly the same amount of memory as fp32 states sharded over 4 GPUs (i.e. TP=4).
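To make the 3/4 figure concrete, here is a rough per-parameter accounting (my own arithmetic, counting only Adam's two moments and ignoring fp32 master weights and BNB's small per-block quantization constants):

```python
# Rough optimizer-state accounting at the ~104B-parameter scale (illustrative).
params = 104e9

fp32_adam_gib = params * (4 + 4) / 2**30   # fp32 exp_avg + exp_avg_sq: 8 bytes/param
bnb_8bit_gib  = params * (1 + 1) / 2**30   # 8-bit exp_avg + exp_avg_sq: 2 bytes/param

print(f"fp32 Adam states: {fp32_adam_gib:.0f} GiB")  # ~775 GiB
print(f"8-bit BNB states: {bnb_8bit_gib:.0f} GiB")   # ~194 GiB, i.e. 1/4 - same as fp32 states sharded over 4 GPUs
```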
tl;dr: OK, so I tested my proposal above on 32 nodes, and it's a success, conditional on either BNB or ZeRO-1 working.
I did the test with TP=4 PP=32 - the same as the 104B setup (but with a few layers less).
So with the above change and lowering the number of layers by 2, I get this new map:
```
stage=0 layers=4
     0: _to_float16
     1: EmbeddingPipe
     2: <lambda>
     3: ParallelTransformerLayerPipe
stage=1 layers=2
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
[...]
stage=30 layers=2
    62: ParallelTransformerLayerPipe
    63: ParallelTransformerLayerPipe
stage=31 layers=5
    64: ParallelTransformerLayerPipe
    65: <lambda>
    66: MixedFusedLayerNorm
    67: EmbeddingPipe
    68: float16_to_fp32
  loss: CrossEntropy
```
Note, we had to chop off 2 layers!
So now `EmbeddingPipe` and `ParallelTransformerLayerPipe` are on equal ground. The former is about 2x larger than the latter, so the first and last stages, where an `EmbeddingPipe` shares a stage with a `ParallelTransformerLayerPipe`, are about 1.5x larger than the stages with 2 `ParallelTransformerLayerPipe` layers.
If need be we could further re-tune the partitioning to give `EmbeddingPipe` more space - actually we will most likely want to do that to equalize the weights and utilize the GPUs better. So: a whole stage for `EmbeddingPipe` (twice), and the rest is pairs of `ParallelTransformerLayerPipe`.
Memory usage of the first GPU is only 24GB now and the middle ones only 18GB, so it's not very balanced, but as you can see there is plenty of free GPU memory to go around.
We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a `partition_method` like `type:embed:2|transformer:1` - or something along those lines - so that the embed layers get a 2x partitioning weight, get their own stage, and all stages become more balanced.
I created an issue for it: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/186
And we will then need to chop off 4 layers in the config, so instead of 64 it'll be 60.
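A hypothetical parser for such a weighted spec could look like the sketch below; the `type:embed:2|transformer:1` syntax is only the proposal from that issue, not something DeepSpeed supports today:

```python
import re

def parse_weighted_types(spec):
    """Hypothetical parser for a weighted spec like 'type:embed:2|transformer:1':
    each alternative may carry an optional integer weight (default 1).
    A sketch of the proposal only, not existing DeepSpeed behaviour."""
    assert spec.startswith("type:")
    weights = {}
    for part in spec[len("type:"):].split("|"):
        name, _, weight = part.partition(":")
        weights[name] = int(weight) if weight else 1
    return weights

def layer_weights(layer_class_names, spec):
    # Each layer contributes the weight of the first pattern its class name
    # matches, so EmbeddingPipe counts twice as much as a transformer layer
    # below and the balancer would tend to give it its own stage.
    weights = parse_weighted_types(spec)
    out = []
    for name in layer_class_names:
        w = 0
        for pattern, weight in weights.items():
            if re.search(pattern, name, re.IGNORECASE):
                w = weight
                break
        out.append(w)
    return out

layers = ["EmbeddingPipe"] + ["ParallelTransformerLayerPipe"] * 4 + ["EmbeddingPipe"]
print(layer_weights(layers, "type:embed:2|transformer:1"))  # [2, 1, 1, 1, 1, 2]
```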
Now, because of the bug in Z1 https://github.com/microsoft/DeepSpeed/issues/1522 we lose the nice sharding of optimizer states over 4 GPUs (TP=4), which is a huge loss. But by adding `--use-bnb-optimizer` we use only 1/4 of the memory consumed by the optimizer states, so we end up with exactly the same memory usage. If ZeRO-1 gets fixed, there will be further savings.
We are in the process of experimenting with BNB to verify that it produces results similar to normal Adam and thus can be used for our training.
I will update this shortly with more details and the script.
Here is the full slurm script:
Note that it requires the application of:
```
git cherry-pick 2f03484
git cherry-pick dab2064
```
If BNB doesn't work, then we have to get this DeepSpeed issue fixed: https://github.com/microsoft/DeepSpeed/issues/1522 - and then it too will work.
Actually, `partition_method='parameters'`:
```diff
diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py
index f0bb858..3a3a9db 100644
--- a/megatron/model/gpt_model.py
+++ b/megatron/model/gpt_model.py
@@ -280,4 +280,4 @@ class GPTModelPipe(PipelineModule,MegatronModule):
                          loss_fn=CrossEntropy,
                          topology=topo,
                          activation_checkpoint_interval=interval,
-                         partition_method='type:transformer')
+                         partition_method='parameters')
```
gives us a very similar partitioning to `partition_method='type:embed|transformer'`, but it looks less balanced - one of the GPUs on node 0 consumes about 4GB more than the others, which is not good:
```
| 0 N/A N/A 390848 C ...a/cutting-edge/bin/python 24071MiB |
| 1 N/A N/A 390849 C ...a/cutting-edge/bin/python 24799MiB |
| 2 N/A N/A 390850 C ...a/cutting-edge/bin/python 23999MiB |
| 3 N/A N/A 390851 C ...a/cutting-edge/bin/python 28159MiB |
```
So manual partitioning appears to be more beneficial.
p.s. partitioning by `parameters` doesn't take into account non-parameter-related memory allocations, and that's probably why we get the weird imbalance.
Closing, as we reduced the memory footprint without having to implement adaptive softmax.
This: https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveLogSoftmaxWithLoss.html
Why: reduce memory requirements with a large vocabulary. Relevant for the multilingual experiments.