I can try to take this, but I want to make sure that I have the specs correct before I jump in. Is this the part of the code that requires modification?
That is, we want to replace the manually computed softmax with adaptive softmax, correct? Thanks!
I was assuming that the whole layer is on one GPU, but it doesn't seem to be the case. At the same time, if the logits are split across multiple GPUs, then it shouldn't OOM. Or maybe there's an option to split the logits but it isn't used? @TevenLeScao, any idea?
This is the place! `self.parallel_output` seems to be the option to split the logits; not sure at the moment if we're using it or not. I still feel like there's a chance it could OOM with tensor parallelism if it's on top of a tight memory allocation, but it does make it less likely.
I'll look back at the OOM to see if I can pinpoint the cause to this.
```
Traceback (most recent call last):
  File "/gpfsscratch/rech/six/uhk85as/Megatron-DeepSpeed/pretrain_gpt.py", line 230, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 149, in pretrain
    iteration = train(forward_step_func,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 692, in train
    train_step(forward_step_func,
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/training.py", line 389, in train_step
    loss = model[0].train_batch(data_iter=data_iterator)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 291, in train_batch
    self._exec_schedule(sched)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 1237, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/engine.py", line 587, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/engine.py", line 1164, in forward
    loss = self.module(*inputs, **kwargs)
  File "/gpfswork/rech/six/commun/conda/hf-prod/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 352, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/gpfsssd/worksf/projects/rech/six/commun/code/deepspeed-big-science/deepspeed/runtime/pipe/module.py", line 325, in exec_func
    inputs = layer(inputs)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 146, in float16_to_fp32
    return conversion_helper(val, float_conversion)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 118, in conversion_helper
    return conversion(val)
  File "/gpfsssd/scratch/rech/six/uhk85as/Megatron-DeepSpeed/megatron/model/module.py", line 144, in float_conversion
    val = val.float()
RuntimeError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 2; 15.78 GiB total capacity; 7.34 GiB already allocated; 1.65 GiB free; 12.04 GiB reserved in total by PyTorch)
```
This is the traceback; not sure where it happens, as the only hint is that it's in an fp16-to-fp32 conversion.
Thanks for the update @TevenLeScao! I suspect that this is where the conversion takes place.
So I assume `mpu.is_pipeline_last_stage()` returns true and it attempts the conversion. Could we simply make it skip that branch, or would this be the wrong approach? (I realize we've somewhat digressed from the original solution of adaptive softmax.)
@jaketae I don't think you're looking at the right place. I think it hits this line: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/model/gpt_model.py#L267 , then you can follow the stack trace. So essentially the issue is that we upcast the logits to fp32 to compute cross entropy; it probably doesn't even hit the cross-entropy layer yet.
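As a rough back-of-the-envelope (my own arithmetic, not from the traceback; the sequence length below is an assumption), the fp32 copy of the per-rank logits alone is on the order of the allocation that fails:

```python
# Rough back-of-the-envelope for the fp32 logits tensor (illustrative only).
# Assumptions: micro-batch size 8, sequence length 2048, ~250k multilingual
# vocab sharded over TP=4 ranks. The sequence length is my assumption here.
micro_batch = 8
seq_len = 2048
vocab_per_rank = 250_000 // 4          # ~62.5k logit columns per tensor-parallel rank

elements = micro_batch * seq_len * vocab_per_rank
fp32_bytes = elements * 4              # float_conversion materializes a full fp32 copy

print(f"{fp32_bytes / 2**30:.2f} GiB")  # ~3.81 GiB, the same order as the failed 3.82 GiB
```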
@jaketae, let me know if you need help sorting it out
Thank you @thomasw21 @stas00!
@thomasw21 I looked at the line you referenced, but I couldn't see where/how `self.specs` is used just by reading the code. I'll run some tests today and see how it's called.
But I guess the higher-level question is: if we do know where the upcasting is being called, is it okay to just turn that off? Or should we try to fundamentally tackle the problem with adaptive softmax?
Gentlemen, I have a silly question - I'm missing some context - it appears that you propose to switch to `AdaptiveLogSoftmaxWithLoss` to use less memory? Where are you hitting the OOM in the first place?
I ask since I don't see anywhere in the paper or the PyTorch docs a claim that it reduces memory usage; they discuss speed but not memory. I suppose by the nature of the approach it probably should be more memory-efficient as well: since it runs much faster, it presumably allocates less?
Thank you!
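For reference, per the PyTorch docs the module is used roughly like this (the shapes and cutoffs below are made up for illustration, not taken from our configs):

```python
import torch
import torch.nn as nn

# Minimal sketch of torch.nn.AdaptiveLogSoftmaxWithLoss per the PyTorch docs;
# hidden size, vocab size and cutoffs are illustrative only.
hidden_size, vocab_size = 1024, 250_000
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[20_000, 100_000],  # frequent tokens get the full-size head, rarer ones smaller tails
    div_value=4.0,
)

hidden = torch.randn(32, hidden_size)          # flattened (batch * seq, hidden) activations
targets = torch.randint(0, vocab_size, (32,))
out = adaptive(hidden, targets)                # named tuple: (output, loss)
print(out.output.shape, out.loss.item())       # log-probs of the target tokens, mean NLL
```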
@jaketae so `self.specs` is how we build a PP version of GPT-2, i.e. DeepSpeed supports lists of blocks. Concerning the upcasting, I'd say it is necessary for the softmax computation over the vocabulary (which is ~50k for English and ~250k for multilingual). So I don't think removing the upcasting would be possible. TBD.
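In other words, the usual pattern is to run the model in fp16 and upcast only at the loss, roughly like this sketch (not our exact code path, which additionally goes through the vocab-parallel cross entropy):

```python
import torch
import torch.nn.functional as F

def lm_loss(logits_fp16: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # The model runs in fp16, but the softmax / cross entropy over a
    # 50k-250k vocabulary is accumulated in fp32: the log-sum-exp loses
    # too much precision (or overflows) in half precision.
    logits_fp32 = logits_fp16.float()  # this is the upcast that OOMs above
    return F.cross_entropy(
        logits_fp32.view(-1, logits_fp32.size(-1)),
        labels.view(-1),
    )
```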
Just to update here: what I'm trying at the moment is to make PP partitioning work not only at the transformer layers but also at the embed layers, but ZeRO-1 is getting in the way (it works w/o Z1). If we can split embed|transformer and not have the first and the last transformer layers bundled with the embed, then the problem will be solved.
You can read the debug log here: https://huggingface.slack.com/archives/C01NHER1JLS/p1635816476000400
It's really neat though: you just change from `type:transformer` to `type:embed|transformer` and it now considers embed layers equal to transformer layers when it decides how to partition the model. Just need to figure out how to make it work ;)
FYI, the OOM issue we're having with multilingual might be solved if we use MBS=1 instead of MBS=8 (TBD). If it does, I'd be in favor of dropping adaptive softmax. cc @TevenLeScao
We haven't added it yet, as I'm trying a different solution first as explained in my previous comment.
But you can also try other MBS between 8 and 1 :)
Update:
So I'm still waiting to hear from Shaden about the ZeRO-1 issue, but you can already use the following solution, except you have to turn ZeRO-1 off in the config file (just set it to 0):
```diff
diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py
index f0bb858..8af832e 100644
--- a/megatron/model/gpt_model.py
+++ b/megatron/model/gpt_model.py
@@ -280,4 +280,4 @@ class GPTModelPipe(PipelineModule,MegatronModule):
                          loss_fn=CrossEntropy,
                          topology=topo,
                          activation_checkpoint_interval=interval,
-                         partition_method='type:transformer')
\ No newline at end of file
+                         partition_method='type:embed|transformer')
```
This makes the embed layer on par with a transformer layer when it performs the partitioning, so the embed layer can now have a whole pipeline stage to itself and thus can be much larger.
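Conceptually, the `type:<regex>` partition method just marks every layer whose class name matches the regex and then splits those marked layers into roughly equal contiguous groups, one per pipeline stage. A simplified sketch of the idea (not DeepSpeed's actual implementation):

```python
import re

def partition_by_type(layer_class_names, pattern, num_stages):
    """Simplified illustration of 'type:<regex>' partitioning (not DeepSpeed's
    real code): only layers whose class name matches the regex count towards
    the balance; each stage gets a contiguous, near-equal share of them."""
    matching = [i for i, name in enumerate(layer_class_names)
                if re.search(pattern, name, re.IGNORECASE)]
    bounds = [0]
    for stage in range(1, num_stages):
        last_owned = matching[len(matching) * stage // num_stages - 1]
        bounds.append(last_owned + 1)
    bounds.append(len(layer_class_names))
    return bounds  # stage s spans layers [bounds[s], bounds[s+1])

layers = ["EmbeddingPipe"] + ["ParallelTransformerLayerPipe"] * 6 + ["EmbeddingPipe"]
# With 'transformer' only the 6 transformer layers count and the embeddings ride along;
# with 'embed|transformer' the two EmbeddingPipe layers count too and get their own share.
print(partition_by_type(layers, r"transformer", 4))        # [0, 2, 4, 5, 8]
print(partition_by_type(layers, r"embed|transformer", 4))  # [0, 2, 4, 6, 8]
```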
You can follow the bug report here: https://github.com/microsoft/DeepSpeed/issues/1522
Somehow I doubt this will get solved any time soon, as this is an incompatibility between the pipe and zero.
Please let me know if you'd still like me to look into the adaptive softmax.
p.s. additionally, if we figure out how to use BNB we save quite a nice chunk of memory there - 3/4 of the optimizer states (the jury is still out on how well it works - I'm actively working on that one with Tim). But if it works we can definitely turn ZeRO-1 off: with only 1/4 of the optimizer-state memory we don't really need to shard the optimizer states (which is what ZeRO-1 does) - unsharded BNB takes exactly the same amount of memory as fp32 states sharded over 4 GPUs (i.e. TP=4).
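To make the 3/4 figure concrete, here is a rough per-parameter accounting (my own arithmetic, counting only Adam's two moments and ignoring fp32 master weights and BNB's small per-block quantization constants):

```python
# Rough optimizer-state accounting at the ~104B-parameter scale (illustrative).
params = 104e9

fp32_adam_gib = params * (4 + 4) / 2**30   # fp32 exp_avg + exp_avg_sq: 8 bytes/param
bnb_8bit_gib  = params * (1 + 1) / 2**30   # 8-bit exp_avg + exp_avg_sq: 2 bytes/param

print(f"fp32 Adam states: {fp32_adam_gib:.0f} GiB")  # ~775 GiB
print(f"8-bit BNB states: {bnb_8bit_gib:.0f} GiB")   # ~194 GiB, i.e. 1/4 - same as fp32 states sharded over 4 GPUs
```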
tl;dr: OK, so I tested my proposal above on 32 nodes, and it's a success, conditional on either BNB or ZeRO-1 working.
I did the test with TP=4 PP=32 - the same as the 104B setup (but with a few layers less).
So with the above change and lowering the number of layers by 2, I get this new map:
```
stage=0 layers=4
     0: _to_float16
     1: EmbeddingPipe
     2: <lambda>
     3: ParallelTransformerLayerPipe
stage=1 layers=2
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
[...]
stage=30 layers=2
    62: ParallelTransformerLayerPipe
    63: ParallelTransformerLayerPipe
stage=31 layers=5
    64: ParallelTransformerLayerPipe
    65: <lambda>
    66: MixedFusedLayerNorm
    67: EmbeddingPipe
    68: float16_to_fp32
  loss: CrossEntropy
```
Note, we had to chop off 2 layers!
So now `EmbeddingPipe` and `ParallelTransformerLayerPipe` are on equal ground. The former is about 2x larger than the latter, so the first and last stages, where an `EmbeddingPipe` shares a stage with a `ParallelTransformerLayerPipe`, are about 1.5x larger than the stages with 2 `ParallelTransformerLayerPipe` layers.
If need be we could further re-tune the partitioning to give `EmbeddingPipe` more space - actually we will most likely want to do that to equalize the weights and utilize the GPUs better. So: a whole stage for `EmbeddingPipe` (twice), and the rest is pairs of `ParallelTransformerLayerPipe`.
Memory usage of the first GPU is only 24GB now and the middle ones only 18GB, so it's not very balanced, but as you can see there is plenty of free GPU memory to go around.
We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a `partition_method` like `type:embed:2|transformer:1` - or something along those lines - so that the embed layers get a 2x partitioning weight, get their own stage, and all stages become more balanced.
I created an issue for it: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/186
And we will then need to chop off 4 layers in the config, so instead of 64 it'll be 60.
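A hypothetical parser for such a weighted spec could look like the sketch below; the `type:embed:2|transformer:1` syntax is only the proposal from that issue, not something DeepSpeed supports today:

```python
import re

def parse_weighted_types(spec):
    """Hypothetical parser for a weighted spec like 'type:embed:2|transformer:1':
    each alternative may carry an optional integer weight (default 1).
    A sketch of the proposal only, not existing DeepSpeed behaviour."""
    assert spec.startswith("type:")
    weights = {}
    for part in spec[len("type:"):].split("|"):
        name, _, weight = part.partition(":")
        weights[name] = int(weight) if weight else 1
    return weights

def layer_weights(layer_class_names, spec):
    # Each layer contributes the weight of the first pattern its class name
    # matches, so EmbeddingPipe counts twice as much as a transformer layer
    # below and the balancer would tend to give it its own stage.
    weights = parse_weighted_types(spec)
    out = []
    for name in layer_class_names:
        w = 0
        for pattern, weight in weights.items():
            if re.search(pattern, name, re.IGNORECASE):
                w = weight
                break
        out.append(w)
    return out

layers = ["EmbeddingPipe"] + ["ParallelTransformerLayerPipe"] * 4 + ["EmbeddingPipe"]
print(layer_weights(layers, "type:embed:2|transformer:1"))  # [2, 1, 1, 1, 1, 2]
```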
Now, because of the bug in Z1 https://github.com/microsoft/DeepSpeed/issues/1522 we lose the nice sharding of optimizer states over 4 GPUs (TP=4), which is a huge loss. But by adding `--use-bnb-optimizer` we use only 1/4 of the memory consumed by the optimizer states, so we end up with exactly the same memory usage. If ZeRO-1 gets fixed, there will be further savings.
We are in the process of experimenting with BNB to verify that it produces results similar to normal Adam and thus can be used for our training.
I will update this shortly with more details and the script.
Here is the full slurm script:
Note that it requires the application of:
```
git cherry-pick 2f03484
git cherry-pick dab2064
```
If BNB doesn't work, then we have to get this DeepSpeed issue fixed: https://github.com/microsoft/DeepSpeed/issues/1522 - and then it too will work.
Actually, `partition_method='parameters'`:
```diff
diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py
index f0bb858..3a3a9db 100644
--- a/megatron/model/gpt_model.py
+++ b/megatron/model/gpt_model.py
@@ -280,4 +280,4 @@ class GPTModelPipe(PipelineModule,MegatronModule):
                          loss_fn=CrossEntropy,
                          topology=topo,
                          activation_checkpoint_interval=interval,
-                         partition_method='type:transformer')
+                         partition_method='parameters')
```
gives us a very similar partitioning to `partition_method='type:embed|transformer'`, but it looks less balanced - one of the GPUs on node 0 consumes about 4GB more than the others, which is not good:
```
| 0 N/A N/A 390848 C ...a/cutting-edge/bin/python 24071MiB |
| 1 N/A N/A 390849 C ...a/cutting-edge/bin/python 24799MiB |
| 2 N/A N/A 390850 C ...a/cutting-edge/bin/python 23999MiB |
| 3 N/A N/A 390851 C ...a/cutting-edge/bin/python 28159MiB |
```
So manual partitioning appears to be more beneficial.
p.s. partitioning by `parameters` doesn't take into account non-parameter-related memory allocations, and that's probably why we get the weird imbalance.
Closing, as we reduced the memory footprint without having to implement adaptive softmax.
This: https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveLogSoftmaxWithLoss.html
Why: reduce memory requirements with a large vocabulary. Relevant for the multilingual experiments.