Closed: philschmid closed this 9 months ago
Thank you for your interest in our project.
The Apex compilation warnings are expected; I have seen them since the beginning.
The `warnings.warn("Llama is not intended to use dropout")` warnings are also fine, though we should probably turn them off.
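One way to turn them off (my sketch, not code from the repo; the message text is copied from the log above and the exact match may need adjusting) is a targeted warnings filter at startup:

```python
import warnings

# Suppress only the expected dropout warning. The message argument is
# matched as a regex against the start of the warning text, so other
# warnings are still shown.
warnings.filterwarnings("ignore", message="Llama is not intended to use dropout")

warnings.warn("Llama is not intended to use dropout")  # now suppressed
```

Filtering by message rather than calling `warnings.simplefilter("ignore")` keeps genuinely useful warnings visible.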
I can replicate your problem when following the docs as written (also on a single node with 8x A100 80GB).
When I invoke docker with the additional arguments

```bash
--shm-size=128gb \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--memory 480G
```

however, it runs as expected. Please try something like this.
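For concreteness, a full invocation might look like the following (a sketch: the image name and mount path are placeholders, not taken from this thread; only the four memory-related flags come from the comment above):

```shell
# Hypothetical full invocation; <megatron-llm-image> and the mount path
# are placeholders — substitute your own image and checkpoint directory.
docker run --gpus all -it --rm \
  --shm-size=128gb \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --memory 480G \
  -v /path/to/checkpoints:/epfllm/checkpoints \
  <megatron-llm-image> \
  bash
```

The `--shm-size` flag matters because PyTorch dataloaders and NCCL use `/dev/shm` for inter-process transfers; Docker's 64 MB default is far too small for sharding multi-GB weights, which is a classic cause of `Bus error (core dumped)`.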
@AleHD please can you add this, or at least a mention that nontrivial memory is needed to shard the weights, to the "Getting Started" section? Thanks!
Thank you @kylematoba, that solved it for me. I managed to shard the model but ran into a different issue during training.
```
Traceback (most recent call last):
  File "/epfllm/./Megatron-LLM/finetune.py", line 249, in <module>
    pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 138, in pretrain
    iteration = _train(args,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 678, in _train
    train_step(forward_step_func,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 411, in train_step
    losses_reduced = forward_backward_func(
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/epfllm/./Megatron-LLM/finetune.py", line 213, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/module.py", line 186, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in fallback_function
    @torch.jit.script
    def bias_gelu(bias, y):
        x = bias + y
            ~~~~~~~~ <--- HERE
        return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
RuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'
```
Hi, I'm guessing that it's an OOM that's obfuscated by the JIT-ing. In cases like this I usually recommend commenting out the `@torch.jit.script` decorator to get a more helpful stack trace.
As far as I can see, you've not reported what sort of model you are trying to train. Did you look at https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements? Only the smallest models can fit into 8x A100 80GB.
Let me try commenting out the scripting.
I am following the Getting Started guide, so it's Llama 2 7B, and I have 8x A100 80GB. Here is my command:
```bash
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 500 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --load ${MODEL_PATH}_sharded \
    --save ${MODEL_PATH}_sharded \
    --tensorboard_dir ${MODEL_PATH}_sharded \
    --data_path ${DATASET_PATH}/megatron_text_document \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=${MODEL_PATH}/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --micro_batch_size 5 \
    --global_batch_size 1000 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS
```
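As a sanity check on the parallelism arithmetic these flags imply (my own back-of-the-envelope, not from the thread): with 8 GPUs, tensor parallel 4 and pipeline parallel 1 leave a data-parallel size of 2, so a global batch of 1000 with micro batch 5 means 100 gradient-accumulation steps per iteration.

```python
# Sanity-check the batch/parallelism arithmetic implied by the command above.
world_size = 8      # --nproc_per_node 8
tp = 4              # --tensor_model_parallel_size
pp = 1              # --pipeline_model_parallel_size
micro_batch = 5     # --micro_batch_size
global_batch = 1000  # --global_batch_size

data_parallel = world_size // (tp * pp)
grad_accum_steps = global_batch // (micro_batch * data_parallel)

# The global batch must factor exactly into micro batches across replicas.
assert global_batch == micro_batch * data_parallel * grad_accum_steps
print(data_parallel, grad_accum_steps)  # → 2 100
```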
The error is not really more helpful...
```
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in bias_gelu
    x = bias + y
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
```
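The unscripted traceback does point at the root cause: `bias` arrives as `None`, presumably because Llama's linear layers are bias-free, while the fused bias-GELU path assumes a bias tensor. The failure is easy to reproduce in plain Python with the same tanh approximation (a sketch; scalar math stands in for tensors):

```python
import math

def bias_gelu(bias, y):
    # Same tanh GELU approximation as megatron/model/fused_bias_gelu.py,
    # written for scalars instead of tensors.
    x = bias + y
    return x * 0.5 * (1.0 + math.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))

print(round(bias_gelu(0.0, 1.0), 4))  # → 0.8412, i.e. GELU(1.0)

try:
    bias_gelu(None, 1.0)  # bias is None when the layer has no bias term
except TypeError as e:
    print(e)  # unsupported operand type(s) for +: 'NoneType' and 'float'
```

With `@torch.jit.script` applied, the same `None` surfaces instead as the opaque "Expected a proper Tensor but got None" RuntimeError from the first traceback.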
Should the Getting Started guide (https://epfllm.github.io/Megatron-LLM/guide/getting_started.html) work end to end?
Hi, thanks for that.
I'm pretty sure the problem is something we overlooked early on: runs without `--no_bias_gelu_fusion` don't work. Please add that argument (as is done in the docs) and let me know how you get on.
I'll make sure this bug gets investigated in any case.
Adding `--no_bias_gelu_fusion` solved the issue and it is training. Thank you for your help! I will play with it more and hopefully publish a blog post on it!
Thanks @philschmid. I'll close this and we'll fix the bug I mention above shortly.
First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration. I was very curious and wanted to give it a try by following the Getting Started guide in the documentation, but I ran into an error during the "Model Sharding" step, resulting in a `Bus error (core dumped)`. I am running on a single node with 8x A100 80GB and 1TB of memory. I followed the exact same steps in the guide and used the container.
Below is the full error stack in case it's helpful; it includes quite a lot of odd C errors/warnings at the beginning. I installed the package with
in the container.
Error Stack

```
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data
```