python tools/checkpoint_util.py \
--target_tensor_parallel_size 2 \
--target_pipeline_parallel_size 1 \
--load_dir /path/to/megatron/weights/ \
--save_dir /path/to/sharded/weights/ \
--model_type llama2 \
--true_vocab_size 32000 \
--bf16
You can check the online docs.
That's correct. More info in the getting started guide and the FAQs. Let us know if you have further questions.
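If it helps, here is a quick sanity check of the sharded output. This is only a sketch, assuming the standard Megatron checkpoint layout (a latest_checkpointed_iteration.txt file plus one mp_rank_0X directory per tensor-parallel rank); adjust the path to your setup:

import os

# Hypothetical path; same directory as --save_dir above.
save_dir = "/path/to/sharded/weights/"
with open(os.path.join(save_dir, "latest_checkpointed_iteration.txt")) as f:
    it = f.read().strip()
# Converted checkpoints are typically saved under "release" or "iter_XXXXXXX".
subdir = "release" if it == "release" else f"iter_{int(it):07d}"
ckpt_dir = os.path.join(save_dir, subdir)
# With --target_tensor_parallel_size 2 and PP=1 you should see mp_rank_00 and mp_rank_01.
print(sorted(d for d in os.listdir(ckpt_dir) if d.startswith("mp_rank_")))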
@dumpmemory @AleHD Hi, it seems this conversion is not correct: I checked the logits for the same input and the results did not match.
can you provide a self-contained example? we don't know what model you used and in which setup (BTW the example here is for PP=1 not 2)
@martinjaggi
Conversion:
python tools/checkpoint_util.py \
    --target_tensor_parallel_size 8 \
    --target_pipeline_parallel_size 1 \
    --load_dir ./model/llama2/13b-megatron \
    --save_dir ./model/llama2/13b-megatron-tp8-pp1 \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16
The output of tp1:
export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6000 \
    output_megatron.py \
    --fp16 \
    --load ./model/llama2/13b-megatron \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file ./model/llama2/13b-megatron/tokenizer.model \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --micro_batch_size 1 \
    --variable_seq_lengths \
    --use_checkpoint_args \
    --use_rms_norm \
    --glu_activation swiglu \
    --no_tie_embed_logits \
    --no_new_tokens \
    --layernorm_epsilon 1e-5 \
    --hidden_dropout 0.0 \
    --attention_dropout 0.0 \
    --no_bias_gelu_fusion
The output of tp8:
export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 8 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6000 \
    output_megatron.py \
    --fp16 \
    --load ./model/llama2/13b-megatron-tp8-pp1 \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file ./model/llama2/13b-megatron-tp8-pp1/tokenizer.model \
    --tensor_model_parallel_size 8 \
    --pipeline_model_parallel_size 1 \
    --micro_batch_size 1 \
    --variable_seq_lengths \
    --use_checkpoint_args \
    --use_rms_norm \
    --glu_activation swiglu \
    --no_tie_embed_logits \
    --no_new_tokens \
    --layernorm_epsilon 1e-5 \
    --hidden_dropout 0.0 \
    --attention_dropout 0.0 \
    --no_bias_gelu_fusion
The output_megatron.py:
#!/usr/bin/python3
from functools import partial
import torch
from megatron import get_args
from megatron.initialize import initialize_megatron
from megatron.training import get_model
from megatron.model import GPTModel, ModelType, LlamaModel, FalconModel
from megatron import get_tokenizer
from megatron.core import tensor_parallel
from megatron.utils import get_ltor_masks_and_position_ids
from megatron import print_rank_0
from megatron.checkpointing import load_checkpoint
def model_provider(pre_process=True, post_process=True):
    print_rank_0("Building model ...")
    args = get_args()
    if args.model_name == "gpt":
        cls = GPTModel
    elif args.model_name == "falcon":
        cls = FalconModel
    elif args.model_name in {"llama", "llama2", "codellama"}:
        cls = partial(LlamaModel, version=1 if args.model_name == "llama" else 2)
    else:
        raise KeyError(f"Unknown model {args.model_name}")
    if isinstance(args.model_type, ModelType):
        model_type = args.model_type
    elif args.model_type == "encoder_or_decoder":
        model_type = ModelType.encoder_or_decoder
    elif args.model_type == "encoder_and_decoder":
        model_type = ModelType.encoder_and_decoder
    else:
        raise KeyError(f"Unsupported model_type {args.model_type}")
    model = cls(
        num_tokentypes=0,
        parallel_output=True,
        pre_process=pre_process,
        post_process=post_process,
        model_type=model_type
    )
    return model


def get_batch():
    args = get_args()
    tokenizer = get_tokenizer()
    text = "你好,加油!"  # "Hello, keep it up!"
    data = {"text": torch.tensor([[tokenizer.bos] + tokenizer.tokenize(text) + [tokenizer.eod]],
                                 dtype=torch.int64)}
    keys = ['text']
    data_b = tensor_parallel.broadcast_data(keys, data, torch.int64)
    tokens = data_b['text'].long().contiguous()
    attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
        tokens,
        tokenizer.eod,
        args.reset_position_ids,
        args.reset_attention_mask,
        args.eod_mask_loss)
    return tokens, position_ids, attention_mask


def extra_args(parser):
    """Text generation arguments."""
    group = parser.add_argument_group(title='validation set')
    group.add_argument("--model_name",
                       choices={"gpt", "llama", "falcon", "llama2", "codellama"},
                       default="gpt")
    group.add_argument("--model_type",
                       choices={"encoder_or_decoder", "encoder_and_decoder"},
                       default="encoder_or_decoder")
    return parser


args_defaults = {'tokenizer_type': 'SentencePieceTokenizer',
                 'no_load_rng': True,
                 'no_load_optim': True}
initialize_megatron(extra_args, args_defaults)
args = get_args()
model_type = ModelType.encoder_or_decoder
model = get_model(model_provider, model_type, wrap_with_ddp=False, args=args)
if args.load is not None:
    _ = load_checkpoint(model, None, None)
model = model[0]
model.eval()
inputs = get_batch()
logits = model(input_ids=inputs[0], position_ids=inputs[1], attention_mask=inputs[2])
print_rank_0(inputs)
print_rank_0(logits)
@martinjaggi Hi, could you reproduce this problem?
Could you elaborate on the error you encountered? I think you forgot to paste the output for both cases?
I checked the logits for the same input and the results did not match. You can reproduce the problem using the code above.
When using tp=8 the logits are split into 8 tensors; each rank processes one eighth of the vocabulary distribution per sample, so each rank's logits have shape (batch_size, seq_len, vocab_size/8). That's why the logits do not match when you only print the result on the first rank. Try the following modification to your script to verify each eighth of the output:
# output_megatron.py
# ...
# right after model.eval()
inputs = get_batch()
_, logits = model(input_ids=inputs[0], position_ids=inputs[1], attention_mask=inputs[2])
vocabs_per_node = args.padded_vocab_size // 8
if args.tensor_model_parallel_size == 1:
    print("Logits size:", logits.size())
    for i in range(8):
        a = i * vocabs_per_node
        b = (i + 1) * vocabs_per_node
        print(f"Logits[{a}:{b}] (rank {i}):")
        print(logits[0, :, a:b])
else:
    rank = torch.distributed.get_rank()
    a = rank * vocabs_per_node
    b = (rank + 1) * vocabs_per_node
    s = (f"\nResults for node: {rank}:\n"
         f"Logits[{a}:{b}]:\n"
         f"{logits[0, :, :]}")
    print(s)
When running the script both with tp=1 and tp=8, the prints are almost identical.
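If you want a numerical comparison rather than eyeballing the prints, here is a minimal sketch of gathering the shards. It assumes the tp=8 run above (pp=1, no data parallelism), so the default process group coincides with the tensor-parallel group; with other layouts you would gather over the tensor-parallel group instead. tp1_logits.pt is a hypothetical reference saved from the tp=1 run with torch.save(logits, "tp1_logits.pt"):

# After computing `logits` in the tp=8 run:
world_size = torch.distributed.get_world_size()
shards = [torch.empty_like(logits) for _ in range(world_size)]
torch.distributed.all_gather(shards, logits.contiguous())
# Rank i holds vocab slice [i*V/8, (i+1)*V/8), so concatenating along the
# last dimension reconstructs the full distribution.
full_logits = torch.cat(shards, dim=-1)  # (batch, seq_len, padded_vocab_size)

if torch.distributed.get_rank() == 0:
    ref = torch.load("tp1_logits.pt").to(full_logits.device)
    # fp16 kernels differ slightly across shardings, so compare with a tolerance.
    print("max abs diff:", (full_logits - ref).abs().max().item())
    print("allclose:", torch.allclose(full_logits, ref, atol=1e-2, rtol=1e-2))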
@AleHD @dumpmemory I used this Python script for the llama2-7B weights and encountered an error; it seems the checkpoints load successfully but the conversion fails afterwards. I set TP=1 and PP=2. Could anyone help me with this error?
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending embeddings
Detected CUDA files, patching ldflags
Emitting ninja build file /mpt/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
[9c77a8dc6f05:3439 :0:3755] Caught signal 7 (Bus error: nonexistent physical address)
[9c77a8dc6f05:3439 :1:3786] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972560] [9c77a8dc6f05:3439 :3] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :7:3780] Caught signal 7 (Bus error: nonexistent physical address)
[9c77a8dc6f05:3439 :3:3773] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972560] [9c77a8dc6f05:3439 :2] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :2:3774] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972563] [9c77a8dc6f05:3439 :5] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :5:3798] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972565] [9c77a8dc6f05:3439 :6] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :6:3844] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972563] [9c77a8dc6f05:3439 :4] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :4:3831] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :10] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :10:3732] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :9] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :9:3742] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :11] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :11:3788] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :8] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :8:3768] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :14] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :14:3834] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :15] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :15:3811] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972573] [9c77a8dc6f05:3439 :16] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :16:3853] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :12] spinlock.c:29 UCX WARN ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :12:3822] Caught signal 7 (Bus error: nonexistent physical address)
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mpt/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
model_shard.sh: line 8: 3439 Bus error (core dumped) python /mpt/Megatron-LLM/tools/checkpoint_util.py --target_tensor_parallel_size 1 --target_pipeline_parallel_size 2 --load_dir ./llama2-7b/megatron/weights/ --save_dir ./llama2-7b/megatron/weights_pp2/ --model_type llama2 --true_vocab_size 32000 --bf16
@yuqie how much GPU memory do you have?
@kylematoba I use 80G A800
@yuqie my understanding is that an A800 is pretty similar to an A100, so per https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements, if you have 2x 80GB A800s I expect you to be able to load this. Can you try the docker args I suggest here: https://github.com/epfLLM/Megatron-LLM/issues/70#issuecomment-1734140879?
In the interest of tidiness, I'll close this now. @wangyong1122 or @yuqie, please reopen or raise another issue if you are still having problems.
Can anyone help me with this? I'm having a bit of trouble and didn't find a similar problem in the existing issues.
Using hf_to_megatron.py generates a weights file with TP=1, PP=1; how can I use it in a TP=2, PP=2 scenario? I noticed that the embedding parameters get split in two with TP=2.
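For intuition on why the embedding splits in two, here is a minimal, self-contained sketch (plain PyTorch, not the repo's actual conversion code; shapes are hypothetical) of vocab-dimension sharding for TP=2. Pipeline parallelism, by contrast, assigns whole layers to stages rather than splitting tensors:

import torch

vocab_size, hidden = 32000, 4096  # illustrative llama2-like shapes
full_embedding = torch.randn(vocab_size, hidden)  # TP=1 word embedding weight

tp = 2
# Each TP rank keeps a contiguous slice of the vocabulary rows.
shards = torch.chunk(full_embedding, tp, dim=0)
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {tuple(shard.shape)}")  # (16000, 4096) each

# The split is lossless; concatenating the shards recovers the original.
assert torch.equal(torch.cat(shards, dim=0), full_embedding)

In practice you don't reshard by hand: run tools/checkpoint_util.py on the TP=1/PP=1 output of hf_to_megatron.py with --target_tensor_parallel_size 2 --target_pipeline_parallel_size 2, as in the conversion commands earlier in this thread.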