epfLLM / Megatron-LLM

distributed trainer for LLMs

Loading weights from HF conversion with different TP/PP settings #63

Closed binwang777 closed 9 months ago

binwang777 commented 10 months ago

Can anyone help me with this? I'm having a bit of trouble with it, and I didn't find a similar problem in the existing issues.

Using hf_to_megatron.py generates a weights file with TP=1, PP=1; how can I use it in a TP=2, PP=2 scenario? I noticed that the embedding parameters get split in two with TP=2.

dumpmemory commented 10 months ago
python tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 1 \
    --load_dir /path/to/megatron/weights/ \
    --save_dir /path/to/sharded/weights/ \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16

You can check the online doc.

AleHD commented 10 months ago

That's correct. More info in the getting started guide and the FAQs. Let us know if you have further questions.
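
For the TP=2, PP=2 setup from the original question, presumably the same tool is used with both target sizes adjusted; a sketch derived from the command above, not separately verified:

python tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 2 \
    --load_dir /path/to/megatron/weights/ \
    --save_dir /path/to/sharded/weights/ \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16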

wangyong1122 commented 10 months ago
python tools/checkpoint_util.py \
  --target_tensor_parallel_size 2 \
  --target_pipeline_parallel_size 1 \
  --load_dir /path/to/megatron/weights/ \
  --save_dir /path/to/sharded/weights/ \
  --model_type llama2 \
  --true_vocab_size 32000 \
  --bf16

You can check the online doc.

@dumpmemory @AleHD Hi, it seems that this conversion is not correct: I checked the logits for the same input and the results did not match.

martinjaggi commented 10 months ago

Can you provide a self-contained example? We don't know what model you used or in which setup. (BTW, the example here is for PP=1, not 2.)

wangyong1122 commented 10 months ago

Can you provide a self-contained example? We don't know what model you used or in which setup. (BTW, the example here is for PP=1, not 2.)

@martinjaggi

Conversion:

python tools/checkpoint_util.py \
    --target_tensor_parallel_size 8 \
    --target_pipeline_parallel_size 1 \
    --load_dir ./model/llama2/13b-megatron \
    --save_dir ./model/llama2/13b-megatron-tp8-pp1 \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16

The output of tp1:

export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 1 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6000 output_megatron.py \
    --fp16 \
    --load ./model/llama2/13b-megatron \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file ./model/llama2/13b-megatron/tokenizer.model \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --micro_batch_size 1 \
    --variable_seq_lengths \
    --use_checkpoint_args \
    --use_rms_norm \
    --glu_activation swiglu \
    --no_tie_embed_logits \
    --no_new_tokens \
    --layernorm_epsilon 1e-5 \
    --hidden_dropout 0.0 \
    --attention_dropout 0.0 \
    --no_bias_gelu_fusion

The output of tp8:

export CUDA_DEVICE_MAX_CONNECTIONS=1
torchrun --nproc_per_node 8 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr localhost \
    --master_port 6000 output_megatron.py \
    --fp16 \
    --load ./model/llama2/13b-megatron-tp8-pp1 \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file ./model/llama2/13b-megatron-tp8-pp1/tokenizer.model \
    --tensor_model_parallel_size 8 \
    --pipeline_model_parallel_size 1 \
    --micro_batch_size 1 \
    --variable_seq_lengths \
    --use_checkpoint_args \
    --use_rms_norm \
    --glu_activation swiglu \
    --no_tie_embed_logits \
    --no_new_tokens \
    --layernorm_epsilon 1e-5 \
    --hidden_dropout 0.0 \
    --attention_dropout 0.0 \
    --no_bias_gelu_fusion

output_megatron.py:

#!/usr/bin/python3

from functools import partial

import torch
from megatron import get_args
from megatron.initialize import initialize_megatron
from megatron.training import get_model
from megatron.model import GPTModel, ModelType, LlamaModel, FalconModel

from megatron import get_tokenizer
from megatron.core import tensor_parallel
from megatron.utils import get_ltor_masks_and_position_ids
from megatron import print_rank_0
from megatron.checkpointing import load_checkpoint

def model_provider(pre_process=True, post_process=True):
    print_rank_0("Building model ...")

    args = get_args()
    if args.model_name == "gpt":
        cls = GPTModel
    elif args.model_name == "falcon":
        cls = FalconModel
    elif args.model_name in {"llama", "llama2", "codellama"}:
        cls = partial(LlamaModel, version=1 if args.model_name == "llama" else 2)
    else:
        raise KeyError(f"Unkown model")

    if isinstance(args.model_type, ModelType):
        model_type = args.model_type
    elif args.model_type == "encoder_or_decoder":
        model_type = ModelType.encoder_or_decoder
    elif args.model_type == "encoder_and_decoder":
        model_type = ModelType.encoder_and_decoder
    else:
        raise KeyError(f"Unsupported model_type {args.model_type}")

    model = cls(
        num_tokentypes=0,
        parallel_output=True,
        pre_process=pre_process,
        post_process=post_process,
        model_type=model_type
    )
    return model

def get_batch():
    args = get_args()
    tokenizer = get_tokenizer()

    text = "你好,加油!"  # Chinese sample input: "Hello, keep going!"
    data = {"text": torch.tensor([[tokenizer.bos] + tokenizer.tokenize(text) + [tokenizer.eod]], dtype=torch.int64)}
    keys = ['text']
    data_b = tensor_parallel.broadcast_data(keys, data, torch.int64)

    tokens = data_b['text'].long().contiguous()

    attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
        tokens,
        tokenizer.eod,
        args.reset_position_ids,
        args.reset_attention_mask,
        args.eod_mask_loss)

    return tokens, position_ids, attention_mask

def extra_args(parser):
    """Text generation arguments."""
    group = parser.add_argument_group(title='validation set')
    group.add_argument("--model_name",
                       choices={"gpt", "llama", "falcon", "llama2", "codellama"},
                       default="gpt")
    group.add_argument("--model_type", choices={"encoder_or_decoder", "encoder_and_decoder"},
                       default="encoder_or_decoder")
    return parser

args_defaults = {'tokenizer_type': 'SentencePieceTokenizer',
                 'no_load_rng': True,
                 'no_load_optim': True}

initialize_megatron(extra_args, args_defaults)
args = get_args()

model_type = ModelType.encoder_or_decoder
model = get_model(model_provider, model_type, wrap_with_ddp=False, args=args)

if args.load is not None:
    _ = load_checkpoint(model, None, None)

model = model[0]
model.eval()

inputs = get_batch()
logits = model(input_ids=inputs[0], position_ids=inputs[1], attention_mask=inputs[2])
print_rank_0(inputs)
print_rank_0(logits)
wangyong1122 commented 10 months ago

Can you provide a self-contained example? We don't know what model you used or in which setup. (BTW, the example here is for PP=1, not 2.)

@martinjaggi Hi, could you reproduce this problem?

AleHD commented 10 months ago

Could you elaborate on the error you encountered? I think you forgot to paste the output for both cases.

wangyong1122 commented 10 months ago

Could you elaborate on the error you encountered? I think you forgot to paste the output for both cases.

I checked the logits for the same input and the results did not match. You can reproduce the problem with the code above.

AleHD commented 10 months ago

When using tp=8, the logits are split across 8 tensors: each rank processes one eighth of the vocabulary distribution per sample, so each rank's logits have shape (batch_size, seq_len, vocab_size/8). That is why the logits do not match when you only print the results of the first rank. Try the following modification to your script to verify each eighth of the output:

# output_megatron.py

# ...
# right after model.eval()

inputs = get_batch() 
_, logits = model(input_ids=inputs[0], position_ids=inputs[1], attention_mask=inputs[2])

vocabs_per_node = args.padded_vocab_size//8
if args.tensor_model_parallel_size == 1:
    print("Logits size:", logits.size()) 
    for i in range(8):  
        a = i*vocabs_per_node
        b = (i + 1)*vocabs_per_node
        print(f"Logits[{a}:{b}] (rank {i}):")
        print(logits[0, :, a:b])
else:
    rank = torch.distributed.get_rank()
    a = rank*vocabs_per_node
    b = (rank + 1)*vocabs_per_node
    s = (f"\nResults for node: {rank}:\n"
         f"Logits[{a}:{b}]:\n"
         f"{logits[0, :, :]}")
    print(s)

When running the script both with tp=1 and tp=8, the prints are almost identical.
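
As a further check, the shards can be gathered so that the full logits are directly comparable. A minimal sketch assuming the single-node tp=8, pp=1 run above, where the default process group coincides with the tensor-parallel group (hypothetical addition, not part of the original reply):

# Right after the forward pass `_, logits = model(...)`:
# each rank holds (batch, seq_len, vocab_size/8), so gather the shards and
# concatenate along the vocab dimension to rebuild the full logits.
world_size = torch.distributed.get_world_size()
shards = [torch.empty_like(logits) for _ in range(world_size)]
torch.distributed.all_gather(shards, logits.contiguous())
full_logits = torch.cat(shards, dim=-1)
print_rank_0(full_logits)  # should match the tp=1 logits up to fp16 rounding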

yuqie commented 10 months ago
python tools/checkpoint_util.py \
  --target_tensor_parallel_size 2 \
  --target_pipeline_parallel_size 1 \
  --load_dir /path/to/megatron/weights/ \
  --save_dir /path/to/sharded/weights/ \
  --model_type llama2 \
  --true_vocab_size 32000 \
  --bf16

You can check the online doc.

@AleHD @dumpmemory I used this Python script for the llama2-7B weights and encountered an error: it seems the checkpoints are loaded successfully and it fails afterwards. I set TP=1 and PP=2. Could anyone help me with this error?

 -------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending embeddings
Detected CUDA files, patching ldflags
Emitting ninja build file /mpt/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
[9c77a8dc6f05:3439 :0:3755] Caught signal 7 (Bus error: nonexistent physical address)
[9c77a8dc6f05:3439 :1:3786] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972560] [9c77a8dc6f05:3439 :3]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :7:3780] Caught signal 7 (Bus error: nonexistent physical address)
[9c77a8dc6f05:3439 :3:3773] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972560] [9c77a8dc6f05:3439 :2]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :2:3774] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972563] [9c77a8dc6f05:3439 :5]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :5:3798] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972565] [9c77a8dc6f05:3439 :6]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :6:3844] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972563] [9c77a8dc6f05:3439 :4]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :4:3831] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :10]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :10:3732] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :9]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :9:3742] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :11]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :11:3788] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972568] [9c77a8dc6f05:3439 :8]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :8:3768] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :14]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :14:3834] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :15]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :15:3811] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972573] [9c77a8dc6f05:3439 :16]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :16:3853] Caught signal 7 (Bus error: nonexistent physical address)
[1695633523.972571] [9c77a8dc6f05:3439 :12]        spinlock.c:29   UCX  WARN  ucs_recursive_spinlock_destroy() failed: busy
[9c77a8dc6f05:3439 :12:3822] Caught signal 7 (Bus error: nonexistent physical address)
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /mpt/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
model_shard.sh: line 8:  3439 Bus error               (core dumped) python /mpt/Megatron-LLM/tools/checkpoint_util.py --target_tensor_parallel_size 1 --target_pipeline_parallel_size 2 --load_dir ./llama2-7b/megatron/weights/ --save_dir ./llama2-7b/megatron/weights_pp2/ --model_type llama2 --true_vocab_size 32000 --bf16
kylematoba commented 10 months ago

@yuqie how much GPU memory do you have?

yuqie commented 10 months ago

@kylematoba I use an 80GB A800.

kylematoba commented 10 months ago

@yuqie My understanding is that an A800 is pretty similar to an A100, so per https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements, if you have 2x 80GB A800s I would expect you to be able to load this. Can you try the docker args I suggest in https://github.com/epfLLM/Megatron-LLM/issues/70#issuecomment-1734140879?
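
For reference, a "Bus error: nonexistent physical address" during conversion inside a container is often a /dev/shm limit rather than GPU memory. The exact flags are in the linked comment; a commonly used mitigation (image name is a placeholder) looks like:

docker run --gpus all --ipc=host --shm-size=64g <your-image>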

kylematoba commented 9 months ago

In the interest of tidiness, closing this now. @wangyong1122 or @yuqie, please reopen or raise another issue if you are still having problems.