SparkJiao / llama-pipeline-parallel

A prototype repo for hybrid pipeline-parallel and distributed data-parallel training, with comments on the core code snippets. Feel free to copy the code and open discussions about any problems you encounter.

How can I use your code to load Llama-2? #6

Closed fmh1art closed 5 months ago

fmh1art commented 7 months ago

Hi, thanks for the great work!

I want to use your code to build a PipelineModule object from Llama-2. Here is my code:

import transformers
from deepspeed.pipe import PipelineModule

from megatron import mpu  # gpt-neox model-parallel utilities
import llama_pipeline_parallel.models.llama_ds_mp_wrap as pipeline_model_wrapper


def load_model(neox_args):
    # Build the layer specs from the HF config and wrap them in a DeepSpeed PipelineModule.
    config = transformers.AutoConfig.from_pretrained('XXX/pyllama2/7B')
    layers = pipeline_model_wrapper.get_layers_from_config(config)
    model_pipe = PipelineModule(layers=layers,
                                topology=mpu.get_topology(),
                                loss_fn=pipeline_model_wrapper.loss_fn,
                                )
    # Load only the module weights from the converted checkpoint (no optimizer/scheduler state).
    model_pipe.load_checkpoint('XXX/pyllama2/7B/ckp', load_module_only=True,
                               load_optimizer_states=False, load_lr_scheduler_states=False)
    print(neox_args)  # debug: inspect the parsed arguments
    return model_pipe

I have made sure to convert the HF-format weights to a DeepSpeed checkpoint with convert2ckpt.py, but the following error occurs:

[2023-12-30 13:49:38,182] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory `/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/data'
Traceback (most recent call last):
  File "train.py", line 30, in <module>
    main()
  File "train.py", line 27, in main
    pretrain(neox_args=neox_args)
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 194, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 657, in setup_model_and_optimizer
    model = load_model(neox_args=neox_args)
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 644, in load_model
    model_pipe = PipelineModule(layers=layers,
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 188, in __init__
    self._partition_layers(method=partition_method)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 384, in _partition_layers
    param_counts = self._count_layer_params()
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 290, in _count_layer_params
    l = layer.build()
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 74, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/llama_pipeline_parallel/models/llama_ds_mp_wrap.py", line 142, in __init__
    super().__init__(config)
TypeError: __init__() missing 1 required positional argument: 'layer_idx'

(The same traceback is printed on every rank.)

How can I fix it?

fmh1art commented 7 months ago

I have also tried the first approach in your README.md, and I succeeded in building a PipelineModule object from Llama-2. However, when I try to build an Adam optimizer from its parameters, I run into another error.

This is my code for building the optimizer:

def get_params_for_weight_decay_optimization(module, neox_args):
    """Divide params into with-weight-decay and without-weight-decay groups.
    Layernorms and biases will have no weight decay but the rest will.
    """
    print('-----------------------------------------print(module and modules())------------------------------------------------')
    print(module)
    print(module.modules())
    print('---------------------------------------------------------------------------------------------------')
    weight_decay_params = {"params": []}
    no_weight_decay_params = {"params": [], "weight_decay": 0.0}
    for module_ in module.modules():
        if any(
            [
                isinstance(module_, LayerNorm),
                isinstance(module_, RMSNorm),
                isinstance(module_, ScaleNorm),
            ]
        ) or (
            neox_args.weight_decay == 0.0
        ):  # also include all parameters here if no weight decay is being done
            no_weight_decay_params["params"].extend(
                [p for p in list(module_._parameters.values()) if p is not None]
            )
            print('-------------------------enter if----------------------------')
            print(module_)
        else:
            print('-------------------------enter else----------------------------')
            print(module_)
            weight_decay_params["params"].extend(
                [
                    p
                    for n, p in list(module_._parameters.items())
                    if p is not None and n != "bias"
                ]
            )
            no_weight_decay_params["params"].extend(
                [
                    p
                    for n, p in list(module_._parameters.items())
                    if p is not None and n == "bias"
                ]
            )
    if neox_args.weight_decay == 0.0:
        # only return a single param group
        # with onebitadam, we want to minimize the calls to compressed_allreduce. Every param group calls it once.
        # to avoid this, only use a single param group when weight decay is off.
        return [no_weight_decay_params]
    return weight_decay_params, no_weight_decay_params

def get_optimizer(model, neox_args):
    """Set up the optimizer."""
    if neox_args.no_load_optim:
        return None, None

    if neox_args.optimizer is None:
        print_rank_0(
            f"ERROR: Optimizer is None. Either set the optimizer dict in your config (if training) or set no_load_optim in your config (if inference)"
        )
        exit()
    # Build parameter groups (weight decay and non-decay).
    param_groups = get_params_for_weight_decay_optimization(model, neox_args)
    print_rank_0(
        f'Configuring Optimizer type: {neox_args.optimizer_type} with params: {neox_args.optimizer["params"]}'
    )

    # Add model parallel attribute if it is not set.
    for param_group in param_groups:
        for param in param_group["params"]:
            if not hasattr(param, "model_parallel"):
                param.model_parallel = False

    # Filter out params that don't require a grad (for soft prompt tuning, etc.)
    _param_groups = []
    for param_group in param_groups:
        trainable_params = [p for p in param_group["params"] if p.requires_grad]
        param_group["params"] = trainable_params
        _param_groups.append(param_group)
    param_groups = _param_groups

    # If we're using mup, then the optimizer must be adam or sgd
    assert not neox_args.use_mup or (
        neox_args.optimizer_type.lower() == "adam"
        or neox_args.optimizer_type.lower() == "sgd"
    ), f"If use_mup == True, you must specify either the adam or sgd optimizers. You passed: {neox_args.optimizer_type.lower()}"

    if neox_args.optimizer_type.lower() in ["cpu_adam", "cpu_torch_adam"]:
        ...
    elif neox_args.optimizer_type.lower() == "adam":
        # Use Adam
        if neox_args.use_mup:
            try:
                from mup import MuAdam

                adam_optimizer = MuAdam
            except ModuleNotFoundError:
                print("Please install mup https://github.com/microsoft/mup")
                raise Exception
        else:
            if neox_args.use_bnb_optimizer:
                try:
                    import bitsandbytes as bnb

                    adam_optimizer = bnb.optim.Adam8bit
                except ModuleNotFoundError:
                    print(
                        "Please install bitsandbytes following https://github.com/facebookresearch/bitsandbytes."
                    )
                    raise Exception
            else:
                try:
                    # default to apex as it's slightly faster
                    from apex.optimizers import FusedAdam as Adam
                except ImportError:
                    # if apex isn't installed, use deepspeed's FusedAdam
                    print(
                        "WARNING: APEX not installed - defaulting to deepspeed's fused adam"
                    )
                    from deepspeed.ops.adam import FusedAdam as Adam
                adam_optimizer = Adam  # this is the branch taken in my run
        print(f'adam_optimizer: {adam_optimizer}')
        optimizer = adam_optimizer(
            param_groups,
            weight_decay=neox_args.weight_decay,
            **neox_args.optimizer["params"],
        )
    elif neox_args.optimizer_type.lower() == "sgd":
        try:
            from mup import MuSGD
        except ModuleNotFoundError:
            print("Please install mup https://github.com/microsoft/mup")
            raise Exception
        optimizer = MuSGD(
            param_groups,
            weight_decay=neox_args.weight_decay,
            **neox_args.optimizer["params"],
        )
    else:
        raise ValueError(f"Optimizer type {neox_args.optimizer_type} not recognized")

    if neox_args.deepspeed:
        # fp16 wrapper is not required for DeepSpeed.
        return optimizer, param_groups
    else:
        raise ValueError("Must be using deepspeed to run neox")

The error is:

  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 194, in pretrain
    pretrain(neox_args=neox_args)
Traceback (most recent call last):
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 194, in pretrain
  File "train.py", line 30, in <module>
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 674, in setup_model_and_optimizer
    self._check_for_duplicates(basic_optimizer)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _check_for_duplicates
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 674, in setup_model_and_optimizer
        model, optimizer, _, lr_scheduler = deepspeed.initialize(main()

  File "train.py", line 27, in main
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/__init__.py", line 186, in initialize
        model, optimizer, _, lr_scheduler = deepspeed.initialize(assert occurrence <= 1, f"Parameter with name: {name} occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior."

  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/__init__.py", line 186, in initialize
AssertionError: Parameter with name: tied_modules.weight.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior.
    pretrain(neox_args=neox_args)
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 194, in pretrain
    engine = PipelineEngine(args=args,
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
    engine = PipelineEngine(args=args,
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/home/u2019000171/___fmh/pythia/gpt-neox-main/megatron/training.py", line 674, in setup_model_and_optimizer
    super().__init__(*super_args, **super_kwargs)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1227, in _configure_optimizer
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1227, in _configure_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/__init__.py", line 186, in initialize
    self._check_for_duplicates(basic_optimizer)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _check_for_duplicates
    self._check_for_duplicates(basic_optimizer)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _check_for_duplicates
    engine = PipelineEngine(args=args,
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 68, in __init__
    assert occurrence <= 1, f"Parameter with name: {name} occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior."
AssertionError: Parameter with name: tied_modules.weight.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior.
    assert occurrence <= 1, f"Parameter with name: {name} occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior."
AssertionError: Parameter with name: tied_modules.weight.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior.
    super().__init__(*super_args, **super_kwargs)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 304, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1227, in _configure_optimizer
    self._check_for_duplicates(basic_optimizer)
  File "/home/u2019000171/.conda/envs/gptneox/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1150, in _check_for_duplicates
    assert occurrence <= 1, f"Parameter with name: {name} occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior."
AssertionError: Parameter with name: tied_modules.weight.weight occurs multiple times in optimizer.param_groups. Make sure it only appears once to prevent undefined behavior.
SparkJiao commented 7 months ago

For the first problem, I'm not sure whether you have modified the layer's __init__ method. get_layers_from_config uses ParallelTransformerLayerPipe, which does not require layer_idx for initialization.
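Note that recent transformers releases (roughly 4.36 and later) changed LlamaDecoderLayer.__init__ to require a layer_idx argument, which matches the error above. Below is a minimal, illustrative sketch of keeping a subclass compatible with both the old and new signatures; CompatDecoderLayerPipe is a hypothetical name and this is not the repo's actual code (pinning transformers to an older release is the simpler alternative):

import inspect

from transformers.models.llama.modeling_llama import LlamaDecoderLayer


class CompatDecoderLayerPipe(LlamaDecoderLayer):
    """Hypothetical wrapper: forward layer_idx only if the installed
    transformers version expects it."""

    def __init__(self, config, layer_idx: int = 0):
        if "layer_idx" in inspect.signature(LlamaDecoderLayer.__init__).parameters:
            super().__init__(config, layer_idx)  # transformers >= ~4.36
        else:
            super().__init__(config)             # older transformers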

For the second problem, it seems to be caused by weight tying of the word embedding. As far as I remember, Llama-2 does not use weight tying, so you may want to check whether you have enabled it anywhere.
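As a quick check, you could print config.tie_word_embeddings from your checkpoint config (path taken from your snippet above), and if the duplicate-parameter assertion still fires, deduplicate parameters by identity before handing the groups to the optimizer. A rough sketch (dedup_param_groups is a hypothetical helper, not part of this repo or gpt-neox):

import transformers

config = transformers.AutoConfig.from_pretrained('XXX/pyllama2/7B')
print(config.tie_word_embeddings)  # Llama-2 checkpoints normally set this to False


def dedup_param_groups(param_groups):
    # Keep each tensor only once across all groups so DeepSpeed's
    # _check_for_duplicates assertion cannot trigger.
    seen = set()
    for group in param_groups:
        unique = []
        for p in group["params"]:
            if id(p) not in seen:
                seen.add(id(p))
                unique.append(p)
        group["params"] = unique
    return param_groups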

SparkJiao commented 7 months ago

Also, I'm not familiar with constructing the optimizer manually; it may be better to let the DeepSpeed engine initialize it.
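For reference, a minimal sketch of what I mean: put an optimizer section in the DeepSpeed config and let deepspeed.initialize build it. The config values below (batch size, lr, weight_decay, bf16) are placeholders, not settings from this repo:

import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5, "weight_decay": 0.01}},
    "bf16": {"enabled": True},
}

# model_pipe is the PipelineModule built earlier; DeepSpeed constructs the
# optimizer from ds_config instead of receiving one built by hand.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model_pipe,
    model_parameters=[p for p in model_pipe.parameters() if p.requires_grad],
    config=ds_config,
)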

fmh1art commented 7 months ago

Thanks a lot for your reply! Should I understand this to mean that the repo only supports Llama rather than Llama-2, due to differences in model structure?

SparkJiao commented 7 months ago

No, I believe they are the same: I only override the original Llama layer class to change the format of its inputs and outputs.
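For illustration, the wrapping pattern looks roughly like this. It is a sketch under the assumption that each pipeline stage passes a (hidden_states, attention_mask) tuple, and DecoderLayerPipe is a hypothetical name; see llama_ds_mp_wrap.py for the real implementation:

from transformers.models.llama.modeling_llama import LlamaDecoderLayer


class DecoderLayerPipe(LlamaDecoderLayer):
    # Same computation as the HF layer; only the call signature changes so the
    # DeepSpeed pipeline engine can move a tuple of tensors between stages.
    def forward(self, args):
        hidden_states, attention_mask = args
        layer_outputs = super().forward(hidden_states, attention_mask=attention_mask)
        # The parent returns a tuple with the hidden states first; re-pack the
        # mask so the next stage receives the same structure it expects.
        return layer_outputs[0], attention_mask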