Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

When using a Hugging Face pretrained model with multiple GPUs, model parameters are duplicated in RAM for every GPU #17043

Open linyubupa opened 1 year ago

linyubupa commented 1 year ago

Bug description

When using a Hugging Face pretrained model with multiple GPUs, the model parameters are duplicated in RAM for every GPU.

How to reproduce the bug

from pytorch_lightning import LightningModule, Trainer
from transformers import (
    AdamW,
    GPTNeoForCausalLM,
    GPT2Tokenizer,
    AutoTokenizer,
    AutoModelForCausalLM,
    get_linear_schedule_with_warmup,
)


class AlpsModule(LightningModule):
    def __init__(
        self,
        model_name_or_path: str = "EleutherAI/gpt-j-6B",
        cache_dir: str = "/mntnlp/yumu/gpt-neo-x/",
        num_labels: int = 2,
        learning_rate: float = 5e-6,
        adam_epsilon: float = 3e-8,
        warmup_steps: int = 30,
        weight_decay: float = 0.01,
        **kwargs,
    ):
        super().__init__()
        self.save_hyperparameters()

        # Tokenizer creation was not shown in the original snippet, but it is needed
        # for the special-token ids passed to the model below.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)

        # The full model is materialized here in __init__, i.e. once per process
        # (one full copy per GPU) before DeepSpeed gets a chance to shard it.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            pad_token_id=self.tokenizer.pad_token_id,
            bos_token_id=self.tokenizer.bos_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
            cache_dir=cache_dir,
            # low_cpu_mem_usage=True,
        ).half()


trainer = Trainer(
    max_epochs=1,
    accelerator="gpu",
    devices=args.num_devices,   # args comes from the surrounding script (not shown)
    num_nodes=args.num_nodes,
    precision=16,
    strategy="deepspeed_stage_3",
)

Error messages and logs

# Error messages and logs here please

Environment

Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @awaelchli

linyubupa commented 1 year ago

If you have multiple GPUs, the CPU memory cost is roughly 2 × model_size × num_gpus.
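
For scale, a rough back-of-the-envelope check of that formula (illustrative only: it assumes GPT-J-6B takes about 12 GB in fp16 and that there are 8 local GPUs):

    # Illustrative only: assumes GPT-J-6B is ~12 GB in fp16 and 8 local GPUs.
    model_size_gb = 12
    num_gpus = 8
    cpu_ram_gb = 2 * model_size_gb * num_gpus  # the formula above: 2 * model_size * gpu_count
    print(cpu_ram_gb)  # 192 -> roughly 192 GB of host RAM just for the per-process model copies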

awaelchli commented 1 year ago

Hey @linyubupa

This is expected in the way you are initializing the model. I can see from the code snippet that you create the model in __init__. This isn't wrong, but for large models like yours it is inefficient. I recommend moving the initialization into this special Lightning hook:

def configure_sharded_model(self):
    self.model = AutoModelForCausalLM.from_pretrained(...)

Here is the documentation for working with deepspeed models (and also documentation for configure_sharded_model): https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html#shard-model-instantly-to-reduce-initialization-time-memory
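
For reference, a minimal sketch of that layout (trimmed to the essentials; the argument values mirror the snippet in the report, and as later comments note, loading pretrained weights inside this hook may still hit other issues):

    from pytorch_lightning import LightningModule, Trainer
    from transformers import AutoModelForCausalLM

    class AlpsModule(LightningModule):
        def __init__(self, model_name_or_path: str = "EleutherAI/gpt-j-6B", **kwargs):
            super().__init__()
            self.save_hyperparameters()
            # No model here: __init__ only stores configuration.

        def configure_sharded_model(self):
            # Runs after the DeepSpeed stage-3 context is set up, so parameters can be
            # sharded across ranks instead of one full copy being built per process.
            self.model = AutoModelForCausalLM.from_pretrained(
                self.hparams.model_name_or_path
            ).half()

    trainer = Trainer(
        accelerator="gpu",
        devices=8,
        precision=16,
        strategy="deepspeed_stage_3",
    )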

awaelchli commented 1 year ago

Please let me know if that helps :)

cgd-bot commented 1 year ago

I had the same problem, but this method didn't solve it.

imraviagrawal commented 1 year ago

Yeah, I had the same issue, and the above does not solve it.

linyubupa commented 1 year ago
Sorry for the late reply. I build the model in configure_sharded_model, but the CPU memory usage is still very high.

KzZheng commented 1 year ago

Same issue. After I moved the model initialization into configure_sharded_model, I get a new error showing that the loaded parameters are being assigned to empty tensors.

(screenshot: Selection_278)

KzZheng commented 1 year ago


It seems the model initialization should go here, but loading the pretrained weights should not.

munhouiani commented 1 year ago


Did you find a solution for this?

leeglg commented 1 year ago

Any solution for this? I really need help.

saketsathe commented 1 year ago

I am facing a similar issue

KzZheng commented 1 year ago

I think one possible solution is to convert the pretrained model weights to the DeepSpeed ZeRO-3 sharded checkpoint format, but I haven't tried it yet.

saketsathe commented 1 year ago

Is there code to try it out?

leeglg commented 1 year ago

@saketsathe @KzZheng

I tried the following:

def configure_sharded_model(self):
    print("start configure sharded model")

    # random model initialization
    llamaconfig = LlamaConfig.from_pretrained("decapoda-research/llama-7b-hf")
    self.model = LlamaForCausalLM(llamaconfig)

    self.model.set_adapter(self.adapter_config)
    freeze_except_adapter(self.model, self.adapter_config)

    # list containing a single weight
    params_to_gather = [self.model.model.layers[0].self_attn.q_proj.weight]

    # Runs on every process.
    # Find the named parameters in the checkpoint shards and, if my model has a
    # parameter with the same name, overwrite it.
    # Print this weight once to see its value; later, run the same code to check
    # that it was changed to 0.

    # check value before change
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
        print("\n randomly initialized weight \n", self.model.model.layers[0].self_attn.q_proj.weight[0, :5])
        # (none)

    time.sleep(3)

    # 1. load the checkpoint files one by one
    # 2. match keys between the model parameters and the file
    # 3. assign the values
    # run on a single GPU only
    if torch.distributed.get_rank() == 0:
        with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
            self.model.model.layers[0].self_attn.q_proj.weight[0, :5] = 0
        # (none)
        # # check the checkpoint files
        # SHARDED_FILE_PATH = "/home2/leeg/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348"

        # # state dict of my loaded model
        # #model_named_params = self.model.model.named_parameters()

        # # self.model.get_parameter()

        # # load the checkpoint files
        # PATH_LIST = [os.path.join(SHARDED_FILE_PATH, f"pytorch_model-000{i:02}-of-00033.bin") for i in range(1, 34)]
        # for PATH in tqdm(PATH_LIST):

        #     # load a single checkpoint shard file's state dict
        #     file_state_dict = torch.load(PATH)

        #     # from the named_parameters of my model
        #     named_parameters = dict(self.model.model.named_parameters())

        #     # compare against the keys in the loaded checkpoint file; if a key also
        #     # exists in my model, take its value from the file
        #     params_to_gather = [named_parameters[k] for k in file_state_dict.keys() if k in named_parameters]

        #     # for cp_k, cp_v in file_state_dict.items():
        #     #     if "inv_freq" in cp_k:
        #     #         continue

        #         # model_p = self.model.model.get_parameter(cp_k)
        #         # sharded_model_ps_dict[cp_k] = self.model.model.get_parameter(cp_k)
        #     with deepspeed.zero.GatheredParameters(params=params_to_gather, modifier_rank=0):
        #         self.model.model.load_state_dict(file_state_dict, strict=False)

    dist.barrier()

    # runs on every process
    # check value after change
    with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
        print("\n barrier weight ", self.model.model.layers[0].self_attn.q_proj.weight[0, :5])
        # (none)

The problem is that when I use from_pretrained("~~") inside the LightningModule's configure_sharded_model, the Lightning DeepSpeed stage 3 strategy interferes with from_pretrained's weight assignment.

So I tried manually assigning the parameter tensors from the sharded checkpoint files to my model's variables instead of using from_pretrained.

I haven't fully figured all of this out, but I experimented with the following:

  1. I can access my randomly initialized parameters wherever the DeepSpeed stage 3 strategy has placed them (GPU or CPU).
  2. I can change a parameter's value inside the deepspeed.zero.GatheredParameters(...) context manager.

So I think loading the pretrained parameter files manually and overwriting my randomly initialized model parameters inside deepspeed.zero.GatheredParameters is a suitable approach.
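
Condensing the commented-out part of the snippet above into one place, the idea would look roughly like this (an untested sketch: the shard directory and file naming come from the comment above, the helper name load_hf_shards_into_zero3_model is made up here, and it assumes torch.distributed is already initialized by the strategy):

    import os
    import torch
    import torch.distributed as dist
    import deepspeed
    from tqdm import tqdm

    def load_hf_shards_into_zero3_model(model, shard_dir, num_shards=33):
        # Overwrite ZeRO-3 partitioned parameters with values from HF checkpoint shards.
        named_parameters = dict(model.named_parameters())
        shard_paths = [
            os.path.join(shard_dir, f"pytorch_model-000{i:02}-of-00033.bin")
            for i in range(1, num_shards + 1)
        ]
        for path in tqdm(shard_paths):
            file_state_dict = torch.load(path, map_location="cpu")
            # Gather only the parameters that appear in this shard, then let rank 0 write into them.
            params_to_gather = [named_parameters[k] for k in file_state_dict if k in named_parameters]
            with deepspeed.zero.GatheredParameters(params_to_gather, modifier_rank=0):
                if dist.get_rank() == 0:
                    model.load_state_dict(file_state_dict, strict=False)
        dist.barrier()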

linyubupa commented 1 year ago

I solved this by using the DeepSpeed launcher with the Transformers Trainer (https://huggingface.co/docs/transformers/main_classes/deepspeed):

    deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
        your_program.py --deepspeed ds_config.json
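
For context, a hypothetical minimal ZeRO-3 config for that route, passed to the HF Trainer as a Python dict (a path to a ds_config.json works just as well; the "auto" values follow the pattern from the linked docs, and the output dir and batch size below are placeholders):

    from transformers import TrainingArguments

    ds_config = {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": True,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
    }

    training_args = TrainingArguments(
        output_dir="out",                 # placeholder
        per_device_train_batch_size=1,    # placeholder
        fp16=True,
        deepspeed=ds_config,              # or deepspeed="ds_config.json"
    )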

zhilif commented 1 year ago

Is there any update on this issue?

Morizeyao commented 9 months ago

update?

chuckhope commented 6 months ago

same...

mickeysun0104 commented 1 month ago

any updates?