BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference

Why is this CUDA error raised? #95

Closed. YooSungHyun closed this issue 1 year ago.

YooSungHyun commented 1 year ago

I used tiiuae/falcon-40b

and want to do full fine-tuning on the LIMA instruction dataset.

model = tp.tensor_parallel(model, sharded=True)

I just use it like this. I have 1) a server with 2x A100 80GB and 2) another server with 4x A100 80GB, but when I run the code on either server it raises this error:

Model parameters were moved to incorrect devices, did call on model.cuda() or model.to(device)? If so, please avoid doing that

Why?

BlackSamorez commented 1 year ago

This error is raised during forward pass when model parameters are found on a device they are not expected to be on. This usually happens when parameters are moved after making model tensor_parallel. Can you provide a little more code to specify how you deploy the model and what happens between model = tp.tensor_parallel(model, sharded=True) and model(...)?
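
For reference, here is a minimal usage sketch (not your exact script; the model name is only an example, following the usual README pattern): shard the model once and never move it afterwards; only the inputs go to the first shard's device.

import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal usage sketch: shard once, then never call .cuda() / .to() on the wrapped model.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b", torch_dtype=torch.bfloat16, trust_remote_code=True
)
model = tp.tensor_parallel(model, sharded=True)  # parameters land on their target GPUs here

inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")  # inputs go to the first shard's device
outputs = model(**inputs)  # moving the model itself after sharding is what triggers the device check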

YooSungHyun commented 1 year ago

This is my whole training code... I don't know why the parameters are being moved to another device... 😅

import copy
import logging
import os

import torch
import tensor_parallel as tp
import torch._dynamo
from datasets import load_dataset
from setproctitle import setproctitle
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, Trainer
from transformers.trainer_utils import is_main_process

from arguments import DatasetsArguments, ModelArguments, MyTrainingArguments
from utils import DataCollatorForSupervisedDataset

torch._dynamo.config.verbose = True
UNUSED0 = ">>QUESTION<<"
UNUSED1 = ">>ANSWER<<"
IGNORE_INDEX = -100
os.environ["TORCHDYNAMO_DISABLE"] = "1"

def main(model_args: ModelArguments, dataset_args: DatasetsArguments, training_args: MyTrainingArguments):
    setproctitle(model_args.model_name_or_path + "/" + dataset_args.data_path + "/finetuning")

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, model_max_length=model_args.max_length, use_fast=False
    )
    tokenizer.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        low_cpu_mem_usage=True,
        use_cache=False,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        offload_state_dict=True,
    )
    model = tp.tensor_parallel(model, sharded=True)
    dataset = load_dataset(dataset_args.data_path, split="train")

    def preprocess(raw):
        # text = raw["prompt"].replace(USER, UNUSED0).replace(SYSTEM, UNUSED1) + raw['instruction'] + tokenizer.eos_token
        input_text = UNUSED0 + raw["conversations"][0] + tokenizer.eos_token + UNUSED1
        label_text = raw["conversations"][1] + tokenizer.eos_token
        total_text = input_text + label_text
        input_seq_token_len = len(tokenizer(input_text)["input_ids"])
        tokenized_text = tokenizer(total_text, return_token_type_ids=False, return_tensors="pt")
        raw["input_ids"] = tokenized_text["input_ids"][0]
        raw["attention_mask"] = tokenized_text["attention_mask"][0]

        labels_ids = copy.deepcopy(raw["input_ids"])
        labels_ids[:input_seq_token_len] = IGNORE_INDEX
        raw["labels"] = labels_ids
        return raw

    dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    dataset = dataset.filter(lambda x: len(x["input_ids"]) <= model_args.max_length)
    dataset.set_format("torch")

    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    if training_args.local_rank == 0:
        import wandb

        wandb.init(
            project=training_args.wandb_project,
            entity=training_args.wandb_entity,
            name=training_args.wandb_name,
        )

    trainer = Trainer(
        model=model.cuda(),
        data_collator=data_collator,
        train_dataset=dataset,
        args=training_args,
    )
    trainer.train()
    with tp.save_tensor_parallel(model):
        trainer.save_model(training_args.output_dir)
        tokenizer.save_pretrained(training_args.output_dir)

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DatasetsArguments, MyTrainingArguments))
    model_args, dataset_args, training_args = parser.parse_args_into_dataclasses()
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
    )
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)
YooSungHyun commented 1 year ago

And another question: I want to train one model sharded across 2 or 4 GPUs. How should I write the launch script?

I run it like this:

model_name_or_path="tiiuae/falcon-40b"
TENSOR_PARALLEL_USE_NATIVE=1 \
python3 train_falcon_single_turn.py \
    --output_dir "falcon-40b-test" \
    --model_name_or_path "${model_name_or_path}" \
    --data_path "GAIR/lima" \
    --max_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 2e-5 \
    --gradient_accumulation_steps 64 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --save_total_limit 5 \
    --wandb_name "falcon-40b-lima-single-turn" \
    --wandb_entity "test_by_me" \
    --wandb_project "LLM-finetune-test" \
    --dataloader_num_workers 0 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --remove_unused_columns False \
    --optim adafactor \
    --torch_compile True \
    --bf16 True \
    --tf32 True

Since this is one model sharded across multiple GPUs, I think I should not launch it with torchrun. If I run that shell script, will it correctly train one model sharded across multiple GPUs?

BlackSamorez commented 1 year ago

Firstly, don't call model.cuda(). This is what's causing the issue. Secondly, yes. The script above should split one model among all available GPUs.
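
Concretely, the Trainer construction in the script above would become something like this (a minimal sketch; everything else stays the same):

# The tensor_parallel wrapper has already placed every shard,
# so hand the model to the Trainer without moving it.
model = tp.tensor_parallel(model, sharded=True)

trainer = Trainer(
    model=model,  # not model.cuda()
    data_collator=data_collator,
    train_dataset=dataset,
    args=training_args,
)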

YooSungHyun commented 1 year ago

Even when I don't call model.cuda(), the same error is raised:

  0%|                                                                                                                    | 0/15 [00:00<?, ?it/s](device(type='cuda', index=0), device(type='cuda', index=1))
Traceback (most recent call last):
  File "/ssd/data01/bart/LLM42/train/train_falcon_single_turn.py", line 90, in <module>
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)
  File "/ssd/data01/bart/LLM42/train/train_falcon_single_turn.py", line 76, in main
    trainer.train()
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/pretrained_model.py", line 78, in forward
    return self.wrapped_model(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/sharding.py", line 95, in forward
    return self.module(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 124, in forward
    raise ValueError(
ValueError: Model parameters were moved to incorrect devices, did call on model.cuda() or model.to(device)? If so, please avoid doing that

I am monitoring my GPU memory with nvidia-smi. Could the error be raised because GPU memory is not enough? GPU 0 (A100) climbs to 79000MiB/81920MiB and then the error is raised.
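
As a sanity check, per-GPU memory can also be logged from inside the script with standard torch.cuda calls, roughly like this sketch:

import torch

# Log per-GPU memory to see whether one shard (often cuda:0) is close to its limit.
for i in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(i) / 2**30
    reserved_gib = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated={allocated_gib:.1f} GiB, reserved={reserved_gib:.1f} GiB")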

YooSungHyun commented 1 year ago

My code has changed:

import copy
import json
import logging
import os

import accelerate
import tensor_parallel as tp
import torch
import torch._dynamo
from arguments import DatasetsArguments, ModelArguments, MyTrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from setproctitle import setproctitle
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, LlamaTokenizer, Trainer
from transformers.trainer_utils import is_main_process
from transformers.utils.bitsandbytes import replace_with_bnb_linear
from transformers.utils.quantization_config import BitsAndBytesConfig
from utils import DataCollatorForSupervisedDataset

torch._dynamo.config.verbose = True
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["TORCHDYNAMO_DISABLE"] = "1"

UNUSED0 = ">>QUESTION<<"
UNUSED1 = ">>ANSWER<<"
IGNORE_INDEX = -100

def main(model_args: ModelArguments, dataset_args: DatasetsArguments, training_args: MyTrainingArguments):
    setproctitle("train_qlora_tp")
    assert (
        training_args.tp_gpu_num is not None
    ), "Maybe recommended running ddp or something, this is TENSOR PARALLEL running!"
    device_ids = list()
    for gpu_num in training_args.tp_gpu_num:
        device_ids.append("cuda:" + gpu_num)

    if model_args.model_name_or_path == "decapoda-research/llama-30b-hf":
        tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
        tokenizer.pad_token = tokenizer.eos_token

    if training_args.bf16:
        torch_dtype = torch.bfloat16
    elif training_args.fp16:
        torch_dtype = torch.float16
    else:
        torch_dtype = torch.float32

    with accelerate.init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True, torch_dtype=torch_dtype),
            trust_remote_code=True,
        )

    model = tp.TensorParallelPreTrainedModel(  # <- tensor parallelism starts here
        model,
        device_ids=device_ids,
    )

    model = replace_with_bnb_linear(
        model,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        ),
    )
    model.is_loaded_in_4bit = True

    device_map = tp.infer_sharded_device_map(model)  # <- The model is on the meta device but we can still deduce
    #    the target devices for each weight using this helper function

    # Get nums parts
    with open(os.path.join(model_args.model_name_or_path, "pytorch_model.bin.index.json"), "r") as index_file:
        shard_filenames = set(json.load(index_file)["weight_map"].values())

    for shard_filename in sorted(shard_filenames):
        # Download a shard
        shard_path = os.path.join(model_args.model_name_or_path, shard_filename)

        # Convert model shard
        converted_state_dict = tp.convert_state_dict(  # <- tensor_parallel helper function.
            torch.load(shard_path),
            model.tensor_parallel_config,
            world_size=2,
            for_pretrained=True,
        )

        # Dispatch the shard
        for param_name, param in converted_state_dict.items():
            module_name = param_name

            while len(module_name) > 0 and module_name not in device_map:
                module_name = ".".join(module_name.split(".")[:-1])
            param_device = device_map[module_name]

            accelerate.utils.set_module_tensor_to_device(model, param_name, param_device, value=param)
            converted_state_dict[param_name] = None
        del converted_state_dict

    # def get_num_layers(model):
    #     numbers = set()
    #     for name, _ in model.named_parameters():
    #         for number in re.findall(r"\d+", name):
    #             numbers.add(int(number))
    #     return max(numbers)

    # def get_last_layer_linears(model):
    #     names = []

    #     num_layers = get_num_layers(model)
    #     for name, module in model.named_modules():
    #         print(name)
    #         if str(num_layers) in name and not "encoder" in name:
    #             if isinstance(module, torch.nn.Linear):
    #                 names.append(name)
    #     return names

    lora_config = LoraConfig(
        r=16, lora_alpha=32, target_modules=["query_key_value", "dense"], lora_dropout=0.05, bias="none"
    )
    model = get_peft_model(model, lora_config)
    model.config.use_cache = False
    dataset = load_dataset(dataset_args.data_path, split="train")

    def preprocess(raw):
        # text = raw["prompt"].replace(USER, UNUSED0).replace(SYSTEM, UNUSED1) + raw['instruction'] + tokenizer.eos_token
        input_text = UNUSED0 + raw["conversations"][0] + tokenizer.eos_token + UNUSED1
        label_text = raw["conversations"][1] + tokenizer.eos_token
        total_text = input_text + label_text
        input_seq_token_len = len(tokenizer(input_text)["input_ids"])
        tokenized_text = tokenizer(total_text, return_token_type_ids=False, return_tensors="pt")
        raw["input_ids"] = tokenized_text["input_ids"][0]
        raw["attention_mask"] = tokenized_text["attention_mask"][0]

        labels_ids = copy.deepcopy(raw["input_ids"])
        labels_ids[:input_seq_token_len] = IGNORE_INDEX
        raw["labels"] = labels_ids
        return raw

    dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    dataset = dataset.filter(lambda x: len(x["input_ids"]) <= model_args.max_length)
    dataset.set_format("torch")

    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)

    if training_args.local_rank == 0:
        import wandb

        wandb.init(
            project=training_args.wandb_project,
            entity=training_args.wandb_entity,
            name=training_args.wandb_name,
        )

    if training_args.gradient_checkpointing:
        logging.warning("TensorParallelPreTrainedModel does not support gradient checkpointing.")
        training_args.gradient_checkpointing = False

    trainer = Trainer(
        model=model,
        data_collator=data_collator,
        train_dataset=dataset,
        # eval_dataset=dataset["test"],
        args=training_args,
    )
    trainer.train()
    with tp.save_tensor_parallel(model):
        trainer.save_model(training_args.output_dir)
        tokenizer.save_pretrained(training_args.output_dir)

if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DatasetsArguments, MyTrainingArguments))
    model_args, dataset_args, training_args = parser.parse_args_into_dataclasses()
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
    )
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)

Falcon-40B tensor parallel only works for me at commit hash c47b371b31a68349c233104050ac76680b8485db, with tp_gpu_num 0 1 and bf16 True.
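
For context, a hypothetical sketch of the custom arguments used above (arguments.py is not shown in this thread; the field names follow the training script, the types and defaults are guesses):

from dataclasses import dataclass, field
from typing import List, Optional

from transformers import TrainingArguments

# Hypothetical sketch of arguments.py (not shown here); names follow the script, defaults are guesses.
@dataclass
class MyTrainingArguments(TrainingArguments):
    tp_gpu_num: Optional[List[str]] = field(
        default=None, metadata={"help": "GPU indices to shard across, e.g. --tp_gpu_num 0 1"}
    )
    wandb_project: Optional[str] = field(default=None)
    wandb_entity: Optional[str] = field(default=None)
    wandb_name: Optional[str] = field(default=None)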

YooSungHyun commented 1 year ago

(screenshots omitted)

YooSungHyun commented 1 year ago

But before I go into trainer.train(), the placement looks good on each GPU. (screenshot omitted)

YooSungHyun commented 1 year ago

@BlackSamorez

In transformers.trainer._inner_training_loop, use_accelerator_prepare ends up putting the parameters that _sanity_check_params sees on the same CUDA device... I will dig into it some more...

YooSungHyun commented 1 year ago

This is what makes the CUDA devices the same. (screenshots omitted)

YooSungHyun commented 1 year ago

If I change it like this:

# as the model is wrapped, don't use `accelerator.prepare`
# this is for unhandled cases such as
# Fairscale Sharded DDP, FSDP-XLA, SageMaker MP/DP, DataParallel, IPEX
use_accelerator_prepare = False
# use_accelerator_prepare = True if model is self.model else False

it works fine. Is this my mistake, or is it a bug?

@BlackSamorez (screenshot omitted)

YooSungHyun commented 1 year ago

The model is already wrapped in TensorParallel, so why is use_accelerator_prepare True? (screenshot omitted)

The comment in the Trainer source even says: as the model is wrapped, don't use accelerator.prepare.

YooSungHyun commented 1 year ago

https://github.com/BlackSamorez/tensor_parallel/blob/main/examples/training_flan-t5-xl.ipynb

I think that example hits the same error too... 🤣

Which transformers version did you use?

YooSungHyun commented 1 year ago

I think this error occurs on transformers v4.30.0 and above.

YooSungHyun commented 1 year ago

I've overridden my Trainer, but training is not going well... My dataset is LIMA...

I don't know how to fix it. I changed the learning rate and the optimizer, but the training curve barely changes. (screenshot omitted)

dragosconst commented 1 year ago

I have the same problem with transformers version 4.30.0 or greater (seems to work with 4.29, but without fp16 enabled for some reason...).

YooSungHyun commented 1 year ago

I think my overridden Trainer is fine now; LIMA training finished successfully...

@dragosconst You probably have to override the transformers Trainer...
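
If overriding the Trainer gets too messy, another option is to sidestep Trainer and accelerate entirely with a plain PyTorch loop. A rough sketch, assuming model, dataset and data_collator are built as in the scripts above:

import torch
from torch.utils.data import DataLoader

# Rough sketch of a plain training loop that never calls accelerator.prepare,
# so nothing re-places the tensor_parallel shards.
loader = DataLoader(dataset, batch_size=1, shuffle=True, collate_fn=data_collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    batch = {k: v.to("cuda:0") for k, v in batch.items()}  # inputs go to the first shard's device
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()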

YooSungHyun commented 1 year ago

First of all, I succeeded in training with tensor parallel, but now I use DeepSpeed ZeRO-3, so this issue is no longer needed. Closing.