Closed YooSungHyun closed 1 year ago
This error is raised during the forward pass when model parameters are found on a device they are not expected to be on. This usually happens when parameters are moved after the model has been made tensor-parallel.
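A minimal sketch of the invariant being checked, with device names as plain strings and hypothetical helper names (`snapshot_devices`, `check_devices` are illustrative, not the library's actual code): right after sharding, each parameter has an assigned device, and the forward pass raises if anything has moved since.

```python
def snapshot_devices(named_params):
    """Record which device each named parameter lives on, right after sharding."""
    return dict(named_params)

def check_devices(named_params, expected):
    """Raise if any parameter has moved, e.g. via model.cuda() or model.to(device)."""
    moved = {name: dev for name, dev in named_params if expected[name] != dev}
    if moved:
        raise ValueError(f"Model parameters were moved to incorrect devices: {moved}")

# Shards pinned at tensor_parallel() time...
expected = snapshot_devices([("w1", "cuda:0"), ("w2", "cuda:1")])
# ...must still be on those devices at forward time:
check_devices([("w1", "cuda:0"), ("w2", "cuda:1")], expected)  # passes silently
```

Calling `.cuda()` on such a model would collapse both shards onto `cuda:0`, and the check would raise exactly the `ValueError` discussed in this issue.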
Can you provide a little more code to specify how you deploy the model and what happens between `model = tp.tensor_parallel(model, sharded=True)` and `model(...)`?
This is my whole training code... I don't know why the parameters are being moved to another device... 😅
```python
import copy
import logging
import os

import torch
import tensor_parallel as tp
import torch._dynamo
from datasets import load_dataset
from setproctitle import setproctitle
from transformers import AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, Trainer
from transformers.trainer_utils import is_main_process

from arguments import DatasetsArguments, ModelArguments, MyTrainingArguments
from utils import DataCollatorForSupervisedDataset

torch._dynamo.config.verbose = True

UNUSED0 = ">>QUESTION<<"
UNUSED1 = ">>ANSWER<<"
IGNORE_INDEX = -100
os.environ["TORCHDYNAMO_DISABLE"] = "1"


def main(model_args: ModelArguments, dataset_args: DatasetsArguments, training_args: MyTrainingArguments):
    setproctitle(model_args.model_name_or_path + "/" + dataset_args.data_path + "/finetuning")
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.model_name_or_path, model_max_length=model_args.max_length, use_fast=False
    )
    tokenizer.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        low_cpu_mem_usage=True,
        use_cache=False,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        offload_state_dict=True,
    )
    model = tp.tensor_parallel(model, sharded=True)
    dataset = load_dataset(dataset_args.data_path, split="train")

    def preprocess(raw):
        # text = raw["prompt"].replace(USER, UNUSED0).replace(SYSTEM, UNUSED1) + raw['instruction'] + tokenizer.eos_token
        input_text = UNUSED0 + raw["conversations"][0] + tokenizer.eos_token + UNUSED1
        label_text = raw["conversations"][1] + tokenizer.eos_token
        total_text = input_text + label_text
        input_seq_token_len = len(tokenizer(input_text)["input_ids"])
        tokenized_text = tokenizer(total_text, return_token_type_ids=False, return_tensors="pt")
        raw["input_ids"] = tokenized_text["input_ids"][0]
        raw["attention_mask"] = tokenized_text["attention_mask"][0]
        labels_ids = copy.deepcopy(raw["input_ids"])
        labels_ids[:input_seq_token_len] = IGNORE_INDEX
        raw["labels"] = labels_ids
        return raw

    dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    dataset = dataset.filter(lambda x: len(x["input_ids"]) <= model_args.max_length)
    dataset.set_format("torch")
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    if training_args.local_rank == 0:
        import wandb

        wandb.init(
            project=training_args.wandb_project,
            entity=training_args.wandb_entity,
            name=training_args.wandb_name,
        )
    trainer = Trainer(
        model=model.cuda(),
        data_collator=data_collator,
        train_dataset=dataset,
        args=training_args,
    )
    trainer.train()
    with tp.save_tensor_parallel(model):
        trainer.save_model(training_args.output_dir)
    tokenizer.save_pretrained(training_args.output_dir)


if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DatasetsArguments, MyTrainingArguments))
    model_args, dataset_args, training_args = parser.parse_args_into_dataclasses()
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
    )
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)
```
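The label-masking step in `preprocess` can be sketched without a tokenizer (the helper name `mask_prompt_labels` is made up for illustration): the prompt portion of the sequence is set to `IGNORE_INDEX` so that cross-entropy loss, which ignores -100, is only computed on the answer tokens.

```python
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids and mask the first prompt_len positions with -100,
    so the loss skips the prompt and trains only on the answer tokens."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

mask_prompt_labels([10, 11, 12, 13, 14], 2)  # -> [-100, -100, 12, 13, 14]
```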
And another question: I want to train one model sharded across 2 or 4 GPUs. How should I write the launch script? I set it up like this:
```shell
model_name_or_path="tiiuae/falcon-40b"

TENSOR_PARALLEL_USE_NATIVE=1 \
python3 train_falcon_single_turn.py \
    --output_dir "falcon-40b-test" \
    --model_name_or_path "${model_name_or_path}" \
    --data_path "GAIR/lima" \
    --max_length 2048 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 2e-5 \
    --gradient_accumulation_steps 64 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --logging_strategy "steps" \
    --logging_steps 10 \
    --save_total_limit 5 \
    --wandb_name "falcon-40b-lima-single-turn" \
    --wandb_entity "test_by_me" \
    --wandb_project "LLM-finetune-test" \
    --dataloader_num_workers 0 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --remove_unused_columns False \
    --optim adafactor \
    --torch_compile True \
    --bf16 True \
    --tf32 True
```
I think this is one model sharded across multiple GPUs, so I don't launch it with torchrun. If I run that shell script, will it correctly train one model sharded across multiple GPUs?
Firstly, don't call `model.cuda()`. This is what's causing the issue.
Secondly, yes: the script above should split one model among all available GPUs.
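In other words, the fix to the script above is a one-line change (a sketch of the `Trainer` call from the user's script, not runnable on its own):

```python
trainer = Trainer(
    model=model,  # pass the tensor-parallel model as-is, NOT model.cuda()
    data_collator=data_collator,
    train_dataset=dataset,
    args=training_args,
)
```

`tensor_parallel` has already placed each shard on its target GPU, so any later `.cuda()` or `.to(device)` collapses the shards onto one device and triggers the device-placement `ValueError`.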
Even when I don't call `model.cuda()`, the same error is raised:
```
0%| | 0/15 [00:00<?, ?it/s](device(type='cuda', index=0), device(type='cuda', index=1))
Traceback (most recent call last):
  File "/ssd/data01/bart/LLM42/train/train_falcon_single_turn.py", line 90, in <module>
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)
  File "/ssd/data01/bart/LLM42/train/train_falcon_single_turn.py", line 76, in main
    trainer.train()
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2759, in training_step
    loss = self.compute_loss(model, inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2784, in compute_loss
    outputs = model(**inputs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/pretrained_model.py", line 78, in forward
    return self.wrapped_model(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/sharding.py", line 95, in forward
    return self.module(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ssd/data01/bart/LLM42/.venv/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 124, in forward
    raise ValueError(
ValueError: Model parameters were moved to incorrect devices, did call on model.cuda() or model.to(device)? If so, please avoid doing that
```
I am monitoring my GPU memory with `nvidia-smi`. If GPU memory is not enough, could an error like that be raised? Because my GPU 0 (an A100) climbs to 79000MiB/81920MiB and then the error is raised.
My code has changed:
```python
import copy
import json
import logging
import os

import accelerate
import tensor_parallel as tp
import torch
import torch._dynamo
from arguments import DatasetsArguments, ModelArguments, MyTrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from setproctitle import setproctitle
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, LlamaTokenizer, Trainer
from transformers.trainer_utils import is_main_process
from transformers.utils.bitsandbytes import replace_with_bnb_linear
from transformers.utils.quantization_config import BitsAndBytesConfig
from utils import DataCollatorForSupervisedDataset

torch._dynamo.config.verbose = True
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["TORCHDYNAMO_DISABLE"] = "1"

UNUSED0 = ">>QUESTION<<"
UNUSED1 = ">>ANSWER<<"
IGNORE_INDEX = -100


def main(model_args: ModelArguments, dataset_args: DatasetsArguments, training_args: MyTrainingArguments):
    setproctitle("train_qlora_tp")
    assert (
        training_args.tp_gpu_num is not None
    ), "Maybe recommended running ddp or something, this is TENSOR PARALLEL running!"
    device_ids = list()
    for gpu_num in training_args.tp_gpu_num:
        device_ids.append("cuda:" + gpu_num)
    if model_args.model_name_or_path == "decapoda-research/llama-30b-hf":
        tokenizer = LlamaTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
    else:
        tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    if training_args.bf16:
        torch_dtype = torch.bfloat16
    elif training_args.fp16:
        torch_dtype = torch.float16
    else:
        torch_dtype = torch.float32
    with accelerate.init_empty_weights():
        model = AutoModelForCausalLM.from_config(
            AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True, torch_dtype=torch_dtype),
            trust_remote_code=True,
        )
    model = tp.TensorParallelPreTrainedModel(  # <- tensor parallelism starts here
        model,
        device_ids=device_ids,
    )
    model = replace_with_bnb_linear(
        model,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch_dtype,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        ),
    )
    model.is_loaded_in_4bit = True
    device_map = tp.infer_sharded_device_map(model)  # <- The model is on the meta device, but we can still deduce
    # the target device for each weight using this helper function.

    # Get the shard filenames
    with open(os.path.join(model_args.model_name_or_path, "pytorch_model.bin.index.json"), "r") as index_file:
        shard_filenames = set(json.load(index_file)["weight_map"].values())
    for shard_filename in sorted(shard_filenames):
        # Load a shard
        shard_path = os.path.join(model_args.model_name_or_path, shard_filename)
        # Convert the model shard
        converted_state_dict = tp.convert_state_dict(  # <- tensor_parallel helper function.
            torch.load(shard_path),
            model.tensor_parallel_config,
            world_size=2,
            for_pretrained=True,
        )
        # Dispatch the shard
        for param_name, param in converted_state_dict.items():
            module_name = param_name
            while len(module_name) > 0 and module_name not in device_map:
                module_name = ".".join(module_name.split(".")[:-1])
            param_device = device_map[module_name]
            accelerate.utils.set_module_tensor_to_device(model, param_name, param_device, value=param)
            converted_state_dict[param_name] = None
    del converted_state_dict

    # def get_num_layers(model):
    #     numbers = set()
    #     for name, _ in model.named_parameters():
    #         for number in re.findall(r"\d+", name):
    #             numbers.add(int(number))
    #     return max(numbers)

    # def get_last_layer_linears(model):
    #     names = []
    #     num_layers = get_num_layers(model)
    #     for name, module in model.named_modules():
    #         print(name)
    #         if str(num_layers) in name and not "encoder" in name:
    #             if isinstance(module, torch.nn.Linear):
    #                 names.append(name)
    #     return names

    lora_config = LoraConfig(
        r=16, lora_alpha=32, target_modules=["query_key_value", "dense"], lora_dropout=0.05, bias="none"
    )
    model = get_peft_model(model, lora_config)
    model.config.use_cache = False
    dataset = load_dataset(dataset_args.data_path, split="train")

    def preprocess(raw):
        # text = raw["prompt"].replace(USER, UNUSED0).replace(SYSTEM, UNUSED1) + raw['instruction'] + tokenizer.eos_token
        input_text = UNUSED0 + raw["conversations"][0] + tokenizer.eos_token + UNUSED1
        label_text = raw["conversations"][1] + tokenizer.eos_token
        total_text = input_text + label_text
        input_seq_token_len = len(tokenizer(input_text)["input_ids"])
        tokenized_text = tokenizer(total_text, return_token_type_ids=False, return_tensors="pt")
        raw["input_ids"] = tokenized_text["input_ids"][0]
        raw["attention_mask"] = tokenized_text["attention_mask"][0]
        labels_ids = copy.deepcopy(raw["input_ids"])
        labels_ids[:input_seq_token_len] = IGNORE_INDEX
        raw["labels"] = labels_ids
        return raw

    dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    dataset = dataset.filter(lambda x: len(x["input_ids"]) <= model_args.max_length)
    dataset.set_format("torch")
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    if training_args.local_rank == 0:
        import wandb

        wandb.init(
            project=training_args.wandb_project,
            entity=training_args.wandb_entity,
            name=training_args.wandb_name,
        )
    if training_args.gradient_checkpointing:
        logging.warning("TensorParallelPreTrainedModel does not support gradient checkpointing.")
        training_args.gradient_checkpointing = False
    trainer = Trainer(
        model=model,
        data_collator=data_collator,
        train_dataset=dataset,
        # eval_dataset=dataset["test"],
        args=training_args,
    )
    trainer.train()
    with tp.save_tensor_parallel(model):
        trainer.save_model(training_args.output_dir)
    tokenizer.save_pretrained(training_args.output_dir)


if __name__ == "__main__":
    parser = HfArgumentParser((ModelArguments, DatasetsArguments, MyTrainingArguments))
    model_args, dataset_args, training_args = parser.parse_args_into_dataclasses()
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if is_main_process(training_args.local_rank) else logging.WARN,
    )
    main(model_args=model_args, dataset_args=dataset_args, training_args=training_args)
```
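The dispatch loop above resolves each parameter to a device by walking up its dotted module path until some prefix appears in the device map. A standalone sketch of that lookup (the module names and devices below are made up for illustration):

```python
def resolve_device(param_name, device_map):
    """Drop one dotted component at a time until a prefix is found in device_map."""
    module_name = param_name
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    # Raises KeyError ("" is not in the map) if no prefix matches at all.
    return device_map[module_name]

device_map = {"transformer.h.0": "cuda:0", "transformer.h.1": "cuda:1"}
resolve_device("transformer.h.0.mlp.dense.weight", device_map)  # -> "cuda:0"
```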
falcon-40b tensor parallel only works for me at commit hash c47b371b31a68349c233104050ac76680b8485db, with `tp_gpu_num 0 1` and `bf16 True`.
But before I reach `trainer.train()`, everything looks fine on each GPU.
@BlackSamorez In `transformers.trainer.Trainer._inner_training_loop`, `use_accelerator_prepare` is what makes `_sanity_check_params`'s CUDA devices the same... I will investigate some more. This is what collapses the CUDA devices onto one.
If I change it like this:

```python
# as the model is wrapped, don't use `accelerator.prepare`
# this is for unhandled cases such as
# Fairscale Sharded DDP, FSDP-XLA, SageMaker MP/DP, DataParallel, IPEX
use_accelerator_prepare = False
# use_accelerator_prepare = True if model is self.model else False
```

it works fine. What is my problem??? Or is it a bug???
@BlackSamorez The model is already wrapped in TensorParallel, so why is `use_accelerator_prepare` `True`? The comment even says: as the model is wrapped, don't use `accelerator.prepare`.
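A hedged sketch of the kind of workaround being described here, assuming one subclasses `Trainer` and copies the internal `_inner_training_loop` from the installed transformers version (an internal method whose details change between releases; the subclass name is made up):

```python
from transformers import Trainer

class TensorParallelTrainer(Trainer):
    def _inner_training_loop(self, *args, **kwargs):
        # Hypothetical override: the body would be copied verbatim from the
        # installed transformers version, with one change -- the
        # tensor_parallel-wrapped model already has its shards pinned to
        # their devices, so accelerator.prepare(model) must be skipped:
        #
        #     use_accelerator_prepare = False
        #     # use_accelerator_prepare = True if model is self.model else False
        ...
```

Copying an internal method pins the training loop to one transformers version, which is presumably why later comments in this thread report the behavior differing between 4.29 and 4.30.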
https://github.com/BlackSamorez/tensor_parallel/blob/main/examples/training_flan-t5-xl.ipynb
I think that example runs into the same error too... 🤣
Which version of transformers did you use?
On transformers v4.30.0 and above, I think this error occurs.
I've overridden my Trainer, but training is not going well... My dataset is LIMA... I don't know how to fix it. I changed the learning rate and optimizer, but my training doesn't really change.
I have the same problem with transformers version 4.30.0 or greater (it seems to work with 4.29, but only without fp16 enabled, for some reason...).
I think my overridden Trainer is fine now; LIMA training completed successfully....
@dragosconst Maybe you have to override the transformers Trainer...
First of all, I succeeded in training using tensor parallelism, and I now use DeepSpeed ZeRO-3 instead, so this issue is no longer needed. Closing.
I used tiiuae/falcon-40b and want to do full fine-tuning on the LIMA instruction dataset.

```python
model = tp.tensor_parallel(model, sharded=True)
```

I just use it like this. I have 1) a server with 2x A100 80GB and 2) another server with 4x A100 80GB, but when I run the code on either server 1) or server 2), it raises an error like this:

```
Model parameters were moved to incorrect devices, did call on model.cuda() or model.to(device)? If so, please avoid doing that
```

Why?