Closed — cocoshe closed this 1 month ago
Hi @cocoshe, this is intended here so that you can enable tensor parallelism when the accelerate process count is 1. When you use more than one process, the accelerator handles the devices and distributes the model across them.
Thanks for your reply! But I'm curious: if we set num_processes > 1 (which means we want to use multiple GPUs), and the `device_map` is set to something specific like `cuda:0`, can we still load the model onto multiple GPUs by passing such a specific `device_map` to `from_pretrained`?
I think I know what you mean: when num_processes = 1, we pass the `device_map` to decide the placement strategy ourselves manually, and when num_processes > 1, we let accelerate handle the strategy automatically, so we don't need to pass the `device_map`. Is that how you designed the code?
Yes, when we set num_processes > 1, each device holds one full copy of the model weights and they run in parallel; accelerate handles the device index for you. If you don't have enough memory on a single GPU to hold the model, you can set num_processes = 1 and device_map = auto to enable tensor parallelism, or pass any other device map you want.
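Concretely, the two modes described above might be launched roughly like this (a sketch only — the checkpoint name, task, and exact flags are illustrative, modeled on the repr script linked later in this thread):

```shell
# Mode 1: several processes, one GPU per process. Accelerate pins each
# process to its own device, so no device_map is passed.
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mme --batch_size 1

# Mode 2: a single process with device_map=auto. The model is sharded
# across the available GPUs when it is too big for a single one.
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b",device_map=auto \
    --tasks mme --batch_size 1
```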
Thanks for your explanation and great work!
I've just committed a fix for the reproduce doc: it corrects a typo, and it adds the `device_map` arg, which is needed since num_processes is 1 in the script; otherwise there will be an error like the issue I mentioned before, which you also helped solve.
So it is better to add the `device_map` to the script, or it may confuse followers who don't know much about the accelerate and transformers libraries.
When reproducing the llava model evaluation, errors may occur like in this issue:
This is because we don't pass the `device_map` according to the repr script here: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/9dfb53afb49adb6d20607bc3ca989591969f5432/miscs/repr_scripts.sh#L14 which leads to the error in the llava model code: https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/9dfb53afb49adb6d20607bc3ca989591969f5432/lmms_eval/models/llava.py#L68-L73
It seems that when num_processes > 1, it runs into the `if` branch and selects a specific cuda device, `cuda:0` for example. And when num_processes is 1 (which is what the reproduce script does), it requires the user to define a `device_map` strategy, `auto` for example. If we don't pass the `device_map`, its default value is "" (empty). So the logic feels inverted: intuitively, when we use multiple GPUs we would need to set a `device_map`, and when we only use one GPU (num_processes=1) the model would simply sit on a specific cuda device — yet the code does the opposite.
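To make the branch logic above concrete, here is a small paraphrase (not the actual lmms-eval code — the function and parameter names are hypothetical) of the device-selection behaviour described: multiple processes pin each rank to its own GPU, while a single process falls back to the user-supplied `device_map`, which defaults to the empty string and therefore fails if not set explicitly.

```python
def select_device(num_processes: int, local_rank: int = 0, device_map: str = "") -> str:
    """Sketch of the device-selection logic discussed in this thread."""
    if num_processes > 1:
        # Multi-process run: accelerate handles placement, so each
        # process is pinned to its own GPU, e.g. cuda:0, cuda:1, ...
        return f"cuda:{local_rank}"
    if device_map:
        # Single process: the user decides the strategy, e.g. "auto"
        # to shard the model across GPUs (tensor parallel style).
        return device_map
    # The repr script passes no device_map, so the default "" falls
    # through neither branch — this is the failure mode reported above.
    raise ValueError("num_processes == 1 requires an explicit device_map, e.g. 'auto'")

print(select_device(8, local_rank=3))        # prints "cuda:3"
print(select_device(1, device_map="auto"))   # prints "auto"
```

Passing `device_map=auto` in the script (as the committed fix does) is what keeps the single-process path out of the error branch.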
By the way, there's also a typo in the script doc: `--log_samples_sufix` should be `--log_samples_suffix`.