Open UmutAlihan opened 4 months ago
I ran into the same issue. For some reason the backbone.unembed.weight parameter is not included in the default device map. I got it working with a custom device map, built like this:
import json

# DEFAULT_DEVICE_MAP is assumed to be defined elsewhere as the path to the
# default device map JSON file.

def make_new_device_map(num_devices: int, out_map_file: str):
    # Read in the default device map as the basis for the new one
    with open(DEFAULT_DEVICE_MAP, 'r') as indm:
        device_map = json.load(indm)
    # Collect all blocks (the first three components of each layer name),
    # preserving their order
    device_modules = {}
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_modules[module] = None
    # The unembed layer is missing from the default map, so add it explicitly
    device_modules['backbone.unembed'] = None
    num_modules = len(device_modules)
    # Distribute the blocks evenly across the available devices
    for i, key in enumerate(device_modules.keys()):
        device_modules[key] = i * num_devices // num_modules
    # Assign individual layers to devices (all within a block share the same device)
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_map[layer_name] = device_modules[module]
    device_map['backbone.unembed.weight'] = device_modules['backbone.unembed']
    # Write the new device map out as JSON
    with open(out_map_file, 'w') as outdm:
        json.dump(device_map, outdm)
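
To show how the even split behaves, here is a minimal standalone sketch that mirrors the index computation in the function above. The helper name `assign_blocks` and the block names are hypothetical, made up for illustration; they are not taken from the real default device map.

```python
def assign_blocks(block_names, num_devices):
    """Map each block name to a device index, keeping neighbouring blocks together."""
    num_blocks = len(block_names)
    return {
        # Integer arithmetic: block i goes to device floor(i * num_devices / num_blocks)
        name: i * num_devices // num_blocks
        for i, name in enumerate(block_names)
    }


# Hypothetical block names: four transformer blocks plus the unembed layer
blocks = [f"backbone.blocks.{i}" for i in range(4)] + ["backbone.unembed"]
placement = assign_blocks(blocks, num_devices=2)
print(placement)
```

With five blocks and two devices, the first three blocks should land on device 0 and the last two (including the unembed layer) on device 1. When the count does not divide evenly, the lower-numbered devices get the extra block.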
Then load the new JSON device map and pass it as the device_map argument of load_checkpoint_and_dispatch().
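
A rough sketch of that last step, assuming the accelerate package is installed and that `model` was created under `init_empty_weights()`. The helper names `load_device_map`, `uncovered_params` and `dispatch_with_map` are my own, not part of accelerate. The coverage check relies on my understanding that a device map key covers every parameter whose name starts with that key.

```python
import json


def load_device_map(map_file):
    """Read a JSON device map from disk and return it as a plain dict."""
    with open(map_file, "r") as f:
        return json.load(f)


def uncovered_params(param_names, device_map):
    """Return the parameter names that no key in device_map covers."""
    def covered(name):
        # Assumption: a key covers itself and everything nested below it;
        # the empty key "" stands for the whole model.
        return any(
            key == "" or name == key or name.startswith(key + ".")
            for key in device_map
        )
    return [name for name in param_names if not covered(name)]


def dispatch_with_map(model, checkpoint, map_file):
    """Load the checkpoint and place each module on the device named in the JSON map."""
    # Imported lazily so the helpers above can be used without accelerate installed
    from accelerate import load_checkpoint_and_dispatch

    device_map = load_device_map(map_file)
    missing = uncovered_params([n for n, _ in model.named_parameters()], device_map)
    if missing:
        raise ValueError(f"Device map does not cover: {missing}")
    return load_checkpoint_and_dispatch(model, checkpoint, device_map=device_map)
```

Running `uncovered_params` against the default device map first should report `backbone.unembed.weight`, if that parameter really is the one missing.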
Hi,
I get the error below when trying to load the model on my 2x RTX 3060 with device_map="auto":
My code is:
What could be the root cause here, and what are possible solutions?
Any help is much appreciated. Thanks!
Here is the full stderr output: