evo-design / evo

Biological foundation modeling from molecular to genome scale

issue on model.to("cuda") with device_map="auto" #61

Open UmutAlihan opened 4 months ago

UmutAlihan commented 4 months ago

Hi,

I am getting the error below while trying to load the model on my 2x RTX 3060 GPUs using the device_map="auto" parameter:

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1395, in check_device_map(model, device_map)
   1393 if len(all_model_tensors) > 0:
   1394     non_covered_params = ", ".join(all_model_tensors)
-> 1395     raise ValueError(
   1396         f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
   1397     )

ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight

My code is:

from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'
#model_name = "togethercomputer/evo-1-131k-base"

model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
model_config.use_cache = True

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
    cache_dir="/llms/evo",
    low_cpu_mem_usage=True,
    device_map="auto",  # only changed here from the repo code, so that the weights are distributed across multiple GPUs
)

What could be the root cause here, and what are possible approaches to solving it?

Any help is much appreciated. Thanks

Here is the full stderr output:

Loading checkpoint shards: 100%|████████████████| 3/3 [00:03<00:00, 1.11s/it]
Some weights of StripedHyenaModelForCausalLM were not initialized from the model checkpoint at togethercomputer/evo-1-8k-base and are newly initialized: ['backbone.unembed.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 9
      6 model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
      7 model_config.use_cache = True
----> 9 model = AutoModelForCausalLM.from_pretrained(
     10     model_name,
     11     config=model_config,
     12     trust_remote_code=True,
     13     revision="1.1_fix",
     14     cache_dir="/media/raid/llms/evo",
     15     low_cpu_mem_usage=True,
     16     device_map="auto"
     17 )

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    556 else:
    557     cls.register(config.__class__, model_class, exist_ok=True)
--> 558 return model_class.from_pretrained(
    559     pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
    560 )
    561 elif type(config) in cls._model_mapping.keys():
    562     model_class = _get_model_class(config, cls._model_mapping)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/modeling_utils.py:3820, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3818 device_map_kwargs["force_hooks"] = True
   3819 if not is_fsdp_enabled() and not is_deepspeed_zero3_enabled():
-> 3820     dispatch_model(model, **device_map_kwargs)
   3822 if hf_quantizer is not None:
   3823     hf_quantizer.postprocess_model(model)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/big_modeling.py:351, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
    317 """
    318 Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
    319 the CPU or even the disk.
    (...)
    348 single device.
    349 """
    350 # Error early if the device map is incomplete.
--> 351 check_device_map(model, device_map)
    353 # for backward compatibility
    354 is_bnb_quantized = (
    355     getattr(model, "is_quantized", False) or getattr(model, "is_loaded_in_8bit", False)
    356 ) and getattr(model, "quantization_method", "bitsandbytes") == "bitsandbytes"

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1419, in check_device_map(model, device_map)
   1417 if len(all_model_tensors) > 0:
   1418     non_covered_params = ", ".join(all_model_tensors)
-> 1419     raise ValueError(
   1420         f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
   1421     )

ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight
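Since the error only complains that the auto-generated map gives no device to backbone.unembed.weight, one possible direction is to build the device map explicitly and pin that weight before loading. This is a minimal sketch of the idea, untested on this model; init_empty_weights and infer_auto_device_map are standard accelerate APIs, and assigning the weight to GPU 0 is an assumption:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")

# Build the model skeleton on the meta device so no real memory is allocated
with init_empty_weights():
    meta_model = AutoModelForCausalLM.from_config(model_config, trust_remote_code=True)

# Let accelerate propose a map, then cover the parameter the auto map misses
device_map = infer_auto_device_map(meta_model)
device_map["backbone.unembed.weight"] = 0  # assumption: GPU 0 has room for it

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=model_config,
    trust_remote_code=True,
    revision="1.1_fix",
    low_cpu_mem_usage=True,
    device_map=device_map,
)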
mbi2gs commented 4 months ago

I ran into the same issue. For some reason the backbone.unembed.weight parameter is not included in the default device map. I got it working with a custom device map built like the following:

import json

import numpy as np

# Path to the default device map JSON (adjust to wherever your copy lives)
DEFAULT_DEVICE_MAP = "device_map.json"

def make_new_device_map(num_devices: int, out_map_file: str):
    # Read in the default device map as the basis for the new one
    with open(DEFAULT_DEVICE_MAP, 'r') as indm:
        device_map = json.load(indm)

    # Collect the distinct blocks (first three name components of each layer),
    # plus the unembed module that the default map leaves out
    device_modules = {}
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_modules[module] = None
    device_modules['backbone.unembed'] = None
    num_modules = len(device_modules)

    # Distribute the blocks evenly across as many devices as available
    even_split = num_modules / num_devices
    for i, key in enumerate(device_modules.keys()):
        device_modules[key] = int(np.floor(i / even_split))

    # Assign individual layers to devices (all layers within a block share the same device)
    for layer_name in device_map.keys():
        module = '.'.join(layer_name.split('.')[:3])
        device_map[layer_name] = device_modules[module]
    device_map['backbone.unembed.weight'] = device_modules['backbone.unembed']

    with open(out_map_file, 'w') as outdm:
        json.dump(device_map, outdm)

And then you supply the new JSON device map to the load_checkpoint_and_dispatch() function.
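For reference, here is roughly how the generated map can be consumed; a minimal sketch assuming the checkpoint shards are already downloaded to a local folder (the "checkpoint_dir" path is hypothetical) and that new_device_map.json was written by the function above:

import json

from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

model_name = 'togethercomputer/evo-1-8k-base'
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")

# Build the custom map and read it back in
make_new_device_map(num_devices=2, out_map_file="new_device_map.json")
with open("new_device_map.json") as f:
    device_map = json.load(f)

# Instantiate the model skeleton without allocating memory, then dispatch the shards
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(model_config, trust_remote_code=True)
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="checkpoint_dir",  # hypothetical: local folder with the downloaded shards
    device_map=device_map,
)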