I've tried to convert the model from HF to GGML format and got an error:

```
Loading model: ../starcoderbase_int8
Loading checkpoint shards: ...
Traceback (most recent call last):
File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2901, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3258, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param, fp16_statistics=fp16_statistics)
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 10.90 GiB total capacity; 9.21 GiB already allocated; 568.69 MiB free; 9.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
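The error message itself suggests tuning `max_split_size_mb`, which as far as I understand only mitigates fragmentation and can't help if the weights genuinely don't fit on the card, but for completeness this is what I believe that would look like (the 128 value is just a guess):

```python
import os

# Must be set before CUDA is initialized, i.e. before the first allocation.
# This only reduces fragmentation-related OOMs; it cannot shrink a model
# that simply doesn't fit in GPU memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```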
Next, I tried to force it to run on the CPU, and got this:

```
Loading model: ../starcoderbase_int8
Traceback (most recent call last):
File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2370, in from_pretrained
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.
```
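My guess is that transformers sees a `quantization_config` in the saved checkpoint and therefore insists on bitsandbytes, which requires CUDA. A quick way to check whether that's the case (the path is just my local one):

```python
import json

# If the saved checkpoint carries a bitsandbytes quantization_config,
# transformers will refuse to load it without a GPU.
with open("../starcoderbase_int8/config.json") as f:
    cfg = json.load(f)

print(cfg.get("quantization_config"))  # non-None means an int8/bnb checkpoint
```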
For me, the main reason to go with the GGML implementation is that I can't fit the model in my GPU. I thought I could perform both the conversion and the inference using only the CPU and system RAM. Am I doing something wrong specifically, or did I get it wrong in general?
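Concretely, is the fix to start from the original unquantized checkpoint, load it entirely on the CPU, and let GGML handle quantization afterwards? A minimal sketch of what I have in mind (the model id and dtype are assumptions on my part, not something from the script):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Load the original fp16/fp32 weights on the CPU only: no bitsandbytes,
# no GPU involved. GGML can then do its own quantization on the converted file.
model_name = "bigcode/starcoderbase"  # assumed upstream HF repo, not my int8 copy
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

Then I'd point convert-hf-to-ggml.py at that checkpoint and, if the repo ships a quantize tool, quantize the resulting GGML file as a separate step.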