I've tried to convert the model from HF to GGML format and got an error:

```
Loading model: ../starcoderbase_int8
Loading checkpoint shards: ...
Traceback (most recent call last):
File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2901, in from_pretrained
) = cls._load_pretrained_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3258, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 725, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param, fp16_statistics=fp16_statistics)
File "/usr/local/lib/python3.10/dist-packages/transformers/utils/bitsandbytes.py", line 109, in set_module_quantized_tensor_to_device
new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 10.90 GiB total capacity; 9.21 GiB already allocated; 568.69 MiB free; 9.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
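The error message itself suggests tuning `max_split_size_mb`, which as far as I understand only mitigates fragmentation and can't help if the weights genuinely don't fit on the card, but for completeness this is what I believe that would look like (the 128 value is just a guess):

```python
import os

# Must be set before CUDA is initialized, i.e. before the first allocation.
# This only reduces fragmentation-related OOMs; it cannot shrink a model
# that simply doesn't fit in GPU memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```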
Next, I tried to force it to run on the CPU, and got this:

```
Loading model: ../starcoderbase_int8
Traceback (most recent call last):
File "/home/alex/starcoder/starcoder.cpp/convert-hf-to-ggml.py", line 58, in <module>
model = AutoModelForCausalLM.from_pretrained(model_name, config=config, torch_dtype=torch.float16 if use_f16 else torch.float32, low_cpu_mem_usage=True, trust_remote_code=True, offload_state_dict=True)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 493, in from_pretrained
return model_class.from_pretrained(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2370, in from_pretrained
raise RuntimeError("No GPU found. A GPU is needed for quantization.")
RuntimeError: No GPU found. A GPU is needed for quantization.
```
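My guess is that transformers sees a `quantization_config` in the saved checkpoint and therefore insists on bitsandbytes, which requires CUDA. A quick way to check whether that's the case (the path is just my local one):

```python
import json

# If the saved checkpoint carries a bitsandbytes quantization_config,
# transformers will refuse to load it without a GPU.
with open("../starcoderbase_int8/config.json") as f:
    cfg = json.load(f)

print(cfg.get("quantization_config"))  # non-None means an int8/bnb checkpoint
```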
For me, the main reason to go with the GGML implementation is that I can't fit the model in my GPU. I thought I could perform both the conversion and the inference using only the CPU and system RAM. Am I doing something wrong specifically, or did I get it wrong in general?
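Concretely, is the fix to start from the original unquantized checkpoint, load it entirely on the CPU, and let GGML handle quantization afterwards? A minimal sketch of what I have in mind (the model id and dtype are assumptions on my part, not something from the script):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Load the original fp16/fp32 weights on the CPU only: no bitsandbytes,
# no GPU involved. GGML can then do its own quantization on the converted file.
model_name = "bigcode/starcoderbase"  # assumed upstream HF repo, not my int8 copy
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
```

Then I'd point convert-hf-to-ggml.py at that checkpoint and, if the repo ships a quantize tool, quantize the resulting GGML file as a separate step.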