Support for Apple silicon

rickardp commented 1 year ago

Would it make sense for this library to support platforms other than cuda on x64 Linux? I am specifically looking for Apple silicon support. Currently not even cpuonly works since it assumes SSE2 support (Even without Neon. Support).

i would guess that the first step would be a full cross platform compile (arm64), then ideally support for Metal Performance Shaders as an alternative to CUDA (assuming it is at all feasible).

I could probably contribute some towards support if there is interest for bitsandbytes to be multi platform. I have some experience setting up cross platform Python libraries.

TheStoneMX commented 1 year ago

Hi there, I will contribute too, in order to get it to work on Metal Apple M1

this is my trace:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen([/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so), 0x0006): tried: '[/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (not a mach-o file), '[/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (no such file), '[/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (not a mach-o file)
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen([/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so), 0x0006): tried: '[/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (not a mach-o file), '[/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/System/Volumes/Preboot/Cryptexes/OS/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (no such file), '[/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so)' (not a mach-o file)
[/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/cextension.py:31](https://file+.vscode-resource.vscode-cdn.net/Users/raziel/miniconda3/envs/nlp/lib/python3.9/site-packages/bitsandbytes/cextension.py:31): UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
--------------------------------------------------------------------------------------------
# What version of Python do you have?
import sys
import platform
import torch

has_gpu = torch.cuda.is_available()
has_mps = getattr(torch,'has_mps',False)
print('has_mps', has_mps)
device = "mps" if getattr(torch,'has_mps',False) \
    else "gpu" if torch.cuda.is_available() else "cpu"

print(f"Python Platform: {platform.platform()}")
print(f"PyTorch Version: {torch.__version__}")
print()
print(f"Python {sys.version}")
print("GPU is", "available" if has_gpu else "NOT AVAILABLE")
print("MPS (Apple Metal) is", "AVAILABLE" if has_mps else "NOT AVAILABLE")
print(f"Target device is {device}")
----------------------------------------------------------------------------------
has_mps True
Python Platform: macOS-13.3-arm64-arm-64bit
PyTorch Version: 2.0.0

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:38:11) 
[Clang 14.0.6 ]
GPU is NOT AVAILABLE
MPS (Apple Metal) is AVAILABLE
Target device is mps

rickardp commented 1 year ago

Nice to hear! It would be good to hear from the maintainers that they are at all interested in making this package cross-platform. It is very much CUDA focused at the moment.

Getting libbitsandbytes_cpu.so to compile for macOS arm64 was not at all difficult, just an exercise in moving around some #ifdefs, but CPU support would obviously need to add Neon (SIMD) to make any sense IMHO. Then, of course the MPS support would be needed at one point (though I expect it's quite a lot more work).

I've just started looking at the unit tests and the Python libraries.

The C++ code is quite nicely structured, but the Python code would need some refactoring since most of the calls assume CUDA (x.cuda() instead of x.to(device), etc). Also, since the CPU version does not cover 100% of the feature set, testing is going to be quite some work as there is no real baseline. I suppose one question is if it would make sense to make the CPU cover 100% of the API calls, even if inefficient, just to provide a baseline that the GPU implementations could compare against?

If pursuing this, I propose implementing cross-platform CPU support first, then tackling MPS. MPS is of course what makes it useful.

(I have the exact same setup BTW, 2021 MBP)

Edit: Specifically, here's how I imagine the unit tests would have to work https://github.com/TimDettmers/bitsandbytes/pull/257/files#diff-659bad232c71219167252c1a5ccbc427b6f54925b78741df18613c3c49aaa4c1R153

So at least one CPU test pass on my M1 Mac :)

janrinze commented 1 year ago

please have a look at Building on Jetson AGX Xavier Development Kit fails #221 It addresses the same AArch64 issue but on CUDA supported platforms like NVidia Jetson.

UserHIJ commented 1 year ago

Wow .. not to be inflammatory , but are we saying that there's no immediate solution for this if you have any macbook in the last like .. 5 years? Yuck.

janrinze commented 1 year ago

https://en.wikipedia.org/wiki/Apple_M1 introduced less than 3 years ago. Things take time in the world of open-source. Specially when using hardware such as Apple.

KotlinFactory commented 1 year ago

when will this be done?

benjaminhuo commented 1 year ago

Would it make sense for this library to support platforms other than cuda on x64 Linux? I am specifically looking for Apple silicon support. Currently not even cpuonly works since it assumes SSE2 support (Even without Neon. Support).

i would guess that the first step would be a full cross platform compile (arm64), then ideally support for Metal Performance Shaders as an alternative to CUDA (assuming it is at all feasible).

I could probably contribute some towards support if there is interest for bitsandbytes to be multi platform. I have some experience setting up cross platform Python libraries.

Looking forward to the support for this too, got the below errors when I tried to fine-tune llama2 7B with load_in_8bit=True enabled on my Macbook M2, PyTorch‘s support to MPS is getting better and I hope this project could support this as well:

  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 293, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

AlexandreCassagne commented 1 year ago

@benjaminhuo Getting the same issue as you.

id4thomas commented 1 year ago

  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 293, in forward
    using_igemmlt = supports_igemmlt(A.device) and not state.force_no_igemmlt
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 226, in supports_igemmlt
    if torch.cuda.get_device_capability(device=device) < (7, 5):
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/Users/ben/opt/miniconda3/envs/finetune/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/autograd/_functions.py#L227

This seems to be due to calling torch.cuda even if the device type isn't cuda. One way to patch these unchecked torch.cuda calls is adding device checks like

if device.type != 'cuda':
    return False

mps returns "mps" as device.type

pechaut78 commented 11 months ago

same issue here, MPS seems to be the problem

ProjectProgramAMark commented 11 months ago

getting same issue with apple silicon. would love to see some support for it soon!

ivan-digital commented 10 months ago

Same issue. Would be nice to have support for MPS.

ageorgios commented 10 months ago

Same here, please have support for MPS https://github.com/ml-explore/mlx

592319702 commented 9 months ago

(torch-gpu) I542464@DY4GPKX1J0 test % python3 fine_tune_llama_2_in_google_colab.py /Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. warn("The installed version of bitsandbytes was compiled without GPU support. " 'NoneType' object has no attribute 'cadam32bit_grad_fp32' Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.06s/it] /Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead. warnings.warn( /Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:159: UserWarning: You didn't pass amax_seq_lengthargument to the SFTTrainer, this will default to 1024 warnings.warn( 0%| | 0/250 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using thecallmethod is faster than using a method to encode the text followed by a call to thepadmethod to get a padded encoding. use_cache=Trueis incompatible with gradient checkpointing. Settinguse_cache=False... /Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn( FP4 quantization state not initialized. Please call .cuda() or .to(device) on the LinearFP4 layer first. Traceback (most recent call last): File "/Users/I542464/test/fine_tune_llama_2_in_google_colab.py", line 229, in <module> trainer.train() File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train return inner_training_loop( ^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop tr_loss_step = self.training_step(model, inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 2654, in training_step loss = self.compute_loss(model, inputs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/trainer.py", line 2679, in compute_loss outputs = model(**inputs) ^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/peft_model.py", line 922, in forward return self.base_model( ^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward output = old_forward(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward outputs = self.model( ^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward output = old_forward(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward layer_outputs = torch.utils.checkpoint.checkpoint( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_compile.py", line 24, in inner return torch._dynamo.disable(fn, recursive)(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 17, in inner return fn(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint return CheckpointFunction.apply(function, preserve, *args) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/autograd/function.py", line 539, in apply return super().apply(*args, **kwargs) # type: ignore[misc] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 230, in forward outputs = run_function(*args) ^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward return module(*inputs, output_attentions, None) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward output = old_forward(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward hidden_states, self_attn_weights, present_key_value = self.self_attn( ^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward output = old_forward(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 305, in forward query_states = self.q_proj(hidden_states) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/peft/tuners/lora.py", line 1123, in forward result = super().forward(x) ^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/nn/modules.py", line 221, in forward out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/I542464/miniconda3/envs/torch-gpu/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py", line 567, in matmul_4bit assert quant_state is not None ^^^^^^^^^^^^^^^^^^^^^^^ AssertionError 0%| | 0/250 [00:01<?, ?it/s]

mbtre commented 8 months ago

+1 MPS support would be absolutely great!

morkapronczay commented 8 months ago

adding a comment to keep this alive. MPS support would be awesome!

rickardp commented 8 months ago

Once the device abstraction has been been merged, we can start adding MPS-accelerated versions of the functions

Satyam7166-tech commented 8 months ago

Once the device abstraction has been been merged, we can start adding MPS-accelerated versions of the functions

Yay. Thanks to all your efforts. One a side note: how can someone be skilled enough to contribute to this stuff? Like what topics should they cover?

sislam-provenir commented 8 months ago

Looking forward to MPS support!

anilkul98 commented 7 months ago

Looking forward to MPS Support!!!!

JohnSilverman commented 7 months ago

looking forward to mps support

Allisterlim commented 7 months ago

Looking forward to mps support!

svnv-svsv-jm commented 7 months ago

Please support MPS.

ashwinrachha786 commented 7 months ago

Please support MPS. Looking forward to it.

Titus-von-Koeller commented 7 months ago

Hey everyone, we're committed to enabling Apple Silicon support. There's a lot of ongoing work to get out of the way to lay the groundwork for this.

We'll keep you posted. Thanks for your interest and support of BNB 🤗

sislam-provenir commented 7 months ago

For the time being, for those on Apple Silicon, who wants to get unblocked asap: you can use MLX to run HuggingFace models locally with GPU and shared memory architecture support.

The mlx-examples repo is a good place to start as it contains:

scripts that download models from HuggingFace and converts the weight tensors to MLX format.
integration between MLX Model and HuggingFace tokenizers using AutoTokenizer.from_pretrained
Supports model fine-tuning with LoRA and quantization (QLoRA)

Akossimon commented 4 months ago

adding a comment to keep this alive. MPS support would be awesome!

.... some other blogs i found on this topic, i hope if this can be maybe of interest to anyone here:

solution 1 ?: Option to choose device (mps instead of cuda) in lora training on Mac Silicon M1 #577 https://github.com/bmaltais/kohya_ss/issues/577
solution 2 ? : https://www.reddit.com/r/StableDiffusion/comments/140c3z0/train_a_lora_on_mac_os/
- solution 3 ? : June 20, 2023 …. Training Stable Diffusion LoRA on Apple Silicon M2 Mac GPUs (Metal) Environment install https://www.reallyar.com/training-stable-diffusion-lora-on-apple-silicon-m2-mac-gpus-metal/
solution 4 ?: Mac M2 Lora accelerate config https://www.bilibili.com/read/cv26915695/
solution 5 ?: https://huggingface.co/docs/diffusers/en/optimization/mps Metal Performance Shaders (MPS)
solution 6 ?: he states: ..."After many research, I finally found a way of installing an access to the gpu from pytorch as mentioned on this page : https://towardsdatascience.com/installing-pytorch-on-apple-m1-chip-with-gpu-acceleration-3351dc44d67c 35 in order to be able to use the GPU on my macbook. I now have a True return to the following : print(torch.backends.mps.is_built()) which…" ...

although i have no clue if these links can lead to any help in this at all...

dasdipanjan04 commented 4 months ago

Same problem

UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'

while trying to use Blip2ForConditionalGeneration.from_pretrained with BitsAndBytesConfig(load_in_8bit=True) obviously getting the following error as it currently does not support usage of M1 or M2.

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` 
and the latest version of bitsandbytes: `pip install -U bitsandbytes`

After I tried to test with the following modification in the quantizer_bnb_8bit.py by applying torch.backends.mps.is_available()

def validate_environment(self, *args, **kwargs):
        if not torch.cuda.is_available() and not torch.backends.mps.is_available():
            raise RuntimeError("No GPU found. A GPU is needed for quantization.")
        if not (is_accelerate_available() and is_bitsandbytes_available()):
            raise ImportError(
                "Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` "
                "and the latest version of bitsandbytes: `pip install -U bitsandbytes`"
            )

       .....
       .....

I still get the same error that:

ImportError: Using `bitsandbytes` 8-bit quantization requires Accelerate: `pip install accelerate` 
and the latest version of bitsandbytes: `pip install -U bitsandbytes`

I already have accelerate and bitsandbytes installed but I guess the error is obviously occurring due to the fact that transformers is not compiled for M2 architecture.

Does anyone have any workaround to circumvent the problem?

Nevertheless, keeping this thread alive so that we can get M1/M2 support from @huggingface