MLX inference error with BFloat16

beebopkim commented 9 months ago

from commit hash 6a49341eb6d0b9f06eb490dc285abe69e3bef5fc:

What I have done was executing generate.py for a mamba fine-tuned model - kuotient/mamba-ko-2.8b, and below error was happened. How can I deal with this error?

(venv_mamba_py) ******@Mac-Studio-2022-01 scripts % python generate.py --prompt="Mamba is a type of" --hf_model_name="kuotient/mamba-ko-2.8b" --n_tokens=100

Traceback (most recent call last):
  File "/Users/******/test/mamba.py/mlx/scripts/generate.py", line 31, in <module>
    model = MambaLM.from_pretrained(args.hf_model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/******/test/mamba.py/mlx/mamba_lm_mlx.py", line 150, in from_pretrained
    mlx_state_dict = map_mambassm_torch_to_mlx(state_dict)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/******/test/mamba.py/mlx/utils.py", line 53, in map_mambassm_torch_to_mlx
    return map_mambapy_torch_to_mlx(new_state_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/******/test/mamba.py/mlx/utils.py", line 37, in map_mambapy_torch_to_mlx
    new_state_dict[key] = value.numpy()
                          ^^^^^^^^^^^^^
TypeError: Got unsupported ScalarType BFloat16

My environments are:

M1 Max Mac Studio 2022
- 10 CPUs, 32 GPUs, with 64GB of memory(up to 48GB of VRAM)
macOS Sonoma 14.2.1
Python 3.11.7 from homebrew
- torch 2.3.0.dev20240126 from https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
- mlx 0.0.11
- transformers 4.37.1

alxndrTL commented 9 months ago

Hello, it seems the problem arises during the conversion of the weights (from torch to MLX) : the weights are converted (temporary) to numpy. (must be, cf MLX docs) However, numpy doesn't support bfloat16, so it throws an error.

To avoid this, we will work with float16 in numpy.

First, in utils.py, in the ‎map_mambapy_torch_to_mlx function, replace this line :

new_state_dict[key] = value.numpy()

by this line :

new_state_dict[key] = value.half().numpy()

This should save the .npz weight file in float16. When loading the file in the MLX Mamba model, it should also load it as float16.

For inference with bfloat16, MLX doesn't seem to support it yet : https://ml-explore.github.io/mlx/build/html/python/data_types.html?highlight=types

Hope this helps!

beebopkim commented 9 months ago

I run the generate.py with your suggestion, and an another error was happened.

Traceback (most recent call last):
  File "/home/******/test/mamba.py/mlx/scripts/generate.py", line 31, in <module>
    model = MambaLM.from_pretrained(args.hf_model_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/******/test/mamba.py/mlx/mamba_lm_mlx.py", line 150, in from_pretrained
    mlx_state_dict = map_mambassm_torch_to_mlx(state_dict)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/******/test/mamba.py/mlx/utils.py", line 53, in map_mambassm_torch_to_mlx
    return map_mambapy_torch_to_mlx(new_state_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/******/test/mamba.py/mlx/utils.py", line 32, in map_mambapy_torch_to_mlx
    value = torch_to_mlx_depthwise_weights(value)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/******/test/mamba.py/mlx/misc.py", line 101, in torch_to_mlx_depthwise_weights
    mlx_weights[indices, :, indices] = torch_weights[:, :, 0]
    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Index put requires the source and destination dtypes match, got Float for the destination and BFloat16 for the source.

I changed the line 101 of misc.py like below:

    mlx_weights[indices, :, indices] = torch_weights[:, :, 0].float()

and run it again, then it looked like working but had some problems.

(venv_mamba_py) ******@Mac-Studio-2022-01 scripts % python generate.py --prompt="Mamba is a type of" --hf_model_name="kuotient/mamba-ko-2.8b" --n_tokens=100

Mamba is a type of condition that causes temperature���� 发��를 �������니다. lieber15,ock, & &   liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation liberation%  
(venv_mamba_py) ******@Mac-Studio-2022-01 scripts % python generate.py --prompt="내가 살던 고향은" --hf_model_name="kuotient/mamba-ko-2.8b" --n_tokens=100

��가 ������ ������은 ����� ��은 �������� 이�������이 ����� ���는 �������니다. ��는 ��� ��� 하����니다. 하지�� ��는 �������� 이�������이 �����한 �����을 ������%

Actually the model kuotient/mamba-ko-2.8b is a Korean fine-tuned, and its output with mlx looks totally broken.

On the other hand, when I run an example on the CUDA environment, it worked very well.

(venv_mamba_py) ******@************:~/test/mamba.py$ python example_mamba_llm.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
내가 살던 고향은 스위스입니다. 스위스는 알프스 산맥에 위치한 작은 나라이지만 아름다운 자연과 풍부한 문화를 자랑합니다. 저는 스위스에서 다양한 농업을 경험했는데, 농업은 스위스 경제에서 중요한 역할을 합니다. 스위스의 주요 농업 산물은 알프스 산맥에서 재배되는 밀, 호밀, 크루팅단테와 chihai와 같은 곡물과 이탈리아 튜브스테이크, 스위스 햄, 프라스타로니친 등의 유제품입니다. 스위스는 또한 유기농 농업과 유제품 수출로 유명합니다. 농업은 스위스의 지역 사회에 다양한 영향을 미칩니다. 농업은 일자리를 창출하고 지역 사회에 수입을 제공합니다. 또한 농업은 식량 안보를 증진하고 환경을 보호하는 데 기여합니다.

스위스 정부는 지역 농업을 지원하기 위해 다양한 정책을 시행하고 있습니다. 이러한 정책에는 농업 연구 개발에 대한 지원, 농업 프로젝트에 대한 보조금 제공, 농산물의 판매를 촉진하기 위한 마케팅 캠페인 등이 포함됩니다. 또한 100명의 직업 교육 화학을 처음 시작한 유기농 생산자들은 자신 만의 성공 이야기를 들려주었습니다. 그들은 처음에는 시작하기 위해 많은 노력을 기울여야 했으며, 많은 어려움을 겪었지만, 점차적으로 성공을 거두었습니다. 그들이 성공하기 위해서는 처음에는 농업에 대한 강한 열정과 끈기가 필요했습니다. 또한 지원해줄 수 있는 여러 교육 기관과 다른 농부들로
(venv_mamba_py) ******@************:~/test/mamba.py$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
(venv_mamba_py) ******@************:~/test/mamba.py$

# example_mamba_llm.py
import torch

from transformers import AutoTokenizer

from mamba_lm import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"

#kuotient/mamba-ko-2.8b
#model = from_pretrained('state-spaces/mamba-130m').to(device)

model = from_pretrained('kuotient/mamba-ko-2.8b').to(device)
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

output = model.generate(tokenizer, "내가 살던 고향은", num_tokens=1000)

print(output)

alxndrTL commented 9 months ago

Thanks for the details. The output you're seeing is weird, have you tried prompting the model with korean like you did in example_mamba_llm.py ? I noticed that you called generate.py with the english prompt "Mamba is a type of"...

If this doesn't still work and you keep seeing these characters : � , maybe its due to the printing ? Or some UTF-8 stuff ? In generate.py, I simply wrote :

print(token, end='', flush=True)

but maybe this doesn't support printing korean characters ?

beebopkim commented 9 months ago

As I investigated, I found that the only differences on tokenizers between EleutherAI/gpt-neox-20b and kuotient/mamba-ko-2.8b are that kuotient/mamba-ko-2.8 added a special token '[PAD]' ("id": 50277) and its tokenizer model settings were different in only one place against EleutherAI/gpt-neox-20b: "byte_fallback": false. So I guess the tokenizer model in example_mamba_llm.py for my linux machine is just okay for both EleutherAI/gpt-neox-20b and kuotient/mamba-ko-2.8.

And on my Mac Studio with MLX, korean prompt made a broken result too. I changed tokenizer model as EleutherAI/gpt-neox-20b and tried again, it was still broken. The environment variable LC_CTYPE is set as UTF-8, so if the character codes printed by generate.py fit in UTF-8, terminal will render the codes into readable character images instead of printing question marks.

(venv_mamba_py) ******@Mac-Studio-2022-01 scripts % python generate.py --prompt="내가 살던 고향은" --hf_model_name="kuotient/mamba-ko-2.8b" --n_tokens=100

��가 ������ ������은 ������고 ������한 ���이����니다. �������� ����� ���, ���을 ��고 �������� �����로 ���어서�� �������니다. ��� ��� ��는 �����한 ���������� �%                                                                                                                                                       (venv_mamba_py) ******@Mac-Studio-2022-01 scripts %

alxndrTL commented 9 months ago

That's very weird that the tokenizers behave differently from your linux machine to your Mac ... interesting.. We don't even know if MLX has something to do with it... probably not Maybe, if we want to be sure that the tokenizer is at fault here, you can check the differences between the tokenizer in linux and Mac :

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')

tokenizer.encode(some_string)

with some korean string, and see if the token ids are the same..

If they are the same, you can try passing the same prompt through the model (torch and MLX) and compare the output token ids (with setting sample=False to get the same results).

(i currently don't have access to a Silicon equipped Mac)

beebopkim commented 9 months ago

I tinkered mamba_lm_mlx.py to get generated token IDs before decoding, acquired an token ID array from Apple Silicon, put into my Linux machine, and tried to decode with the tokenizer. It produced a flawless korean string, huh???

I investigated generate functions in mamba_lm_mlx.py and mamba_lm.py, and found that the generate function in mamba_lm_mlx.py uses yield, the generator in Python.

So, I removed the generator in the generate function, amended generate.py and the final results is this:

(venv_mamba_py) ******@Mac-Studio-2022-01 scripts % python generate.py --prompt="내가 살던 고향은" --hf_model_name="kuotient/mamba-ko-2.8b" --n_tokens=100
내가 살던 고향은 세상에서 가장 아름다운 곳이었습니다. 그것은 맑은 물, 푸른 하늘, 그리고 숨이 멎을 만큼 아름다운 산과 숲으로 가득 차 있었습니다.
(venv_mamba_py) ******@Mac-Studio-2022-01 scripts %

It looks like that the tokenizer.decode produces characters byte by byte, but when used by the Python generator, it seems to be unable to handle multibyte UTF-8 characters like korean, japanese, chinese, thai, cambodian, hindi, and etc. Those characters are rendered in 3 byte length in UTF-8.

alxndrTL commented 9 months ago

Oh, ok, it makes sense now. I used the python generator yield similar to what was done in the MLX docs (inference training example) to have a script that streams tokens instead of printing them all at once at the end. So the printing must NOT be done token by token.

alxndrTL / mamba.py

MLX inference error with BFloat16 #6