Mahathi-Bhagavatula commented 1 year ago

ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).

Can you please let me know where I went wrong?

srowen commented 1 year ago

Hm, works for me. How are you loading it? Maybe an out-of-date version of transformers?

srowen commented 1 year ago

Oh, hm, I am finding a problem with the pipeline implementation for this model, which may or may not be the same issue. Hold tight. (It has to do with setting the task type to instruction-following.)

srowen commented 1 year ago

Could you check if adding task="text-generation" to your pipeline() call makes it work?

kostecky commented 1 year ago

I am experiencing the same thing, and adding task="text-generation" does not change the error.

I'm on an M1 (Apple silicon) Mac. I found a related post at https://news.ycombinator.com/item?id=35541861:

The error message implies that the compiled default libraries on the M1 don't support the model format, even though it works fine in Paperspace.
    The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
    Traceback (most recent call last):
      File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
        instruct_pipeline = pipeline(
                            ^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
        framework, model = infer_framework_load_model(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
        raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
    ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).

If so, is there a way to get this working on an M1, or is this error independent of the architecture?
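For anyone debugging this: in the code shown in the traceback (pipelines/base.py, infer_framework_load_model), pipeline() tries each candidate class in turn and, as far as I can tell, only raises this generic ValueError after every attempt fails, so the underlying exception is never displayed. A minimal diagnostic sketch, assuming transformers is installed: call AutoModelForCausalLM.from_pretrained directly so the real cause gets printed. `show_load_error` is a hypothetical helper name, and the example call downloads the full checkpoint, so it is left commented out.

```python
import traceback


def show_load_error(model_id="databricks/dolly-v2-12b", **load_kwargs):
    """Hypothetical helper: load the model without pipeline() so the real exception is shown.

    Returns the loaded model, or None if loading failed (the traceback is printed).
    """
    try:
        # Imported lazily so this function can be defined even where transformers is absent.
        from transformers import AutoModelForCausalLM

        return AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    except Exception:
        # Print the full underlying traceback instead of the generic
        # "Could not load model ... with any of the following classes" message.
        traceback.print_exc()
        return None


# Example (downloads the checkpoint, so it is not run here):
# show_load_error("databricks/dolly-v2-12b", device_map="auto", offload_folder="offload")
```

Whatever this prints (running out of memory or disk, an incomplete download, an unknown model type, and so on) is likely to be more actionable than the pipeline message.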

srowen commented 1 year ago

I don't think this will run on Macs. It needs CUDA, etc. If that's the nature of this problem, sorry, it's not going to work.

Davidy22 commented 1 year ago

Messenger pigeoning from another tracker, but someone else and I have both had success with setting torch_dtype on Linux:

import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

oakkas84 commented 1 year ago

I am having the same issue and the above also does not solve it.

Mahathi-Bhagavatula commented 1 year ago

1) I am using a GPU with CUDA installed, not a Mac.
2) Neither torch_dtype nor task="text-generation" worked for me.
3) Loading directly with AutoModelForCausalLM and AutoTokenizer didn't work either.
4) I am using transformers version 4.28.0.dev0.

LeiHao0 commented 1 year ago

It can run on an M1 Max with 64 GB by adding offload_folder="offload".

Something like this:

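from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", offload_folder="offload")
generate_text = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)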

It turns out it needs about 10 GB of swap and takes a few minutes to produce the text.

srowen commented 1 year ago

That's probably not great if you're having to swap. Try a smaller model? There are 6.9B and 2.8B param models now.
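For reference, a minimal loading sketch that combines this suggestion with the dtype and offload options used elsewhere in the thread. Assumptions: transformers and accelerate are installed, `databricks/dolly-v2-2-8b` is the smaller checkpoint name used later in this thread, and `load_smaller_dolly` is a hypothetical helper. Without torch_dtype, from_pretrained loads weights in float32 by default (about 4 bytes per parameter), so requesting bfloat16 roughly halves weight memory; how well bfloat16 is supported and how fast it runs varies by backend.

```python
def load_smaller_dolly(model_id="databricks/dolly-v2-2-8b", offload_folder="offload"):
    """Hypothetical helper: load a smaller Dolly checkpoint in bfloat16 plus its tokenizer."""
    # Lazy imports so the definition works without the heavy dependencies installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,     # the default would be float32 (about 4 bytes/param)
        device_map="auto",              # requires accelerate
        offload_folder=offload_folder,  # where accelerate may offload weights that don't fit in memory
    )
    return model, tokenizer
```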

kostecky commented 1 year ago

This is great. I wonder whether it's using the GPU at all? Swap will obviously destroy performance. Can you try this? It seems to get further, but I don't have enough RAM and the process gets killed. I'll have to test with a larger swap size. This may force it to use MPS (the Mac GPU), but I'm still trying to figure it out.

from transformers import pipeline, AutoModel
import torch

model = AutoModel.from_pretrained("databricks/dolly-v2-12b")
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
instruct_pipeline = pipeline(model=model, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16)

LeiHao0 commented 1 year ago

I'm also trying to use MPS, but unfortunately I got an error:

RuntimeError: Placeholder storage has not been allocated on MPS device!
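That error usually means one tensor in an operation lives on the MPS device while another is still on the CPU. In the snippet above, model.to(device) moves the weights, but the pipeline is never told about the device, so the tokenized inputs likely stay on the CPU. A minimal sketch of the device-selection logic, written as a pure function so it can be checked without a GPU; `pick_device` and `detect_device` are hypothetical helpers. Passing the result to pipeline(..., device=device) instead of relying on device_map="auto" is one way to keep the model and inputs together, assuming a transformers version whose pipeline() accepts a device string.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a torch device string, preferring CUDA, then Apple's MPS backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query torch (if installed) and delegate to pick_device; falls back to "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps only exists in newer torch builds; guard in case it is missing.
    mps = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps is not None and mps.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


# Usage sketch: the model AND the inputs must end up on the same device, e.g.
#   device = detect_device()
#   model.to(device)
#   inputs = {k: v.to(device) for k, v in tokenizer(prompt, return_tensors="pt").items()}
```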

LeiHao0 commented 1 year ago

@kostecky, Python 3.11 is not supported by macOS Metal yet.

You can downgrade to Python 3.10 and try this:

conda create -n dolly
conda activate dolly
conda install python==3.10.10
pip install tensorflow-macos==2.12.0 tensorflow-metal==0.8.0
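Before recreating the environment, it may be worth confirming that the interpreter is a native arm64 build: as far as I know, an x86_64 Python running under Rosetta may not be able to use the MPS backend. A small pure-Python check; the 3.10 cutoff is taken from the advice above about Metal packages (an assumption about the tooling of that time, not a general rule), and `mac_env_warnings` is a hypothetical helper.

```python
import platform
import sys


def mac_env_warnings(sys_platform: str, machine: str, version: tuple) -> list:
    """Return human-readable warnings about a macOS Python setup intended for MPS use."""
    warnings = []
    if sys_platform != "darwin":
        return warnings  # the checks below only apply to macOS
    if machine != "arm64":
        warnings.append(f"Python is a {machine} build (Rosetta?); a native arm64 Python is recommended for MPS.")
    if tuple(version[:2]) >= (3, 11):
        warnings.append("Python 3.11+: advice in this thread was to use Python 3.10 for the Metal packages.")
    return warnings


# Check the interpreter this script is running under.
for w in mac_env_warnings(sys.platform, platform.machine(), sys.version_info):
    print("warning:", w)
```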

kostecky commented 1 year ago

@LeiHao0 I am on an M1 Pro MacBook Pro (32 GB), using Python 3.10.10, and have tried the 2.8B model to speed up testing and lower memory usage. I'm still having some issues.

pip freeze:

accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
filelock==3.11.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
packaging==23.1
Pillow==9.5.0
psutil==5.9.4
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sympy==1.11.1
tokenizers==0.13.3
torch==2.1.0.dev20230413
torchaudio==2.1.0.dev20230413
torchvision==0.16.0.dev20230413
tqdm==4.65.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15

I have the following code:

from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-2-8b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-2-8b", device_map="auto", offload_folder="offload")

generate_text = pipeline(model=model, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)

and I get the following error:

RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50280, 2560)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(in_features=2560, out_features=7680, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear(in_features=10240, out_features=2560, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=2560, out_features=50280, bias=False)
) is not a valid model_id.

oakkas84 commented 1 year ago

I also tried the code below and got the following error:

from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-2-8b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-2-8b", device_map="auto", offload_folder="offload")

generate_text = pipeline(model=model, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)

RuntimeError                              Traceback (most recent call last)

      5 model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-2-8b", device_map="auto", offload_folder="offload")
      6
----> 7 generate_text = pipeline(model=model, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
      8
      9 text = generate_text("Explain to me the difference between nuclear fission and fusion.")

/opt/conda/lib/python3.8/site-packages/transformers/pipelines/__init__.py in pipeline(task, model, config, tokenizer, feature_extractor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    660     if task is None and model is not None:
    661         if not isinstance(model, str):
--> 662             raise RuntimeError(
    663                 "Inferring the task automatically requires to check the hub with a model_id defined as a `str`."
    664                 f"{model} is not a valid model_id."

RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50280, 2560)
    (layers): ModuleList(
      (0-31): 32 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (attention): GPTNeoXAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): Linear(in_features=2560, out_features=7680, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=2560, out_features=10240, bias=True)
          (dense_4h_to_h): Linear(in_features=10240, out_features=2560, bias=True)
          (act): GELUActivation()
        )
      )
    )
    (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (embed_out): Linear(in_features=2560, out_features=50280, bias=False)
) is not a valid model_id.

oakkas84 commented 1 year ago

Latest error I got:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

      2 import torch
      3
----> 4 model = AutoModel.from_pretrained("databricks/dolly-v2-12b")
      5 device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
      6 print(device)

/opt/conda/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    422         if not isinstance(config, PretrainedConfig):
    423             config, kwargs = AutoConfig.from_pretrained(
--> 424                 pretrained_model_name_or_path, return_unused_kwargs=True, trust_remote_code=trust_remote_code, **kwargs
    425             )
    426         if hasattr(config, "auto_map") and cls.__name__ in config.auto_map:

/opt/conda/lib/python3.6/site-packages/transformers/models/auto/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
    670             return config_class.from_pretrained(pretrained_model_name_or_path, **kwargs)
    671         elif "model_type" in config_dict:
--> 672             config_class = CONFIG_MAPPING[config_dict["model_type"]]
    673             return config_class.from_dict(config_dict, **kwargs)
    674         else:

/opt/conda/lib/python3.6/site-packages/transformers/models/auto/configuration_auto.py in __getitem__(self, key)
    385             return self._extra_content[key]
    386         if key not in self._mapping:
--> 387             raise KeyError(key)
    388         value = self._mapping[key]
    389         module_name = model_type_to_module_name(key)

KeyError: 'gpt_neox'
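The KeyError comes from CONFIG_MAPPING["gpt_neox"], which suggests the installed transformers does not know the gpt_neox model type at all, i.e. it predates GPT-NeoX support (the traceback also shows a Python 3.6 environment, so upgrading may require a newer Python too). A minimal stdlib sketch for checking the installed version against a minimum; the 4.28.1 floor is an assumption based on my recollection of the Dolly model card's install line, so adjust it to whatever the repo currently pins. `parse_version` and `transformers_too_old` are hypothetical helpers, and importlib.metadata needs Python 3.8+.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Extract the leading numeric release, e.g. "4.28.0.dev0" -> (4, 28, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def transformers_too_old(installed: str, minimum: str = "4.28.1") -> bool:
    """True if `installed` is older than `minimum` (pre-release suffixes are ignored)."""
    return parse_version(installed) < parse_version(minimum)


try:
    current = metadata.version("transformers")
except metadata.PackageNotFoundError:
    current = None

if current is None:
    print("transformers is not installed")
elif transformers_too_old(current):
    print(f"transformers {current} is older than the assumed minimum; try upgrading")
else:
    print(f"transformers {current} looks recent enough")
```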

srowen commented 1 year ago

I think it's saying you passed an invalid path to a model somewhere. How are you loading it? Just load from HF.

FurkanGozukara commented 1 year ago

Can't we run this on an RTX 3060 (12 GB)?

oakkas84 commented 1 year ago

Yes from HF.

FurkanGozukara commented 1 year ago

I tried the torch_dtype=torch.bfloat16 snippet Davidy22 posted above, but I still get an error on an RTX 3060.

Is that expected?

FurkanGozukara commented 1 year ago

So what is the minimum VRAM, and which card do I need to run this?

Can't we reduce the precision even further, e.g. to int8 or int4?
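A rough way to answer the VRAM question is parameters × bytes per parameter, for the weights alone. This sketch ignores activations, the generation cache, and framework overhead, so real usage is higher; the parameter counts are the nominal sizes from the model names (12B, 6.9B, 2.8B), and actually loading in int8 or int4 would need a quantization library that this thread does not cover.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal sizes taken from the model names in this thread.
MODELS = {"12B": 12e9, "6.9B": 6.9e9, "2.8B": 2.8e9}

for name, params in MODELS.items():
    row = ", ".join(f"{dt}: {weight_memory_gb(params, dt):.1f} GB" for dt in BYTES_PER_PARAM)
    print(f"{name}: {row}")
```

By this estimate the 12B model needs about 24 GB for bfloat16 weights alone, so it cannot fit on a 12 GB RTX 3060, and even 8-bit (about 12 GB) leaves no headroom. The 6.9B model in bfloat16 (about 13.8 GB) is also over 12 GB, while the 2.8B model at about 5.6 GB looks like the more realistic option for that card.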

srowen commented 1 year ago

@FurkanGozukara can you make a separate thread please? that doesn't sound related.

@oakkas84 ah right. Can you add task="text-generation" to your pipeline(..) call and see if that resolves it? It looks like it's trying to figure out what kind of task this is. (I think that's being fixed in the model config too.)
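When pipeline() receives an already-instantiated model instead of a model-id string, it cannot look up the task on the Hub, which matches the RuntimeError reported above. A minimal sketch, assuming model and tokenizer were already loaded with AutoModelForCausalLM and AutoTokenizer as in the earlier snippets: pass task and tokenizer explicitly. `build_text_generation` is a hypothetical helper. This likely gives the generic text-generation pipeline rather than Dolly's own instruction-following one, so the prompt may not be wrapped in Dolly's instruction template.

```python
def build_text_generation(model, tokenizer, max_new_tokens=128, device=None):
    """Hypothetical helper: wrap an already-loaded model in a text-generation pipeline.

    The task is passed explicitly because pipeline() can only infer it from the
    Hub when `model` is a model-id string; passing the tokenizer explicitly avoids
    relying on it being re-resolved from the model config.
    """
    from transformers import pipeline  # lazy import; only needed when called

    kwargs = {
        "task": "text-generation",
        "model": model,
        "tokenizer": tokenizer,
        "max_new_tokens": max_new_tokens,
    }
    if device is not None:
        # Leave device as None if the model was loaded with device_map="auto".
        kwargs["device"] = device
    return pipeline(**kwargs)
```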

FurkanGozukara commented 1 year ago

I opened a separate thread, and you closed it?

https://github.com/databrickslabs/dolly/issues/68

srowen commented 1 year ago

I believe that thread is answered by other discussions; it was a duplicate. See my response there.

kostecky commented 1 year ago

I resolved the original error this issue is about. I got it working on the CPU (fully) and on the GPU (partially); however, the GPU output is half-garbled. Does anyone have any insight? Unfortunately, both are still relatively slow.

Instructions for dolly-v2-2-8b on a 32 GB M1 Mac: CPU (fully working) and GPU (partially working; verified in Activity Monitor that it uses the GPU)

1. Use Python 3.10.
2. mkdir dolly; cd dolly
3. python3.10 -m venv .venv && source .venv/bin/activate
4. pip install -U pip
5. pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

6. pip install -r requirements.txt, where requirements.txt contains:

accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
filelock==3.11.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
packaging==23.1
Pillow==9.5.0
psutil==5.9.4
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sympy==1.11.1
tokenizers==0.13.3
torch==2.1.0.dev20230413
torchaudio==2.1.0.dev20230413
torchvision==0.16.0.dev20230413
tqdm==4.65.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15

7. Download the tokenizer.json file from the Dolly 2.0 Hugging Face repo and rename it to tokenizer-8B.json (or adjust the code below).

8. For CPU, use the following code:


from transformers import pipeline, PreTrainedTokenizerFast, GPTNeoXForCausalLM
import torch

# Load the model, plus the tokenizer from the renamed tokenizer.json file
model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-2-8b")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer-8B.json")

instruct_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    task="text-generation",
    max_new_tokens=128,
)

response = instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")

print(str(response))

9. When I run it:

time python test-dolly.py
Setting pad_token_id to eos_token_id:0 for open-end generation.
[{'generated_text': 'Explain to me the difference between nuclear fission and fusion.\n\nNuclear fission is the splitting of a heavy atom into two lighter atoms. It is caused by the collision of a very heavy particle, such as a neutron, with the nucleus of an atom. Fusion is the process by which two or more atoms join together to form a larger atom or atom nuclei. It is caused by the collision of two lighter particles, such as protons, with the nuclei of two or more atoms.\n\nNuclear fission is the splitting of a heavy atom into two lighter atoms. It is caused by the collision of a very heavy particle, such as a neutron, with the nucleus of an atom. Fusion'}]

real 1m36.291s
user 3m12.309s
sys 0m23.912s

10. For GPU, use the following code:

from transformers import pipeline, PreTrainedTokenizerFast, GPTNeoXForCausalLM
import torch

model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-2-8b")

# Move the model to Apple's Metal (MPS) GPU backend
device = torch.device("mps")
model.to(device)

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer-8B.json")

instruct_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    task="text-generation",
    max_new_tokens=128,
    device=device,
)

response = instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")

print(str(response))

11. When I run it:

time python test-dolly.py
Setting pad_token_id to eos_token_id:0 for open-end generation.
/Users/kris/devel/dolly/.venv/lib/python3.10/site-packages/transformers/generation/utils.py:2338: UserWarning: MPS: no support for int64 for min_max, downcasting to a smaller data type (int32/float32). Native support for int64 has been added in macOS 13.3. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/ReduceOps.mm:610.)
  if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
[{'generated_text': 'Explain to me the difference between nuclear fission and fusion.\nFossilisation\n\nFissie, and Reduce it.\n\n\nNifungu will be c n the process calleduclear fission, and the last is the process called nuclear and the process. The process ofcombination of the two twoa the thestodeepro\n\n\n\n\nf that is. of the way of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of'}]

real 1m14.183s
user 0m38.674s
sys 0m12.271s
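Since these results depend heavily on versions (Python 3.10, a nightly torch, macOS with MPS), a quick way to confirm what an environment actually has is sketched below. It uses only `sys`, `platform`, and `torch.backends.mps`; the function name and dictionary keys are arbitrary choices for illustration.

```python
import platform
import sys


def describe_environment() -> dict:
    """Collect the versions and MPS flags that the steps above depend on."""
    info = {
        "python": tuple(sys.version_info[:2]),
        "macos": platform.mac_ver()[0] or None,  # mac_ver() returns "" off macOS
        "torch": None,
        "mps_built": False,
        "mps_available": False,
    }
    try:
        import torch
    except ImportError:
        return info  # torch not installed; nothing more to report
    info["torch"] = torch.__version__
    info["mps_built"] = torch.backends.mps.is_built()  # compiled with MPS support
    info["mps_available"] = torch.backends.mps.is_available()  # usable right now
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```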



Perhaps someone with knowledge of MPS/Mac GPUs who uses this hardware for LLMs can explain why the GPU garbles the output so badly and doesn't give a significant speed boost? @LeiHao0, this may interest you too.

I also noticed that when I significantly increase the `max_new_tokens=` parameter, to something like 2048, it takes forever, and my mouse intermittently goes nuts, moving all over the screen randomly with the buttons triggering on their own. It's spooky, but I suspect some strange GPU bug that is related to the bad output.

srowen commented 1 year ago

This is great. I'm going to close this, though, as I think it has moved beyond the original question.

mikev-db commented 1 year ago

@kostecky Try using nightly PyTorch and transformers. I had the same issue with gibberish output, and if I recall correctly, this PR in transformers fixed it for me: https://github.com/huggingface/transformers/pull/22908

dlowe commented 1 year ago

FWIW: I hit the original issue (ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).) using transformers 4.28.1. It appears to have been caused by a full disk: the pytorch_model.bin download was incomplete as a result. Freeing up space and nuking the Hugging Face cache to force a re-download resolved the issue.
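Because a truncated download can fail with this same ValueError, checking free space before downloading may save a confusing debugging session. A minimal sketch using `shutil.disk_usage`: `~/.cache/huggingface` is assumed to be the cache location (it moves if `HF_HOME` or a custom `cache_dir` is set), and 24 GB is an assumed size for dolly-v2-12b (about 12 billion parameters at 2 bytes each), not a measured one.

```python
import shutil
from pathlib import Path


def free_gb(path) -> float:
    """Free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    p = Path(path).expanduser()
    # disk_usage() needs an existing path, so walk up to the nearest one.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 1e9


def enough_space(path, required_gb: float) -> bool:
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    cache = "~/.cache/huggingface"  # assumed default cache location
    if not enough_space(cache, 24):  # assumed size of the 12B checkpoint
        print("Not enough free space; free some up and delete any partial download first.")
```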

rkique commented 1 year ago

I was able to get Dolly running successfully on the GPU (no gibberish), mostly by following @kostecky's comment, on a MacBook Air M2 (macOS Ventura 13.4, 16GB RAM) with the nightly PyTorch and transformers builds, per @mikev-db. I found it necessary to add PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.5 to prevent MPS backend out-of-memory errors. It seems you need Python 3.10, Ventura 13.3+, and nightly builds for it to work.

It takes 1 minute 16 seconds. With xformers, you save about 10 seconds.
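The same setting can be applied from inside the script instead of the shell. A minimal sketch: the variable has to be in the environment before PyTorch sets up its MPS allocator, hence setting it before `import torch`. The value 0.5 is just the one reported here, not a tuned recommendation, and `pick_device` is a hypothetical helper.

```python
import os

# The MPS allocator reads this variable, so set it before importing torch.
# setdefault() keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.5")


def pick_device() -> str:
    """Return "mps" when the Apple GPU backend is usable, else "cpu"."""
    import torch  # imported only after the variable above is set

    return "mps" if torch.backends.mps.is_available() else "cpu"
```

Equivalently, from the shell: PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.5 python test-dolly.py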

response = instruct_pipeline("Explain to me the difference between eukaryotes and prokaryotes")
print(str(response))

[{'generated_text': 'Explain to me the difference between eukaryotes and prokaryotes, and how they differ from plants and animals.\nI understand that eukaryotes are unicellular, and prokaryotes are pro-\n\nA:\n\nEukaryotes are cells, and prokaryotes are not.\nEukaryotes have a cell nucleus, prokaryotes do not.\nEukaryotes have a cell membrane, prokaryotes do not.\nEukaryotes have organelles, prokaryotes do not.\n Eukaryotes have a cytoskeleton, prokaryotes do not.\nEukaryotes have a cell wall, prokaryotes do'}]

Note that the output is wrong :p

surya-narayanan commented 1 year ago

Similar error, and I don't know why:

ValueError: Could not load model stabilityai/StableBeluga-7B with any of the following classes: (, ).

ChenaLee commented 1 year ago

In my case, the NVIDIA driver wasn't running correctly. I followed this guide to reinstall it: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
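Before reinstalling, one way to tell whether the driver is the problem is to check that `nvidia-smi` runs and that PyTorch reports a CUDA device. A minimal sketch: it only reports status and does not fix anything, the function name is hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def nvidia_driver_status() -> dict:
    """Report whether nvidia-smi runs and whether PyTorch can see a CUDA GPU."""
    status = {"nvidia_smi_found": False, "nvidia_smi_ok": False, "torch_cuda": None}
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        status["nvidia_smi_found"] = True
        try:
            result = subprocess.run([smi], capture_output=True, text=True, timeout=30)
            status["nvidia_smi_ok"] = result.returncode == 0
        except subprocess.TimeoutExpired:
            pass  # a hanging nvidia-smi is itself a sign of driver trouble
    try:
        import torch
    except ImportError:
        return status  # torch_cuda stays None when torch is not installed
    status["torch_cuda"] = torch.cuda.is_available()
    return status


if __name__ == "__main__":
    print(nvidia_driver_status())
```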

databrickslabs / dolly

Could not load model with AutoModelForCasualLM dolly-v2-12b #60
