Closed Mahathi-Bhagavatula closed 1 year ago
Hm, works for me. How are you loading? Maybe an out-of-date version of transformers?
Oh, hm, I am finding a problem with the pipeline impl for this model, that might or might not be the same issue. Hold tight. (Has to do with setting task type to instruction-following)
Could you check if adding task="text-generation" to your pipeline() call makes it work?
I am experiencing the same thing and adding task="text-generation" does not change the error.
I'm using an M1 (Apple silicon) Mac. I found a reference to someone's post on https://news.ycombinator.com/item?id=35541861:
The error message implies that the compiled default libraries on the M1 don't support the model format, even though it works fine in Paperspace.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/Users/fragmede/projects/llm/dolly/foo.py", line 5, in <module>
instruct_pipeline = pipeline(
^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/__init__.py", line 776, in pipeline
framework, model = infer_framework_load_model(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
If this is the case, I wonder if there is a way to get this to work on an M1 or if this is an error independent of arch?
I don't think this will run on Macs. It needs CUDA, etc. If that's the nature of this problem, sorry not going to work.
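A note on the traceback above: the ValueError only lists the classes that were tried. In the transformers version shown there, the exception each class actually raised appears to be swallowed, so the real cause (a missing file, running out of memory, an unsupported dtype) stays hidden. Below is a minimal sketch of the same try-each-class loop that keeps every error. The names try_loaders and describe are hypothetical helpers, not part of transformers.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def try_loaders(
    model_id: str,
    loaders: Iterable[Tuple[str, Callable[[str], object]]],
) -> Tuple[Optional[object], Dict[str, BaseException]]:
    """Try each loader in order and return (model, errors_by_loader_name).

    Every exception is kept instead of being skipped, so the underlying
    cause of a failed load can be inspected afterwards.
    """
    errors: Dict[str, BaseException] = {}
    for name, load in loaders:
        try:
            return load(model_id), errors
        except Exception as exc:  # deliberately broad: the goal is to report it
            errors[name] = exc
    return None, errors


def describe(errors: Dict[str, BaseException]) -> str:
    """Format the collected errors, one line per loader."""
    return "\n".join(
        f"{name}: {type(exc).__name__}: {exc}" for name, exc in errors.items()
    )
```

With transformers installed, one would pass something like [("AutoModelForCausalLM", AutoModelForCausalLM.from_pretrained)] as loaders and print describe(errors) when the returned model is None.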
Messenger pigeoning from another tracker, but someone else and I have both had success with setting torch_dtype on Linux:
import torch
from transformers import pipeline
generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
I am having the same issue and the above also does not solve it.
1) I am using a GPU with CUDA installed, not a Mac.
2) Neither torch_dtype nor task="text-generation" worked for me.
3) Even loading directly with AutoModelForCausalLM and AutoTokenizer didn't work.
4) I am using transformers 4.28.0.dev0.
It can run on m1max 64G with adding offload_folder="offload"
Something like this:
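from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
# offload_folder gives accelerate a place on disk for weights that do not fit in memory
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-12b", device_map="auto", offload_folder="offload")
# generate_text is not defined in this snippet; presumably it is a text-generation pipeline built from model and tokenizer
text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)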
It turns out 10 GB swap memory is needed and a few minutes to see the text
That's probably not great, if you're having to swap. Try using a smaller model? There are 6.9B and 2.8B param models now.
This is great. I wonder if it's not using the GPUs as well? Obviously swap will destroy performance. Can you try this? It looks to go further, but I don't have enough RAM and it gets killed. I'll have to test with increasing swap size. This may force it to use mps (mac GPUs), but I'm still trying to figure it out.
from transformers import pipeline, AutoModel
import torch
model = AutoModel.from_pretrained("databricks/dolly-v2-12b")
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
instruct_pipeline = pipeline(model=model, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16)
I'm also trying to leverage mps; unfortunately, I got an error:
RuntimeError: Placeholder storage has not been allocated on MPS device!
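For what it's worth, this "Placeholder storage" error is commonly reported when the model has been moved to the MPS device but the input tensors are still on the CPU. That is an assumption here, since the failing code isn't shown. A minimal sketch of picking one device and using it consistently follows. pick_device and detect_device are hypothetical helper names, and the torch import is optional so the snippet also runs without PyTorch.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; fall back to "cpu" without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is missing on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    return pick_device(torch.cuda.is_available(), bool(mps and mps.is_available()))


if __name__ == "__main__":
    print("selected device:", detect_device())
```

The point is to move both sides to the same device, for example model.to(device) and tokenizer(prompt, return_tensors="pt").to(device), rather than only the model.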
@kostecky, Python 3.11 is not supported by macOS Metal yet.
You can downgrade to Python 3.10; try this:
conda create -n dolly
conda activate dolly
conda install python==3.10.10
pip install tensorflow-macos==2.12.0 tensorflow-metal==0.8.0
@LeiHao0 I am on an MBP M1 Pro (32GB), using Python 3.10.10, and have tried using the 2.8B model (dolly-v2-2-8b) to see if I can speed up the testing and lower memory usage. Still having some issues.
pip freeze:
accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
filelock==3.11.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
packaging==23.1
Pillow==9.5.0
psutil==5.9.4
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sympy==1.11.1
tokenizers==0.13.3
torch==2.1.0.dev20230413
torchaudio==2.1.0.dev20230413
torchvision==0.16.0.dev20230413
tqdm==4.65.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15
I have the following code:
from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-2-8b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-2-8b", device_map="auto", offload_folder="offload")
generate_text = pipeline(model=model, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)
and I get the following error:
RuntimeError: Inferring the task automatically requires to check the hub with a model_id defined as a `str`.GPTNeoXForCausalLM(
(gpt_neox): GPTNeoXModel(
(embed_in): Embedding(50280, 2560)
(layers): ModuleList(
(0-31): 32 x GPTNeoXLayer(
(input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
(attention): GPTNeoXAttention(
(rotary_emb): RotaryEmbedding()
(query_key_value): Linear(in_features=2560, out_features=7680, bias=True)
(dense): Linear(in_features=2560, out_features=2560, bias=True)
)
(mlp): GPTNeoXMLP(
(dense_h_to_4h): Linear(in_features=2560, out_features=10240, bias=True)
(dense_4h_to_h): Linear(in_features=10240, out_features=2560, bias=True)
(act): GELUActivation()
)
)
)
(final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
)
(embed_out): Linear(in_features=2560, out_features=50280, bias=False)
) is not a valid model_id.
I also tried the code below and got the following error:
from transformers import pipeline, AutoModel, AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-2-8b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("databricks/dolly-v2-2-8b", device_map="auto", offload_folder="offload")
generate_text = pipeline(model=model, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")
text = generate_text("Explain to me the difference between nuclear fission and fusion.")
print(text)
RuntimeError Traceback (most recent call last)
Latest error I got:
`--------------------------------------------------------------------------- KeyError Traceback (most recent call last)
It's saying you passed an invalid path to a model somewhere, I think. How are you loading? Just load from HF
Can't we run this on an RTX 3060 12GB?
Yes from HF.
I have this (the pipeline call with torch_dtype=torch.bfloat16 posted earlier in the thread) but get an error on an RTX 3060. Is that expected?
So what is the minimum VRAM and card needed to run this?
Can't we reduce precision even further, to int8 or int4?
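A rough way to reason about the VRAM question is parameter count times bytes per parameter. The sketch below covers the weights only. Activations, the KV cache, and framework overhead come on top, and the 12e9 figure is the nominal size taken from the model name, so treat the numbers as lower bounds and not as measured requirements. weight_gb is a hypothetical helper name.

```python
# Storage cost of one parameter for common precisions, in bytes.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"12B params as {dtype:>8}: ~{weight_gb(12e9, dtype):.0f} GB")
```

By this estimate the 12B weights in bfloat16 are about twice the size of a 12 GB card, and even int8 leaves no headroom, which is consistent with the suggestion elsewhere in the thread to try a smaller model.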
@FurkanGozukara can you make a separate thread please? That doesn't sound related.
@oakkas84 ah right. Can you add task="text-generation" to your pipeline(..) call and see if that resolves it? It looks like it's trying to figure out what kind of task this is (I think that's being fixed in the model config too).
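To spell out the suggestion: when pipeline() gets an already-loaded model object and not a model id string, it has nothing to look up on the hub, so the task has to be passed explicitly, and as far as I know a tokenizer must be passed too. The sketch below only assembles the keyword arguments so it runs without transformers installed. pipeline_kwargs is a hypothetical helper, not a library function.

```python
from typing import Any, Dict, Optional


def pipeline_kwargs(
    model: Any,
    tokenizer: Optional[Any] = None,
    task: str = "text-generation",
) -> Dict[str, Any]:
    """Build keyword arguments for transformers.pipeline().

    The task is always passed explicitly. A tokenizer is required only
    when `model` is a loaded object, because it cannot be looked up then.
    """
    if isinstance(model, str):
        return {"model": model, "task": task}
    if tokenizer is None:
        raise ValueError("pass a tokenizer when `model` is a loaded object")
    return {"model": model, "tokenizer": tokenizer, "task": task}
```

Usage would then look like generate_text = pipeline(**pipeline_kwargs(model, tokenizer)), with any further options such as torch_dtype added by the caller.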
I opened one and you closed it??
I believe that thread is answered by other discussions; it was a duplicate. See my response
I resolved the original error I was getting that this issue is based on. I got it working with CPU (fully working) and GPU (semi-working). However, GPU output is half-garbled! Does anyone have insight? Both are still relatively slow, unfortunately.
accelerate==0.18.0
certifi==2022.12.7
charset-normalizer==3.1.0
filelock==3.11.0
huggingface-hub==0.13.4
idna==3.4
Jinja2==3.1.2
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.2
packaging==23.1
Pillow==9.5.0
psutil==5.9.4
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sympy==1.11.1
tokenizers==0.13.3
torch==2.1.0.dev20230413
torchaudio==2.1.0.dev20230413
torchvision==0.16.0.dev20230413
tqdm==4.65.0
transformers==4.25.1
typing_extensions==4.5.0
urllib3==1.26.15
tokenizer-8B.json
or adjust the code below.
from transformers import pipeline, AutoModel, PreTrainedTokenizerFast, GPTNeoXForCausalLM
import torch
model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-2-8b")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer-8B.json")
instruct_pipeline = pipeline(model=model, tokenizer=tokenizer, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16, task="text-generation", max_new_tokens=128)
response = instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
print(str(response))
9. When I run it:
time python test-dolly.py
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
[{'generated_text': 'Explain to me the difference between nuclear fission and fusion.\n\nNuclear fission is the splitting of a heavy atom into two lighter atoms. It is caused by the collision of a very heavy particle, such as a neutron, with the nucleus of an atom. Fusion is the process by which two or more atoms join together to form a larger atom or atom nuclei. It is caused by the collision of two lighter particles, such as protons, with the nuclei of two or more atoms.\n\nNuclear fission is the splitting of a heavy atom into two lighter atoms. It is caused by the collision of a very heavy particle, such as a neutron, with the nucleus of an atom. Fusion'}]
real 1m36.291s
user 3m12.309s
sys 0m23.912s
10. For GPU, use the following code:
from transformers import pipeline, AutoModel, PreTrainedTokenizerFast, GPTNeoXForCausalLM
import torch
model = GPTNeoXForCausalLM.from_pretrained("databricks/dolly-v2-2-8b")
device = torch.device("mps")
model.to(device)
tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer-8B.json")
instruct_pipeline = pipeline(model=model, tokenizer=tokenizer, trust_remote_code=True, device_map="auto", torch_dtype=torch.bfloat16, task="text-generation", max_new_tokens=128, device=device)
response = instruct_pipeline("Explain to me the difference between nuclear fission and fusion.")
print(str(response))
11. When I run it:
time python test-dolly.py
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/Users/kris/devel/dolly/.venv/lib/python3.10/site-packages/transformers/generation/utils.py:2338: UserWarning: MPS: no support for int64 for min_max, downcasting to a smaller data type (int32/float32). Native support for int64 has been added in macOS 13.3. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/ReduceOps.mm:610.)
if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
[{'generated_text': 'Explain to me the difference between nuclear fission and fusion.\nFossilisation\n\nFissie, and Reduce it.\n\n\nNifungu will be c n the process calleduclear fission, and the last is the process called nuclear and the process. The process ofcombination of the two twoa the thestodeepro\n\n\n\n\nf that is. of the way of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of of'}]
real 1m14.183s
user 0m38.674s
sys 0m12.271s
Perhaps someone with MPS, mac GPU knowledge, and using this hardware with LLMs can explain why the GPUs mess the output up so much and don't really give a significant speed boost? @LeiHao0 this may interest you too.
I also noticed that when significantly extending the `max_new_tokens=` parameter to something like 2048 it will take forever and my mouse will start going nuts intermittently and moving all over the screen in a very random fashion with the buttons triggering randomly too. It's spooky, but I suspect some strange bug with the GPUs being used that corresponds to the bad output.
This is great, I'm going to close though as it's moved beyond the original question I think.
@kostecky Try using nightly pytorch and transformers. I had the same issue with gibberish output, and if I recall this PR in transformers fixed it for me: https://github.com/huggingface/transformers/pull/22908
FWIW: I hit the original issue (ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).) using transformers 4.28.1. It appears to have been caused by a full disk: the pytorch_model.bin download was incomplete as a result of the full disk. Freeing up space and nuking the huggingface cache to re-download resolved the issue.
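A quick way to check for the full-disk case described above is to look at the free space and for leftover partial downloads before re-downloading. This is a sketch under two assumptions: that the cache lives in the default ~/.cache/huggingface/hub location (settings such as HF_HOME can move it), and that interrupted downloads leave files ending in .incomplete. Both helper names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import List


def free_gib(path: Path) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def find_incomplete(cache_dir: Path) -> List[Path]:
    """Return leftover partial downloads (files ending in `.incomplete`)."""
    return sorted(p for p in cache_dir.rglob("*.incomplete") if p.is_file())


if __name__ == "__main__":
    # Assumed default cache location; adjust if your setup overrides it.
    cache = Path.home() / ".cache" / "huggingface" / "hub"
    if cache.exists():
        print(f"free space: {free_gib(cache):.1f} GiB")
        for leftover in find_incomplete(cache):
            print("partial download:", leftover)
```

A truncated file that was saved under its final name would not be caught by this, so deleting the model's cache folder and downloading again is still the more reliable fix.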
Was able to get dolly running successfully on GPU (no gibberish) mostly by following @kostecky's comment on Macbook Air M2 Ventura 13.4, 16GB RAM and with the nightly pytorch & transformers build, per @mikev-db. Found it necessary to add PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.5 to prevent MPS backend out of memory. Seems like you need Python 3.10, Ventura 13.3+, and nightly builds for it to work.
Takes 1 minute 16 seconds. With xformers you save ~10 seconds.
response = instruct_pipeline("Explain to me the difference between eukaryotes and prokaryotes")
print(str(response))
[{'generated_text': 'Explain to me the difference between eukaryotes and prokaryotes, and how they differ from plants and animals.\nI understand that eukaryotes are unicellular, and prokaryotes are pro-\n\nA:\n\nEukaryotes are cells, and prokaryotes are not.\nEukaryotes have a cell nucleus, prokaryotes do not.\nEukaryotes have a cell membrane, prokaryotes do not.\nEukaryotes have organelles, prokaryotes do not.\n Eukaryotes have a cytoskeleton, prokaryotes do not.\nEukaryotes have a cell wall, prokaryotes do'}]
Note that the output is wrong :p
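On the PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.5 setting mentioned above: it can be exported in the shell, or set from Python at the very top of the script. The sketch below takes the second route and assumes the variable needs to be in place before torch is first imported, which is the cautious ordering and not something verified here. mps_watermark_ratio is a hypothetical helper for checking the value.

```python
import os

# Set the limit before anything imports torch (assumed to be required because
# the MPS allocator reads the variable when it is initialised).
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.5")


def mps_watermark_ratio() -> float:
    """Return the ratio the process will run with, as a float."""
    return float(os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"])


if __name__ == "__main__":
    print("PYTORCH_MPS_HIGH_WATERMARK_RATIO =", mps_watermark_ratio())
```

The model-loading code from the earlier comments would follow after these lines in the same file.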
similar error, dunno why
ValueError: Could not load model stabilityai/StableBeluga-7B with any of the following classes: (, ).
In my case, nvidia driver wasn't running correctly. Followed this to re-install https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
ValueError: Could not load model databricks/dolly-v2-12b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.gpt_neox.modeling_gpt_neox.GPTNeoXForCausalLM'>).
Can you please let me know where I went wrong?