allenai / OLMo

Modeling, training, eval, and inference code for OLMo
https://allenai.org/olmo
Apache License 2.0

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. #581

Closed: mclanza closed this issue 2 months ago

mclanza commented 2 months ago

🐛 Describe the bug

Following the instructions at https://github.com/allenai/OLMo worked a few days ago. I tried again yesterday and today, and it now shows the warning/error "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference." and no results are printed, as they used to be.

This is the code that I'm trying to run:

from hf_olmo import * # registers the Auto* classes

from transformers import AutoModelForCausalLM, AutoTokenizer

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B")

message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

I also tried with tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B", trust_remote_code=True), but the same message is shown.

I also tried allenai/olmo-7b (lowercase), and the same message is shown.

What am I missing? What could have changed since the last time that it worked?

Thanks!

Versions

python --version && pip freeze
Python 3.9.6
ai2-olmo==0.3.0 antlr4-python3-runtime==4.9.3 boto3==1.34.107 botocore==1.34.107 cached_path==1.6.2 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 filelock==3.13.4 fsspec==2024.5.0 google-api-core==2.19.0 google-auth==2.29.0 google-cloud-core==2.4.1 google-cloud-storage==2.16.0 google-crc32c==1.5.0 google-resumable-media==2.7.0 googleapis-common-protos==1.63.0 huggingface-hub==0.21.4 idna==3.7 Jinja2==3.1.4 jmespath==1.0.1 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 mpmath==1.3.0 networkx==3.2.1 numpy==1.26.4 omegaconf==2.3.0 packaging==24.0 pillow==10.3.0 proto-plus==1.23.0 protobuf==4.25.3 pyasn1==0.6.0 pyasn1_modules==0.4.0 Pygments==2.18.0 python-dateutil==2.9.0.post0 PyYAML==6.0.1 regex==2024.5.15 requests==2.31.0 rich==13.7.1 rsa==4.9 s3transfer==0.10.1 safetensors==0.4.3 six==1.16.0 sympy==1.12 tokenizers==0.19.1 torch==2.0.0 torchaudio==2.0.1 torchvision==0.15.1 tqdm==4.66.4 transformers==4.40.2 typing_extensions==4.11.0 urllib3==1.26.18

2015aroras commented 2 months ago

Hi Marie,

We recently integrated OLMo directly into the transformers library, which required changing which checkpoints to use. You will probably want to use allenai/OLMo-7B-hf instead, or allenai/OLMo-1.7-7B-hf if you would like our recently released, improved 7B model. I have updated the README with the new information: https://github.com/allenai/OLMo/pull/589. Sorry for the inconvenience.
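For reference, a minimal sketch with the new checkpoint name (assuming transformers >= 4.40, which includes the native OLMo support; the hf_olmo import is no longer needed):

from transformers import AutoModelForCausalLM, AutoTokenizer

# OLMo is natively supported in transformers now, so no hf_olmo import is required
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B-hf")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-7B-hf")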

mclanza commented 2 months ago

Thanks @2015aroras !

I tried it with the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer

olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1.7-7B-hf")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1.7-7B-hf", trust_remote_code=True)

message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

and it hangs after loading the checkpoint shards.

[screenshot: output hangs after "Loading checkpoint shards"]

I tried it myself, and some colleagues tried it too, on different machines, environments, and networks. Any ideas that could help us?

Thanks!


Versions

python --version && pip freeze
Python 3.9.6
ai2-olmo==0.3.0 antlr4-python3-runtime==4.9.3 boto3==1.34.111 botocore==1.34.111 cached_path==1.6.2 cachetools==5.3.3 certifi==2024.2.2 charset-normalizer==3.3.2 filelock==3.13.4 fsspec==2024.5.0 google-api-core==2.19.0 google-auth==2.29.0 google-cloud-core==2.4.1 google-cloud-storage==2.16.0 google-crc32c==1.5.0 google-resumable-media==2.7.0 googleapis-common-protos==1.63.0 huggingface-hub==0.21.4 idna==3.7 Jinja2==3.1.4 jmespath==1.0.1 markdown-it-py==3.0.0 MarkupSafe==2.1.5 mdurl==0.1.2 mpmath==1.3.0 networkx==3.2.1 numpy==1.26.4 omegaconf==2.3.0 packaging==24.0 pillow==10.3.0 proto-plus==1.23.0 protobuf==4.25.3 pyasn1==0.6.0 pyasn1_modules==0.4.0 Pygments==2.18.0 python-dateutil==2.9.0.post0 PyYAML==6.0.1 regex==2024.5.15 requests==2.32.2 rich==13.7.1 rsa==4.9 s3transfer==0.10.1 safetensors==0.4.3 six==1.16.0 sympy==1.12 tokenizers==0.19.1 torch==2.2.2 torchaudio==2.2.2 torchvision==0.17.2 tqdm==4.66.4 transformers==4.40.2 typing_extensions==4.11.0 urllib3==1.26.18

mclanza commented 2 months ago

It hangs here: response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)


Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

print('Step 1...')
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1.7-7B-hf")

print('Step 2...')
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1.7-7B-hf")

message = ["Language modeling is "]
print('Step 3...')
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)

print('Step 4...')
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

print('Step 5...')
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

Result:

[screenshot: the output stops after "Step 4...", at the generate call]

The same happens with allenai/OLMo-7B-hf, with or without the trust_remote_code=True argument.

2015aroras commented 2 months ago

The trust_remote_code=True argument is not needed for our -hf models.

Without looking too deeply, my guess is that this is running on CPU (since nothing is telling it to run on a GPU). The simplest solution would be to pass device_map="auto" to AutoModelForCausalLM.from_pretrained (and pip install accelerate if needed). You may also add .to("cuda") to your inputs to move them to the GPU as well. Other than that, you can set max_new_tokens=1 just to see whether things are hanging or the model is just slow.

Aside: the model might not fit on just 1 GPU.
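Putting these suggestions together, a minimal sketch (assuming a CUDA-capable GPU with enough memory, and accelerate installed so device_map="auto" works; adjust the checkpoint name as needed):

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets accelerate place the model weights on the available GPU(s)
olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-1.7-7B-hf", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1.7-7B-hf")

print(olmo.device)  # confirm the model did not land on CPU

message = ["Language modeling is "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
inputs = {k: v.to(olmo.device) for k, v in inputs.items()}  # move inputs to the model's device

# try max_new_tokens=1 first to tell a hang apart from a slow CPU run
response = olmo.generate(**inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])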

mclanza commented 2 months ago

Exactly! Thanks for your reply! It took much more time (hours), but it finally worked. I'll try your suggestions. Thanks again!

[screenshot: the generated output]

mclanza commented 2 months ago

Thanks for the detailed replies. It works fine now. I'm closing the issue. Thanks again!