epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Issue with generation with standard HF generation #8

Closed NtaylorOX closed 7 months ago

NtaylorOX commented 7 months ago

Great work and repo - however there is a tokenizer issue with the base version of the model.

When simply prompting the base model with the suggested format, it runs into CUDA errors that seem to indicate a tokenizer/embedding mismatch.

Working example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "epfl-llm/meditron-7b"

# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)

def format_prompt(prompt):

    system_msg = "You are a helpful, respectful and honest assistant. " + \
    "Always answer as helpfully as possible, while being safe. " + \
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. " + \
    "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. " + \
    "If you don't know the answer to a question, please don't share false information."

    # Overridden with a shorter system message for this test
    system_msg = "You are a helpful, respectful and honest assistant."

    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"

med_prompt = format_prompt("What is a possible treatment for high blood pressure in a pregnant woman?")

Gives us this prompt:

'<|im_start|> system\nYou are a helpful, respectful and honest assistant.<|im_end|>\n <|im_start|> user\nWhat is a possible treatment for high blood pressure in a pregnant woman?<|im_end|>\n <|im_start|> assistant\n'

Use vanilla HF pipeline:

# Use a pipeline for later
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,
    top_k=30,
    num_return_sequences=2,
    eos_token_id=tokenizer.eos_token_id,
    return_full_text=False,
)

# generate from prompt
generated = pipe(med_prompt)

Leads to:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [642,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

But it all works fine if the special formatting is not provided. I understand the special formatting was only for the fine-tuned versions, but the tokenizer has these special tokens added for the base model too, which seems problematic.

I hope this is enough detail to go on, but it's throwing me a bit - it seems the special tokens do not play nicely with the base model.
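A minimal diagnostic sketch (reusing the model, tokenizer and med_prompt objects from the example above) that makes the mismatch visible - any prompt id at or beyond the number of embedding rows trips the device-side assertion:

# Compare the tokenizer's vocabulary (including added special tokens) with the model's embedding matrix.
embedding_rows = model.get_input_embeddings().num_embeddings
print(f"tokenizer length (incl. added tokens): {len(tokenizer)}")
print(f"model embedding rows:                  {embedding_rows}")

# Any prompt token id >= embedding_rows will fail the `srcIndex < srcSelectDimSize` assertion.
prompt_ids = tokenizer(med_prompt)["input_ids"]
print("out-of-range ids:", [i for i in prompt_ids if i >= embedding_rows])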

Environment details:

Python 3.9

Pip packages:

Package Version


accelerate 0.20.3 aiofiles 23.2.1 aiohttp 3.8.4 aiosignal 1.3.1 altair 5.1.2 annotated-types 0.6.0 anyio 3.7.1 asttokens 2.2.1 async-timeout 4.0.2 attrs 23.1.0 backcall 0.2.0 bertopic 0.16.0 blis 0.7.11 catalogue 2.0.10 certifi 2023.5.7 charset-normalizer 3.1.0 click 8.1.7 cloudpathlib 0.16.0 cmake 3.26.4 colorama 0.4.6 comm 0.1.3 confection 0.1.4 contourpy 1.2.0 cycler 0.12.1 cymem 2.0.8 Cython 0.29.36 datasets 2.13.1 debugpy 1.6.7 decorator 5.1.1 dill 0.3.6 einops 0.6.1 en-core-web-sm 3.7.1 exceptiongroup 1.1.3 executing 1.2.0 fastapi 0.104.1 fastjsonschema 2.19.0 ffmpy 0.3.1 filelock 3.12.2 fonttools 4.44.0 frozenlist 1.3.3 fsspec 2023.6.0 gradio 4.2.0 gradio_client 0.7.0 h11 0.14.0 hdbscan 0.8.33 httpcore 1.0.2 httpx 0.25.1 huggingface-hub 0.15.1 idna 3.4 importlib-metadata 6.7.0 importlib-resources 6.1.1 ipykernel 6.23.3 ipython 8.14.0 jedi 0.18.2 Jinja2 3.1.2 joblib 1.3.2 jsonschema 4.19.2 jsonschema-specifications 2023.7.1 jupyter_client 8.3.0 jupyter_core 5.3.1 kiwisolver 1.4.5 langcodes 3.3.0 lit 16.0.6 llvmlite 0.41.1 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.1 matplotlib-inline 0.1.6 mdurl 0.1.2 mpmath 1.3.0 multidict 6.0.4 multiprocess 0.70.14 murmurhash 1.0.10 nbformat 5.9.2 nest-asyncio 1.5.6 networkx 3.1 nltk 3.8.1 numba 0.58.1 numpy 1.25.0 nvidia-cublas-cu11 11.10.3.66 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu11 11.7.101 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu11 11.7.99 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu11 11.7.99 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu11 8.5.0.96 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu11 10.9.0.58 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu11 10.2.10.91 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu11 11.4.0.1 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu11 11.7.4.91 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu11 2.14.3 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.3.52 nvidia-nvtx-cu11 11.7.91 nvidia-nvtx-cu12 12.1.105 orjson 3.9.10 packaging 23.1 pandas 2.0.3 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 Pillow 10.1.0 pip 23.1.2 platformdirs 3.8.0 plotly 5.18.0 preshed 3.0.9 prompt-toolkit 3.0.38 psutil 5.9.5 ptyprocess 0.7.0 pure-eval 0.2.2 pyarrow 12.0.1 pydantic 2.4.2 pydantic_core 2.10.1 pydub 0.25.1 Pygments 2.15.1 pynndescent 0.5.11 pyparsing 3.1.1 python-dateutil 2.8.2 python-multipart 0.0.6 pytz 2023.3 PyYAML 6.0 pyzmq 25.1.0 referencing 0.30.2 regex 2023.6.3 requests 2.31.0 rich 13.6.0 rpds-py 0.12.0 safetensors 0.3.1 scikit-learn 1.3.2 scipy 1.11.4 semantic-version 2.10.0 sentence-transformers 2.2.2 sentencepiece 0.1.99 setuptools 58.1.0 shellingham 1.5.4 six 1.16.0 smart-open 6.4.0 sniffio 1.3.0 spacy 3.7.2 spacy-legacy 3.0.12 spacy-loggers 1.0.5 srsly 2.4.8 stack-data 0.6.2 starlette 0.27.0 sympy 1.12 tenacity 8.2.3 thinc 8.2.1 threadpoolctl 3.2.0 tokenizers 0.13.3 tomlkit 0.12.0 toolz 0.12.0 torch 2.1.1 torchvision 0.16.1 tornado 6.3.2 tqdm 4.65.0 traitlets 5.9.0 transformers 4.30.2 triton 2.1.0 typer 0.9.0 typing_extensions 4.8.0 tzdata 2023.3 umap-learn 0.5.5 urllib3 2.0.3 uvicorn 0.24.0.post1 wasabi 1.1.2 wcwidth 0.2.6 weasel 0.3.4 websockets 11.0.3 wheel 0.40.0 xxhash 3.2.0 yarl 1.9.2 zipp 3.15.0

NtaylorOX commented 7 months ago

Actually - I am noticing a slightly odd thing that may explain this, although it would be really weird.

I am loading using the id "epfl-llm/meditron-7b" - note the lower-case b - rather than "epfl-llm/meditron-7B". The latter tells me the model is gated and I don't have access, even though according to HF I do.

The discrepancy seems to be that your repo uses the upper-case B, which leads to gating issues, while on Hugging Face it's "epfl-llm/meditron-7b", which works fine.

NtaylorOX commented 7 months ago

So I think we can assume this is due to the tokenizer being mismatched with the base model? The model loaded with the above code has an embedding size of 32,000, but the tokenizer has several extra tokens added - presumably for the supervised fine-tuned versions?
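As a stopgap, two possible workarounds are sketched below (not official project guidance; they reuse the objects loaded above, and resizing may not interact cleanly with a 4-bit quantized model):

# Workaround (a): grow the embedding matrix so the added token ids have rows to index into.
# The new rows are randomly initialised, so the extra tokens carry no learned meaning.
model.resize_token_embeddings(len(tokenizer))

# Workaround (b): skip the chat-style special tokens entirely and prompt the base model plainly.
plain_prompt = "What is a possible treatment for high blood pressure in a pregnant woman?"
generated = pipe(plain_prompt)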

eric11eca commented 7 months ago

Hi! Thanks for posting the issue with vanilla HF!

Yes, we noticed that several people are facing the same issue with the special tokens.

So, the '<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n' format is meant for our fine-tuned models, but meditron-7b and meditron-70b are pretrained models that are not fine-tuned. In this case, the tokenizer mistakenly includes the additional special tokens.

We have updated the tokenizer model and its related config files by removing the additional special tokens. Let us know if the fix resolves the issue with the special tokens.
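Note that transformers caches previously downloaded files, so a stale local copy can hide the update; a small sketch for pulling the fresh tokenizer (force_download is a standard from_pretrained argument) and inspecting its added vocabulary:

from transformers import AutoTokenizer

# Re-download the updated tokenizer instead of using the cached copy.
tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b", force_download=True)
print(len(tokenizer))               # total vocabulary, including any added tokens
print(tokenizer.get_added_vocab())  # tokens added on top of the base vocabulary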

Also, thank you for catching the incorrect repo name in the README! It should be fixed now.

NtaylorOX commented 7 months ago

That's great. And sorry for seemingly posting as you were replying! I think we all came to the same conclusion at the same time.

I'll test it out tomorrow and close the issue if all seems good.

NtaylorOX commented 7 months ago

Hi, thanks again for the insights. I can see the tokenizer has been updated for the base model, although the updated tokenizer config still includes tokens it shouldn't for the base model.

At the moment it still adds the following:

LlamaTokenizerFast(name_or_path='epfl-llm/meditron-7b', vocab_size=32000, ..., added_tokens_decoder={
    0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    32000: AddedToken("<CLS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32001: AddedToken("<SEP>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32002: AddedToken("<EOD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32003: AddedToken("<MASK>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    32004: AddedToken("<PAD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
})

But the model only has embeddings for 32,000 tokens, and its known vocab size is likewise 32,000. So if you try to use the <PAD> token with the base model, it runs into the same problems described above.
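For now, a common workaround for a base model without a dedicated pad token (a sketch, not project guidance) is to reuse the eos token for padding rather than the out-of-range <PAD> id:

# Pad with the eos token (id 2, well inside the 32,000 embedding rows)
# rather than the <PAD> token the config adds at id 32004.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id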

Will keep this issue open for now if that is okay.

eric11eca commented 7 months ago

Hi! We just updated the model and tokenizer with a consistent vocab size (32017). Let us know if the issue is resolved this time. Thanks!
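A quick consistency check (a sketch; it reloads both files from the Hub) to confirm every tokenizer id now has an embedding row to index into:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("epfl-llm/meditron-7b")
mdl = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-7b", torch_dtype=torch.bfloat16, device_map="auto")

rows = mdl.get_input_embeddings().num_embeddings
assert len(tok) <= rows, f"tokenizer ({len(tok)}) still exceeds embedding rows ({rows})"
print(len(tok), rows)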

NtaylorOX commented 7 months ago

Hi! Thanks for staying on top of this. I just checked and yes, the base model is now aligned with the tokenizer, and the special tokens no longer throw any errors. We can close for now :)