Closed NtaylorOX closed 7 months ago
Actually, I am noticing a slightly odd thing that may explain this, although it would be really weird.
I am loading the model with the id "epfl-llm/meditron-7b" (note the lower-case b) rather than "epfl-llm/meditron-7B". The latter tells me the repo is gated and that I don't have access, but according to HF I do.
The discrepancy seems to be that in your repo it's an upper-case B, which leads to gating issues, whereas on Hugging Face it's "epfl-llm/meditron-7b", which works fine.
So I think we can assume this is due to the tokenizer being mismatched with the base model? The model currently loaded with the above code has an embedding size of 32000, but the tokenizer has several tokens added, presumably for the supervised fine-tuned versions?
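One way to confirm a mismatch like this is to compare the tokenizer's full length (base vocabulary plus added tokens) with the number of rows in the model's input embedding matrix. The following is a minimal sketch, not the maintainers' method: the helper name `describe_vocab_mismatch` is hypothetical, and the sizes in the example are illustrative.

```python
def describe_vocab_mismatch(tokenizer_len: int, num_embeddings: int) -> str:
    """Compare the tokenizer size with the embedding row count and describe the result."""
    extra = tokenizer_len - num_embeddings
    if extra > 0:
        return f"tokenizer has {extra} ids with no embedding row"
    if extra < 0:
        return f"model has {-extra} unused embedding rows"
    return "tokenizer and embeddings are aligned"


# Illustrative sizes only (assumption): 32000 embedding rows, 5 extra tokenizer ids.
print(describe_vocab_mismatch(32005, 32000))
```

With `transformers` objects, the two inputs would typically be `len(tokenizer)` and `model.get_input_embeddings().num_embeddings`. Note that `tokenizer.vocab_size` does not count added tokens, so it can report 32000 even when extra special tokens are present.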
Hi! Thanks for posting the issue with vanilla HF!
Yes, we noticed that several people are facing the same issue with the special tokens.
So, the '<|im_start|> system\n{system_msg}.<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n' format is meant for our finetuned models, but meditron-7b and meditron-70b are pretrained models that are not finetuned. In this case, the tokenizer mistakenly includes the additional special tokens.
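To make the distinction concrete, here is a small sketch that applies the chat template only when the target model is a finetuned one and passes the prompt through unchanged for the pretrained base models. The function name `build_prompt` and the `finetuned` flag are hypothetical, and the template is copied as quoted in this thread, including its whitespace, which may not match the exact format the finetuned models expect.

```python
# Template copied verbatim from the quoted message above (whitespace included).
CHAT_TEMPLATE = (
    "<|im_start|> system\n{system_msg}.<|im_end|>\n"
    " <|im_start|> user\n{prompt}<|im_end|>\n"
    " <|im_start|> assistant\n"
)


def build_prompt(prompt: str, system_msg: str = "", finetuned: bool = False) -> str:
    """Return the raw prompt for base models, or the chat-formatted prompt otherwise."""
    if not finetuned:
        # Pretrained-only models were not trained on the special chat tokens.
        return prompt
    return CHAT_TEMPLATE.format(system_msg=system_msg, prompt=prompt)


print(build_prompt("What is hypertension?"))
```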
We updated the tokenizer model and its related config files by removing the additional special tokens. Let us know if the fix resolves the issue with the special token.
Also, thank you for catching the incorrect repo_name in the README! Should be fixed now.
That's great. And sorry for seemingly posting as you were replying! I think we all came to the same conclusion at the same time.
I'll test it out tomorrow and close the issue if all seems good.
Hi, Thanks again for the insights, and I can see the tokenizer has been updated for the base model, although the updated tokenizer config is still actually including tokens it shouldn't for the base model:
At the moment it still adds the following:
LlamaTokenizerFast(name_or_path='epfl-llm/meditron-7b', vocab_size=32000,
...
added_tokens_decoder={
0: AddedToken("
32000: AddedToken("<CLS>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32001: AddedToken("<SEP>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32002: AddedToken("<EOD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32003: AddedToken("<MASK>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
32004: AddedToken("<PAD>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
But the model only has embeddings for 32,000 and its known vocab size is similarly 32,000. So if you try to use the PAD token with the base model, it runs into the same problems described above.
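The check described here can be sketched as a small helper that lists every added token whose id falls outside the embedding matrix. The helper name `tokens_without_embeddings` is hypothetical; the input mapping is shaped like the token-to-id dict returned by `tokenizer.get_added_vocab()`, and the example values are the five tokens listed above.

```python
from typing import Dict, List, Tuple


def tokens_without_embeddings(
    added_vocab: Dict[str, int], num_embeddings: int
) -> List[Tuple[int, str]]:
    """Return (id, token) pairs whose id has no row in the embedding matrix."""
    return sorted(
        (idx, tok) for tok, idx in added_vocab.items() if idx >= num_embeddings
    )


# Added tokens as reported in the tokenizer repr above.
added_vocab = {
    "<CLS>": 32000,
    "<SEP>": 32001,
    "<EOD>": 32002,
    "<MASK>": 32003,
    "<PAD>": 32004,
}
for idx, tok in tokens_without_embeddings(added_vocab, 32000):
    print(idx, tok)
```

With a 32000-row embedding matrix all five tokens are reported, since valid ids run from 0 to 31999.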
Will keep this issue open for now if that is okay.
Hi! We just updated the model and tokenizer with a consistent vocab size (32017). Let us know if the issue has been lifted this time. Thanks!
Hi! Thanks for staying on top of this. I just checked and yes, it seems the base model is now aligned with the tokenizer and
Great work and repo - however there is a tokenizer issue with the base version of the model.
When trying to simply prompt the base model with the suggested format, it runs into CUDA errors, which seem to indicate a tokenizer/embedding mismatch.
Working example:
Gives us this prompt:
Use vanilla HF pipeline:
Leads to:
But it all works fine if the special formatting is not provided. I understand the special formatting was only for the finetuned versions, but the tokenizer has these special tokens added for the base model too, which seems problematic.
I hope this is enough detail to go on, but it's throwing me a bit. It seems like the special tokens do not play nicely with the base model.
Environment details:
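An embedding lookup with an out-of-range id on the GPU typically surfaces as an opaque device-side error, so a CPU-side check of the encoded prompt before calling the model can give a much clearer message. The following is a minimal sketch under that assumption; the helper name `out_of_range_positions` and the example ids are hypothetical.

```python
from typing import List, Sequence


def out_of_range_positions(input_ids: Sequence[int], num_embeddings: int) -> List[int]:
    """Return the positions whose token id has no row in the embedding matrix."""
    return [
        pos
        for pos, tok_id in enumerate(input_ids)
        if not 0 <= tok_id < num_embeddings
    ]


# Hypothetical encoded prompt: two ids lie beyond a 32000-row embedding matrix.
ids = [1, 32005, 1788, 32006, 13]
bad = out_of_range_positions(ids, 32000)
if bad:
    print(f"token ids at positions {bad} exceed the embedding size")
```

In practice `input_ids` would be the list produced by the tokenizer for the formatted prompt, and `num_embeddings` the row count of the model's input embedding layer.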
Python 3.9
Pip packages:
Package Version
accelerate 0.20.3
aiofiles 23.2.1
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.1.2
annotated-types 0.6.0
anyio 3.7.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
bertopic 0.16.0
blis 0.7.11
catalogue 2.0.10
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.7
cloudpathlib 0.16.0
cmake 3.26.4
colorama 0.4.6
comm 0.1.3
confection 0.1.4
contourpy 1.2.0
cycler 0.12.1
cymem 2.0.8
Cython 0.29.36
datasets 2.13.1
debugpy 1.6.7
decorator 5.1.1
dill 0.3.6
einops 0.6.1
en-core-web-sm 3.7.1
exceptiongroup 1.1.3
executing 1.2.0
fastapi 0.104.1
fastjsonschema 2.19.0
ffmpy 0.3.1
filelock 3.12.2
fonttools 4.44.0
frozenlist 1.3.3
fsspec 2023.6.0
gradio 4.2.0
gradio_client 0.7.0
h11 0.14.0
hdbscan 0.8.33
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.15.1
idna 3.4
importlib-metadata 6.7.0
importlib-resources 6.1.1
ipykernel 6.23.3
ipython 8.14.0
jedi 0.18.2
Jinja2 3.1.2
joblib 1.3.2
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.3.0
jupyter_core 5.3.1
kiwisolver 1.4.5
langcodes 3.3.0
lit 16.0.6
llvmlite 0.41.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.1
matplotlib-inline 0.1.6
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
murmurhash 1.0.10
nbformat 5.9.2
nest-asyncio 1.5.6
networkx 3.1
nltk 3.8.1
numba 0.58.1
numpy 1.25.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
packaging 23.1
pandas 2.0.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.1.2
platformdirs 3.8.0
plotly 5.18.0
preshed 3.0.9
prompt-toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pydantic 2.4.2
pydantic_core 2.10.1
pydub 0.25.1
Pygments 2.15.1
pynndescent 0.5.11
pyparsing 3.1.1
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
pyzmq 25.1.0
referencing 0.30.2
regex 2023.6.3
requests 2.31.0
rich 13.6.0
rpds-py 0.12.0
safetensors 0.3.1
scikit-learn 1.3.2
scipy 1.11.4
semantic-version 2.10.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 58.1.0
shellingham 1.5.4
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
stack-data 0.6.2
starlette 0.27.0
sympy 1.12
tenacity 8.2.3
thinc 8.2.1
threadpoolctl 3.2.0
tokenizers 0.13.3
tomlkit 0.12.0
toolz 0.12.0
torch 2.1.1
torchvision 0.16.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.1.0
typer 0.9.0
typing_extensions 4.8.0
tzdata 2023.3
umap-learn 0.5.5
urllib3 2.0.3
uvicorn 0.24.0.post1
wasabi 1.1.2
wcwidth 0.2.6
weasel 0.3.4
websockets 11.0.3
wheel 0.40.0
xxhash 3.2.0
yarl 1.9.2
zipp 3.15.0