Stability-AI / StableLM

StableLM: Stability AI Language Models
Apache License 2.0

Unclear tokenizer class #73

Closed carmocca closed 1 year ago

carmocca commented 1 year ago

The repo yaml points to HFTokenizer https://github.com/Stability-AI/StableLM/blob/e60081/configs/stablelm-base-alpha-3b.yaml#L108

But the HF upload points to GPTNeoXTokenizer https://huggingface.co/stabilityai/stablelm-base-alpha-3b/blob/main/tokenizer_config.json#L7

Which one is correct?

mcmonkey4eva commented 1 year ago

(Not confident on this, take with a grain of salt; this is based on a bit of quick research.) It looks like GPT-NeoX defines "HFTokenizer" as a tokenizer type, and HF defines "GPTNeoXTokenizer" as a tokenizer type, i.e. each project has a tokenizer type named after the other project. Confusing, but it makes sense.

So, if you're writing your own code to handle StableLM, the correct class to use depends on which library you're using.

If you're using https://github.com/EleutherAI/gpt-neox - use HFTokenizer https://github.com/EleutherAI/gpt-neox/blob/main/megatron/tokenizer/tokenizer.py#L224

If you're using https://github.com/huggingface/transformers - use GPTNeoXTokenizerFast https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py

(Or just use autoloading / copy from already-working examples, and save yourself the confusion.)
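To summarize the mapping above, here is a small sketch. The lookup table and function name are illustrative only, not part of either library; the comment at the end shows the autoloading route via transformers' real AutoTokenizer API:

```python
# Illustrative lookup capturing the cross-named tokenizer classes described above.
# The dict and function are hypothetical helpers, not part of either library.
TOKENIZER_CLASS_BY_LIBRARY = {
    "gpt-neox": "HFTokenizer",               # EleutherAI/gpt-neox's wrapper type
    "transformers": "GPTNeoXTokenizerFast",  # Hugging Face's GPT-NeoX tokenizer
}

def tokenizer_class_for(library: str) -> str:
    """Return the tokenizer class name to use inside the given library."""
    return TOKENIZER_CLASS_BY_LIBRARY[library]

print(tokenizer_class_for("transformers"))  # GPTNeoXTokenizerFast
print(tokenizer_class_for("gpt-neox"))      # HFTokenizer

# With Hugging Face transformers you can sidestep the question entirely and
# let AutoTokenizer resolve the class from the model's tokenizer_config.json:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
```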

carmocca commented 1 year ago

That makes sense. Thank you!