Closed: carmocca closed this issue 1 year ago
(Not confident on this, so take it with a grain of salt; this is based on a bit of quick research.) It looks like GPT-NeoX defines a tokenizer type called "HFTokenizer", while HF transformers defines a tokenizer type called "GPTNeoXTokenizer". In other words, each project names its tokenizer type after the other project. A bit confusing, but it makes sense.
So if you're writing your own code to handle StableLM, the correct class to use depends on which library you're using:
If you're using https://github.com/EleutherAI/gpt-neox - use HFTokenizer
https://github.com/EleutherAI/gpt-neox/blob/main/megatron/tokenizer/tokenizer.py#L224
If you're using https://github.com/huggingface/transformers - use GPTNeoXTokenizerFast
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/tokenization_gpt_neox_fast.py
(or just use autoloading libraries / copy from already-working examples, and save yourself the confusion)
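To sketch why the autoloading route sidesteps the confusion: transformers reads the `tokenizer_class` field from the repo's tokenizer_config.json and resolves the class for you. Below is a stdlib-only illustration of that lookup, not transformers' actual code; the suffix-appending is an assumption that mirrors transformers' fast-tokenizer naming scheme.

```python
import json

# Minimal stand-in for the relevant line of tokenizer_config.json
# (the real file has more fields; this is just the one AutoTokenizer keys on).
config = json.loads('{"tokenizer_class": "GPTNeoXTokenizer"}')

# AutoTokenizer resolves "GPTNeoXTokenizer" to the fast implementation,
# GPTNeoXTokenizerFast, when a fast tokenizer is available (sketched here
# by appending the "Fast" suffix, matching transformers' naming convention).
class_name = config["tokenizer_class"]
resolved = class_name if class_name.endswith("Fast") else class_name + "Fast"
print(resolved)  # GPTNeoXTokenizerFast
```

So `AutoTokenizer.from_pretrained(...)` on the HF repo ends up handing you a `GPTNeoXTokenizerFast` without you ever having to pick the class name yourself.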
That makes sense. Thank you!
The repo yaml points to HFTokenizer:
https://github.com/Stability-AI/StableLM/blob/e60081/configs/stablelm-base-alpha-3b.yaml#L108

But the HF upload points to GPTNeoXTokenizer:
https://huggingface.co/stabilityai/stablelm-base-alpha-3b/blob/main/tokenizer_config.json#L7

Which one is correct?
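To lay the two settings side by side (a sketch: the values are taken from this thread, and the key names are my recreation of the linked config lines, which may have drifted):

```python
import json

# Hypothetical recreation of the gpt-neox yaml setting from the StableLM repo.
neox_yaml_setting = {"tokenizer-type": "HFTokenizer"}

# Hypothetical recreation of the HF upload's tokenizer_config.json entry.
hf_json_setting = json.loads('{"tokenizer_class": "GPTNeoXTokenizer"}')

# Each side names the tokenizer after the *other* project, so both entries
# can describe the same underlying tokenizer files without contradiction.
print(neox_yaml_setting["tokenizer-type"], hf_json_setting["tokenizer_class"])
```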