FranxYao / Long-Context-Data-Engineering

Implementation of the paper *Data Engineering for Scaling Language Models to 128K Context*

Is there GPT_CHAR_TO_TOKEN_RATIO? #4

Closed · ZetangForward closed this 9 months ago

ZetangForward commented 9 months ago

Hi, thanks for your great work. I noticed there is a `LLAMA_CHAR_TO_TOKEN_RATIO` hyper-parameter in your script. I want to test GPT-Neo with your script; could you provide the corresponding `GPT_CHAR_TO_TOKEN_RATIO`? Thx

FranxYao commented 9 months ago

I haven't actually tested it, but you can estimate it yourself by letting the GPT-Neo tokenizer tokenize, say, about 500M of data (it does not take much time) and then computing the `CHAR_TO_TOKEN_RATIO`.
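
A minimal sketch of this estimation, assuming the Hugging Face `EleutherAI/gpt-neo-1.3B` tokenizer and WikiText-103 as a stand-in sample corpus (neither the model checkpoint nor the dataset is specified in this thread, so both are illustrative choices):

```python
# Sketch: estimate CHAR_TO_TOKEN_RATIO for GPT-Neo by tokenizing a text
# sample and dividing total characters by total tokens.
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed checkpoint; any GPT-Neo tokenizer shares the same vocabulary.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")

# A small public-corpus slice stands in for the ~500M characters
# suggested above; a larger, more representative sample is better.
sample = load_dataset("wikitext", "wikitext-103-raw-v1", split="train[:1%]")

total_chars, total_tokens = 0, 0
for example in sample:
    text = example["text"]
    if not text:
        continue
    total_chars += len(text)
    total_tokens += len(tokenizer(text)["input_ids"])

GPT_NEO_CHAR_TO_TOKEN_RATIO = total_chars / total_tokens
print(f"Estimated CHAR_TO_TOKEN_RATIO: {GPT_NEO_CHAR_TO_TOKEN_RATIO:.3f}")
```

The resulting ratio depends on the corpus (code, web text, and non-English text tokenize differently), so it should be estimated on data similar to what the script will actually process.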

ZetangForward commented 9 months ago

> I haven't actually tested it, but you can estimate it yourself by letting the GPT-Neo tokenizer tokenize, say, about 500M of data (it does not take much time) and then computing the `CHAR_TO_TOKEN_RATIO`.

OK, I will try, thx