VinAIResearch / PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)
Apache License 2.0

HuggingFace tokenizer does not pad to max_length? #30

Closed CQHofsns closed 4 months ago

CQHofsns commented 4 months ago

Dear VinAI team,

Thank you for sharing your work with us. I tried your PhoGPT tokenizer with the max length set to 8192, but the tokenizer did not add any padding tokens to the output. Here is an example:

from transformers import AutoTokenizer

phogpt_tokenizer = AutoTokenizer.from_pretrained("vinai/PhoGPT-4B", trust_remote_code=True)
print(
    phogpt_tokenizer(
        "Đây là câu hỏi",
        max_length=8192,
        truncation=True,
        padding=True
    )
)

The output is: {'input_ids': [2985, 270, 1117, 1378], 'attention_mask': [1, 1, 1, 1]}

You can see that the output token list only has 4 tokens. Shouldn't it have 8192 tokens instead?

CQHofsns commented 4 months ago

Sorry, my bad: the padding argument should be set as "padding='max_length'" instead of "padding=True".
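For anyone hitting the same thing: `padding=True` is equivalent to `padding='longest'`, which pads only up to the longest sequence in the batch, so a single input gets no padding at all; `padding='max_length'` pads every sequence out to `max_length`. A minimal pure-Python sketch of the two strategies (a hypothetical helper to illustrate the semantics, not the actual HuggingFace implementation):

```python
# Sketch of the two HuggingFace padding strategies (hypothetical helper,
# not the real tokenizer code).

def pad_batch(batch, padding, max_length=None, pad_id=0):
    """Pad lists of token ids according to the chosen strategy."""
    if padding == "longest":          # what padding=True means
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":     # pad every sequence to max_length
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding}")
    return [ids + [pad_id] * (target - len(ids)) for ids in batch]

ids = [[2985, 270, 1117, 1378]]       # the 4 tokens from the issue

# padding=True / "longest": a lone sequence is already the longest,
# so nothing is added.
print(len(pad_batch(ids, "longest")[0]))            # 4

# padding="max_length": padded out to the requested length.
print(len(pad_batch(ids, "max_length", 8192)[0]))   # 8192
```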