facebookresearch / MobileLLM

MobileLLM Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In ICML 2024.

Unable to replicate Benchmark results and generation is broken #15

Open Qubitium opened 3 weeks ago

Qubitium commented 3 weeks ago

Hi Meta team, we just added MobileLLM support to GPTQModel for 4-bit GPTQ quantization. However, we are running into a problem while trying to establish (recreate) the baseline native/bf16 benchmark values with lm-eval. lm-eval runs, but the scores are far worse than those of other 1B models, including Llama 3.2 1B. We suspect there is an issue with the tokenizer.

lm_eval --model hf --model_args pretrained="/monster/data/model/MobileLLM-1B/",parallelize=True,use_fast_tokenizer=False --device cuda --tasks ifeval,gsm8k_cot --batch_size 32 --trust_remote_code

We are also running into a problem when using pure HF Transformers to run the model: the output just repeats, as if EOS is never generated:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the model and the slow (sentencepiece) tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto", attn_implementation="flash_attention_2", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left", trust_remote_code=True, use_fast=False)

    # Explicitly register the Llama-style special tokens and reuse <unk> as the pad token
    tokenizer.add_special_tokens(
        {
            "eos_token": "</s>",
            "bos_token": "<s>",
            "unk_token": "<unk>",
        }
    )
    tokenizer.pad_token_id = tokenizer.unk_token_id
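
For completeness, here is a minimal sketch of how we invoke generation (the prompt and generation settings below are placeholders, not the exact values we test with):

    # Minimal generation sketch; prompt and max_new_tokens are placeholders
    prompt = "What is the capital of France?"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=128,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Regardless of the prompt, generation keeps repeating the response until max_new_tokens is exhausted rather than stopping at EOS.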

We suspect these two issues are related, since the repeating output would degrade the lm-eval benchmark scores.

Name: transformers
Version: 4.46.1

What is the correct way to generate output with MobileLLM and configure the tokenizer? Could you provide a few sample inputs/outputs so we can verify whether this is related to the model code, config, or tokenizer?

Even better, which tool is Meta using to generate the benchmark results? As of now, the lm-eval results place this model well below Llama 3.2 1B on multiple fronts.

@liuzechun