bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!

Empty Generations / Failing to Reproduce 40% on HumanEval #148

Open leonardtang opened 9 months ago

leonardtang commented 9 months ago

Hi all, I've set up StarCoder as follows:

import torch
from transformers import AutoModelForCausalLM

# setup_tokenizer and construct_stopping_criteria are helper functions defined elsewhere

def setup_model_tokenizer(
    path,
    device=None,
    bit_4=False,
    bit_8=False,
    max_memory=None,
    bnb_config=None,
):
    tokenizer = setup_tokenizer(path)
    if torch.cuda.device_count() > 1:
        # Multi-GPU: shard the model across devices.
        model = AutoModelForCausalLM.from_pretrained(
            path,
            trust_remote_code=True,
            device_map="auto",
            load_in_4bit=bit_4,
            load_in_8bit=bit_8,
            max_memory=max_memory,
            quantization_config=bnb_config,
        ).eval()
    else:
        if not bit_4 and not bit_8:
            # Single GPU, full precision.
            model = (
                AutoModelForCausalLM.from_pretrained(path, trust_remote_code=True)
                .to(device)
                .eval()
            )
        else:
            # Single GPU, quantized.
            model = AutoModelForCausalLM.from_pretrained(
                path,
                trust_remote_code=True,
                load_in_4bit=bit_4,
                load_in_8bit=bit_8,
                quantization_config=bnb_config,
            ).eval()
    return tokenizer, model

gen_checkpoint = "bigcode/starcoder"
gen_device = "cuda"
gen_tokenizer, gen_model = setup_model_tokenizer(
    gen_checkpoint, bit_4=False, device=gen_device, bnb_config=None
)

gen_outputs_dict = gen_model.generate(
    **gen_inputs,
    pad_token_id=gen_tokenizer.eos_token_id,
    max_new_tokens=NEW_TOKENS,
    return_dict_in_generate=True,
    do_sample=True,
    temperature=TEMP,
    top_p=0.95,
    top_k=0,
    stopping_criteria=construct_stopping_criteria(
        "code", STOP_SEQS, gen_tokenizer, gen_device
    ),
)

The stop tokens I'm using are a subset of those found in the Codex paper: STOP_SEQS = ["\nclass", "\ndef"].
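
For reference, construct_stopping_criteria just wraps a stop-sequence check in transformers' StoppingCriteria API. A minimal sketch along those lines (the class name and details here are illustrative, not my exact implementation):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopSequenceCriteria(StoppingCriteria):
    """Stop generation once any stop sequence appears in the newly generated text."""

    def __init__(self, stop_seqs, tokenizer, prompt_len):
        self.stop_seqs = stop_seqs
        self.tokenizer = tokenizer
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the continuation so the prompt itself never triggers a stop.
        generated = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return any(seq in generated for seq in self.stop_seqs)

# Usage (illustrative): pass the tokenized prompt length so only new tokens are checked.
# prompt_len = gen_inputs["input_ids"].shape[1]
# stopping_criteria = StoppingCriteriaList([StopSequenceCriteria(STOP_SEQS, gen_tokenizer, prompt_len)])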

However, it looks like I'm consistently getting empty generations -- just an EOS token. Concretely, around 20% of my generations on HumanEval are empty.

I'm using the suggested prompt as well, i.e. "<filename>solutions/solution_1.py\n# Here is the correct implementation of the code exercise\n".
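
In other words, each HumanEval prompt is prepended with that prefix before tokenization, roughly like this (variable names are illustrative, not from my actual script):

PREFIX = "<filename>solutions/solution_1.py\n# Here is the correct implementation of the code exercise\n"
full_prompt = PREFIX + problem["prompt"]  # problem is one HumanEval task
gen_inputs = gen_tokenizer(full_prompt, return_tensors="pt").to(gen_device)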

I'm getting around 15% on HumanEval, not 40% as stated in the paper. I'm setting TEMP = 0.2 and NEW_TOKENS=128. Would somebody be able to point out what might be going wrong?

loubnabnl commented 7 months ago

Can you try again using the framework we used for evaluation, https://github.com/bigcode-project/bigcode-evaluation-harness? There's an argument for adding a prefix. In your code it's not clear whether you stripped the prompts (this impacts performance), and we also use more stop words.
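
For reference, the Codex-style stop-word set is larger than the two sequences above, something like this (the exact list used in the harness may differ slightly):

STOP_SEQS = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]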

geajack commented 4 months ago

I'm having a similar problem - lots of empty generations on a straightforward prompt from HumanEval. For example, this code:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda" # for GPU usage or "cpu" for CPU usage

prompt = """\
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    \""" Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    \"""
"""

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_auth_token="<auth_token>")
model = AutoModelForCausalLM.from_pretrained(checkpoint, use_auth_token="<auth_token>", device_map="cuda").to(device)

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Just generates this output:

Loading checkpoint shards: 100%|██████████| 7/7 [00:32<00:00,  4.63s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

def has_close_elements_v2(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

def has_close_elements_v3(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    """

loubnabnl commented 4 months ago

Hi, this prompt is not stripped; you need to remove the trailing \n for it to work properly. I also just ran the code from the harness, and it reproduces the reported numbers.
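
With the snippet above, that just means stripping the prompt before tokenizing, e.g.:

# strip the trailing newline so the model completes the function body
inputs = tokenizer.encode(prompt.rstrip(), return_tensors="pt").to(device)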