bigcode-project / starcoder2

Home of StarCoder2!
Apache License 2.0

Better inference based on starcoder2-3b model #13

Open HeroSong666 opened 6 months ago

HeroSong666 commented 6 months ago

I am new to StarCoder.

When I run the following demo:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "./starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)

inputs = tokenizer.encode("def is_prime(n):", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

it returns:

def is_prime():
    """
    This function checks if a number is prime or not.
    """

The output is cut off before the function is finished, so I set max_length=120. Then it returns:

def is_prime():
    """
    This function checks if a number is prime or not.
    """
    num = int(input("Enter a number: "))
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                print(num, "is not a prime number")
                break
        else:
            print(num, "is a prime number")
    else:
        print(num, "is not a prime number")

is_prime()
<file_sep>/README.md
# Python-

The part

is_prime()
<file_sep>/README.md
# Python-

is redundant. My current workaround is:

generated_code = tokenizer.decode(outputs[0])
if "<file_sep>" in generated_code:
    generated_code = generated_code.split("<file_sep>")[0]
print(generated_code)

But I don't think this is a good approach. I want the model to return the result in one go, without generating the redundant parts. How can I do that? Could you give me some advice?
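One possible direction, as a minimal sketch rather than a confirmed answer: `generate` in transformers accepts an `eos_token_id` argument that may be a list of ids, so `<file_sep>` can be treated as an extra end-of-sequence token and generation stops as soon as it appears. The function names `truncate_at_stop` and `generate_until_file_sep` are hypothetical helpers. The sketch assumes that `<file_sep>` is a single token in the tokenizer's vocabulary (the decoded output above suggests it is) and that `model` and `tokenizer` are loaded as in the demo. Note that a trailing `is_prime()` call is ordinary generated code, so a stop token alone would not remove it.

```python
def truncate_at_stop(text, stop_sequences=("<file_sep>",)):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


def generate_until_file_sep(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a completion that ends when <file_sep> (or EOS) is produced.

    `model` and `tokenizer` are expected to be the objects loaded in the
    demo (AutoModelForCausalLM / AutoTokenizer).
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Assumption: "<file_sep>" maps to a single token id in this vocabulary.
    stop_ids = [tokenizer.convert_tokens_to_ids("<file_sep>")]
    stop_texts = ["<file_sep>"]
    if tokenizer.eos_token_id is not None:
        stop_ids.append(tokenizer.eos_token_id)
        stop_texts.append(tokenizer.eos_token)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,      # budget for new tokens only
        eos_token_id=stop_ids,              # stop on any of these ids
        pad_token_id=tokenizer.eos_token_id,
    )

    # The stop token itself is still part of the output, so strip it here.
    text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    return truncate_at_stop(text, tuple(stop_texts))
```

With the demo's objects this would be called as `print(generate_until_file_sep(model, tokenizer, "def is_prime(n):"))`. Compared with splitting after the fact, the model stops spending tokens once the file boundary is reached, and `max_new_tokens` does not count the prompt the way `max_length` does.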

HeroSong666 commented 6 months ago

Alternatively, I noticed that on https://huggingface.co/bigcode/starcoder2-3b the Inference API generates code piece by piece: each time I press Compute, it adds a bit more. How can I implement this? For example, in Python, each request would return a portion of the result, and the next request would send the previous request plus the results returned so far. That way the code could be completed step by step without producing redundant parts. Many thanks for your advice!
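The step-by-step behaviour described above can be imitated locally by generating a small number of tokens per call and feeding the accumulated text back in as the next prompt. The following is a sketch under assumptions, not a description of how the hosted API works: `complete_in_steps` and `make_step_fn` are hypothetical names, `<file_sep>` is assumed to be a single token, and `model` and `tokenizer` are the objects from the demo.

```python
def complete_in_steps(step_fn, prompt, max_steps=8):
    """Extend `prompt` chunk by chunk.

    `step_fn(text)` must return `(new_text, done)`, where `new_text` is only
    the continuation and `done` signals that a stop token was reached.
    """
    text = prompt
    for _ in range(max_steps):
        chunk, done = step_fn(text)
        text += chunk
        if done:
            break
    return text


def make_step_fn(model, tokenizer, tokens_per_step=32):
    """Build a `step_fn` backed by a causal LM and its tokenizer."""
    # Assumption: "<file_sep>" maps to a single token id in this vocabulary.
    stop_ids = [tokenizer.convert_tokens_to_ids("<file_sep>")]
    if tokenizer.eos_token_id is not None:
        stop_ids.append(tokenizer.eos_token_id)

    def step_fn(text):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=tokens_per_step,
            eos_token_id=stop_ids,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Keep only the tokens generated after the prompt.
        prompt_len = inputs["input_ids"].shape[1]
        new_ids = outputs[0][prompt_len:].tolist()
        done = bool(new_ids) and new_ids[-1] in stop_ids
        if done:
            new_ids = new_ids[:-1]  # drop the stop token itself
        return tokenizer.decode(new_ids), done

    return step_fn
```

A call such as `complete_in_steps(make_step_fn(model, tokenizer), "def is_prime(n):")` would then return the prompt plus all chunks up to the first stop token. Because the text is re-tokenized on every step, the result may differ slightly from a single long `generate` call.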