Closed ChuangLee closed 2 months ago
Thank you for your feedback. We will investigate this issue further and take the necessary steps to address it.
As a temporary workaround, you could add the offending keywords to bad_words_ids to forcibly prevent the model from generating those tokens. This may help alleviate the issue.
For example:
# TOKENIZER, MODEL, and device are assumed to be initialized elsewhere.
bad_words = ["//", " //"]
# bad_words_ids expects a list of token-id lists; encode without special tokens.
bad_word_ids = [TOKENIZER.encode(bad_word, add_special_tokens=False) for bad_word in bad_words]
input_text = """public class TestForLLM {
//Write a method to compare two files for equality, and then test it.
public static boolean compareFiles(String file1, String file2) {"""
model_inputs = TOKENIZER([input_text], return_tensors="pt").to(device)
outputs = MODEL.generate(model_inputs.input_ids, max_new_tokens=512, do_sample=False, bad_words_ids=bad_word_ids)[0]
output_text = TOKENIZER.decode(outputs, skip_special_tokens=True)
This greatly reduces the helpfulness of the model.