drugilsberg closed this issue 7 months ago
Hi, I checked the original results and I verified that they are indeed correct. I am investigating right now whether the issue is related only to the checkpoint uploaded to HF. I will let you know about that asap.
In the meantime, some comments/questions:
- I would suggest, as a postprocessing step, replacing any `<unk>` token in the output with `\\` (something like this is enough: `output = output.replace("<unk>", "\\\\").strip()`). We have noticed that sometimes the model generates unk tokens instead of `\\` in this specific scenario, and this simple postprocessing fixes it without creating problems in other cases. Note that this should help only the text2molecule case mentioned in GT4SD/gt4sd-core#229.
- Do you use beam search for the generation?
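A minimal, self-contained sketch of the suggested postprocessing (the sample string is invented for illustration; only the `replace`/`strip` logic comes from the suggestion above):

```python
def postprocess(output: str) -> str:
    """Replace <unk> tokens (which the model may emit instead of backslashes) and trim."""
    return output.replace("<unk>", "\\\\").strip()

# Hypothetical raw generation output, just to show the string transformation:
raw = " C(<unk>C=C<unk>Cl)O "
print(postprocess(raw))  # -> C(\\C=C\\Cl)O
```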
Thanks for your reply!
```python
import pandas as pd
from tqdm import tqdm

# `tokenizer`, `model`, `device`, `max_length` and `num_beams` are defined earlier
test_df = pd.read_csv("hand_annotated_test.csv")
test_paragraphs = list(test_df["paragraphs"])
test_actions_preds = []
for instance in tqdm(test_paragraphs):
    input_text = f"Which actions are described in the following paragraph: {instance}"
    text = tokenizer(input_text, return_tensors="pt").to(device)
    output = model.generate(input_ids=text["input_ids"], max_length=max_length, num_beams=num_beams)
    output = tokenizer.decode(output[0].cpu())
    # Truncate at the end-of-sequence token and strip padding
    output = output.split(tokenizer.eos_token)[0]
    output = output.replace(tokenizer.pad_token, "")
    output = output.strip()
    test_actions_preds.append(output)
```
Looking forward to the corrected checkpoint, thanks!
I updated the demo code to include the needed postprocessing step for the description-to-SMILES case. Thanks for pointing this out! The number of beams in beam search does affect the generated results: the more beams, the more token combinations are examined during generation. We compare the top-1 output, but the best output can vary with a different number of beams. I will let you know about the checkpoint asap.
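To see why the top-1 result depends on the beam width, here is a toy beam search over a hand-made two-step distribution (the probabilities and tokens are invented for illustration, not taken from the model): with one beam the search commits greedily to the locally best first token, while a wider beam recovers a globally better sequence.

```python
import math

# Invented two-step toy distributions, for illustration only:
STEP_PROBS = {
    (): {"a": 0.6, "b": 0.4},          # first-token probabilities
    ("a",): {"x": 0.55, "y": 0.45},    # continuations after "a"
    ("b",): {"x": 0.9, "y": 0.1},      # continuations after "b"
}

def beam_search(num_beams, length=2):
    """Return the top-1 sequence found with the given beam width."""
    beams = [((), 0.0)]  # (prefix, cumulative log-prob)
    for _ in range(length):
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in STEP_PROBS[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0][0]

print(beam_search(1))  # greedy commits to "a" first -> ('a', 'x'), prob 0.33
print(beam_search(2))  # a wider beam finds ('b', 'x'), prob 0.36
```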
Closing due to inactivity
Issue originally opened here: https://github.com/GT4SD/gt4sd-core/issues/229