GT4SD / multitask_text_and_chemistry_t5

Code for "Unifying Molecular and Textual Representations via Multi-task Language Modelling" @ ICML 2023
https://huggingface.co/spaces/GT4SD/multitask-text-and-chemistry-t5
MIT License

Problem with the metrics in the paper #1

Closed · zw-SIMM closed this issue 1 year ago

zw-SIMM commented 1 year ago

Nice work! I have some questions about the metrics used in Table 1, especially for the text2text (paragraph-to-actions) task. You mention that "For the forward prediction task the metric is accuracy; for the retrosynthesis task the metric is roundtrip accuracy (Schwaller et al., 2020); for all the other tasks the BLEU score." So which metric do you use for the text2text (paragraph-to-actions) task: BLEU-2 or BLEU-4?

christofid commented 1 year ago

Hi @medicine-wave, we used BLEU-2 for the paragraph-to-actions task.

zw-SIMM commented 1 year ago

> Hi @medicine-wave, we used BLEU-2 for the paragraph-to-actions task.

OK, thanks. Could you provide the code to make evaluation easier? I found that the BLEU metric used in Paragraph2actions [Vaucher et al., 2020] is corpus-level BLEU (BLEU-4); is there a difference?

christofid commented 1 year ago

In Paragraph2actions [Vaucher et al., 2020] they used a slightly modified BLEU score, while in our case we used the standard BLEU-2 metric, sticking to the existing NLP metrics used for the other tasks of interest in this work.

I attach the piece of code that we used to compute this metric. I will try to find some time to clean it up and provide the whole evaluation script.

    from nltk.translate.bleu_score import corpus_bleu

    # outputs: iterable of (ground_truth, prediction) string pairs
    references, hypotheses = [], []

    for gt, out in outputs:
        # note: this tokenizes at the character level
        gt_tokens = [c for c in gt]
        out_tokens = [c for c in out]
        references.append([gt_tokens])
        hypotheses.append(out_tokens)

    # weights=(0.5, 0.5) restricts the score to unigrams and bigrams, i.e. BLEU-2
    bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))
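
For reference, here is a minimal, self-contained way to run the snippet above; the (ground truth, prediction) pairs in `outputs` are made-up placeholders for illustration, not examples from the paper's test set:

    from nltk.translate.bleu_score import corpus_bleu

    # hypothetical (ground_truth, prediction) pairs, for illustration only
    outputs = [
        ("ADD water; STIR for 5 minutes.", "ADD water; STIR for 10 minutes."),
        ("FILTER the mixture.", "FILTER the mixture."),
    ]

    references, hypotheses = [], []
    for gt, out in outputs:
        references.append([[c for c in gt]])  # character-level reference tokens
        hypotheses.append([c for c in out])   # character-level hypothesis tokens

    # corpus-level BLEU-2 over all pairs
    bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))
    print(f"BLEU-2: {bleu2:.4f}")
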
zw-SIMM commented 1 year ago


Many thanks for your reply!