F1 score in 01_fine-tuning-titan-lite.ipynb

jicowan commented 2 months ago

I had to run the following code block twice before it printed the scores. The first run produced no output:

from bert_score import score
# Wrap the reference in a list only once, so re-running the cell does not nest it again.
if isinstance(reference_summary, str):
    reference_summary = [reference_summary]
# score() returns precision, recall, and F1 tensors, in that order.
fine_tuned_P, fine_tuned_R, fine_tuned_F1 = score(fine_tuned_generated_response, reference_summary, lang="en")
base_model_P, base_model_R, base_model_F1 = score(base_model_generated_response, reference_summary, lang="en")
print("F1 score: base model ", base_model_F1)
print("F1 score: fine-tuned model", fine_tuned_F1)

Final output:

F1 score: base model  tensor([0.8868])
F1 score: fine-tuned model tensor([0.8532])
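
For anyone reading these numbers: BERTScore F1 is the harmonic mean of the precision and recall tensors that score() returns, and in this run the fine-tuned model scored lower than the base model. Below is a minimal sketch of that arithmetic in plain Python, using the two F1 values printed above. The helper names f1_from_precision_recall and score_delta are hypothetical and are not part of bert_score or the notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_delta(base_f1: float, fine_tuned_f1: float) -> float:
    """Positive when the fine-tuned model scores higher than the base model."""
    return fine_tuned_f1 - base_f1


# F1 values copied from the output reported above (hypothetical helper, real numbers).
delta = score_delta(base_f1=0.8868, fine_tuned_f1=0.8532)
print(f"fine-tuned minus base F1: {delta:+.4f}")
```

With real tensors, calling .item() on a single-element tensor such as base_model_F1 gives the plain float these helpers expect.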

jicowan commented 2 months ago

You might want to use the Model Evaluation feature in Bedrock to compare the models instead of the bert_score score function.

w601sxs commented 2 months ago

Thanks @jicowan - we will get back to this; prioritizing other bugs for now.

aws-samples / amazon-bedrock-workshop

F1 score in 01_fine-tuning-titan-lite.ipynb #242