alibaba / FederatedScope

An easy-to-use federated learning platform
https://www.federatedscope.io
Apache License 2.0

Customize metric in LLM finetuning #737

Closed shuangyichen closed 8 months ago

shuangyichen commented 8 months ago

I noticed that in LLM fine-tuning there is no accuracy reported. If I want to get the accuracy (of the global model), which part of the code should I modify? Thanks!

rayrayraykk commented 8 months ago

Thank you for bringing up this concern. Performing on-the-fly evaluations of LLMs can introduce significant inefficiencies. However, you can still achieve your objective by integrating custom metrics into the evaluation process.

To do this, please refer to the trainer implementation in FederatedScope located here: https://github.com/alibaba/FederatedScope/blob/8da9f9fffc0309acbea7da52a050a59fcd791d52/federatedscope/llm/trainer/trainer.py#L131.

As an example, you could use `ctx.model` to generate predictions just above the indicated line. Once the results are produced, you can record them in the `eval_metrics` dictionary for further processing or analysis.

Please bear in mind that the model's state transitions between TRAIN and EVAL modes must be handled correctly. Mishandling these states can lead to inaccurate evaluation metrics, since the mode affects internal behavior such as dropout and batch normalization.
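For illustration, here is a minimal, hypothetical sketch of such a custom metric, assuming (as described above) that `ctx.model` is available during evaluation and that results are collected into an `eval_metrics` dictionary. The exact attributes and hook signatures around the linked line may differ, so treat this as a sketch rather than a drop-in patch:

```python
import torch

@torch.no_grad()
def token_accuracy(logits: torch.Tensor, labels: torch.Tensor,
                   ignore_index: int = -100) -> float:
    """Token-level accuracy over positions whose label is not `ignore_index`."""
    predictions = logits.argmax(dim=-1)
    mask = labels != ignore_index
    correct = (predictions == labels) & mask
    return (correct.sum() / mask.sum().clamp(min=1)).item()

# Hypothetical usage inside the trainer's evaluation step, assuming a batch
# with `input_ids` and `labels` and the `eval_metrics` dict mentioned above:
#
#     ctx.model.eval()                                   # switch to EVAL mode first
#     outputs = ctx.model(input_ids=input_ids, labels=labels)
#     eval_metrics['test_acc'] = token_accuracy(outputs.logits, labels)
#     ctx.model.train()                                  # restore TRAIN mode afterwards
```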

shuangyichen commented 8 months ago


Thanks for the information! I encountered a problem: the (train, val, test) loss did decrease, but the accuracy did not increase. Here is how I compute the accuracy:

```python
# Generate a mask
mask = labels != -100

# Convert logits to predictions
probabilities = torch.softmax(logits, dim=-1)
predictions = torch.argmax(probabilities, dim=-1)

# Apply the mask to predictions and labels
filtered_predictions = torch.masked_select(predictions, mask)
filtered_labels = torch.masked_select(labels, mask)

accuracy = (filtered_predictions == filtered_labels).float().mean().item()
```

where `logits` is one of the model outputs. I found that most elements in `labels` are -100 and should be ignored, so I generate the mask this way. Could you please give me advice on how to locate the bug?