Question about the classifier used for IntentAccuracyDailyDialog.

According to the source code of class IntentAccuracyDailyDialog(BaseMetric), the intent likelihood of utterances on DailyDialog is computed by rajkumarrrk/roberta-daily-dialog-intent-classifier.

However, according to the config.json of this classifier, it is configured for emotion classification with four labels (joy, optimism, anger, and sadness), whereas the intent labels in DailyDialog are Inform, Questions, Directives, and Commissive.
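For reference, here is a minimal sketch of how the label check could be reproduced. The helper names (`load_id2label`, `unexpected_labels`) are hypothetical, and the index order of the emotion labels in `reported` is an assumption for illustration; only the label names come from the config.json. `load_id2label` needs the `transformers` package and network access, so the example below runs on the labels as reported here instead.

```python
# Intent labels annotated in DailyDialog, lowercased for comparison.
EXPECTED_INTENTS = {"inform", "questions", "directives", "commissive"}


def load_id2label(model_name):
    """Read the id -> label mapping from a model's config.json on the Hub.

    Requires `transformers` and network access; imported lazily so the
    rest of this sketch runs without it.
    """
    from transformers import AutoConfig

    config = AutoConfig.from_pretrained(model_name)
    return dict(config.id2label)


def unexpected_labels(id2label, expected=EXPECTED_INTENTS):
    """Return the label names that are not DailyDialog intent labels."""
    return sorted(
        name for name in id2label.values() if name.lower() not in expected
    )


# Labels as described above; the index order is assumed, not verified.
reported = {0: "joy", 1: "optimism", 2: "anger", 3: "sadness"}
print(unexpected_labels(reported))
```

To run the same check against the hosted config, call `unexpected_labels(load_id2label("rajkumarrrk/roberta-daily-dialog-intent-classifier"))`. An empty list would mean the config already uses the intent label names.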

So my question is: has this classifier actually been fine-tuned for intent classification on DailyDialog utterances?

Empirically, I observe that this classifier's predictions on the ground-truth utterances in DailyDialog are unbalanced and do not align well with the labelled intent distribution, as shown below.

Classification results on the test set

| Source | label-0 | label-1 | label-2 | label-3 | Intent Accuracy |
| --- | --- | --- | --- | --- | --- |
| classification on ground truth | 0.7102 | 0.0055 | 0.0275 | 0.2071 | 0.6147 |
| intent labels in DailyDialog | 0.4988 | 0.2231 | 0.1565 | 0.1213 | - |
| classification on SFT generation | 0.5363 | 0.1591 | 0.0944 | 0.2100 | 0.4034 |
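As a rough way to quantify the mismatch, the sketch below computes half the L1 distance between each predicted label distribution and the labelled one, using the numbers from the table above. The function name `half_l1_distance` is my own, and the choice of this measure is only one option. It equals the total variation distance when both rows sum to 1, which holds only approximately here.

```python
def half_l1_distance(p, q):
    """Half the sum of absolute differences between two label distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of labels")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Rows copied from the table above, in the order label-0 .. label-3.
# Note: the ground-truth prediction row sums to about 0.95, not 1.
labelled = [0.4988, 0.2231, 0.1565, 0.1213]
pred_on_ground_truth = [0.7102, 0.0055, 0.0275, 0.2071]
pred_on_sft = [0.5363, 0.1591, 0.0944, 0.2100]

dist_ground_truth = half_l1_distance(pred_on_ground_truth, labelled)
dist_sft = half_l1_distance(pred_on_sft, labelled)

print(f"ground truth vs. labels:   {dist_ground_truth:.4f}")
print(f"SFT generation vs. labels: {dist_sft:.4f}")
```

By this measure the predictions on the ground-truth utterances are further from the labelled distribution than the predictions on the SFT generations.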
