VITA-Group / LLaGA

[ICML2024] "LLaGA: Large Language and Graph Assistant", Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, Zhangyang Wang
Apache License 2.0

Question about metric choice #4

Closed WEIYanbin1999 closed 6 months ago

WEIYanbin1999 commented 7 months ago

Dear Authors, for link prediction, what was the consideration behind choosing accuracy rather than AUC, which is common in the link prediction setting, or ranking metrics such as Hits@K and MRR, which are also used more often than accuracy? Thanks

ChenRunjin commented 6 months ago

Hi, sorry for the late reply; I've been quite busy over the past few weeks.

That is a good question. Hits@K and AUC are indeed commonly used in link prediction tasks. Nonetheless, classification metrics such as F1 score, precision, and accuracy are also widely adopted in the link prediction evaluation literature. We opted for accuracy for convenience, since our main aim is to evaluate the robustness and flexibility of our alignment between the graph and token spaces.

To evaluate LLaGA with these ranking metrics, you can combine LLaGA with any LLM-based ranking technique, e.g. [1]. I have just implemented and pushed an evaluation script using the simplest pointwise ranking approach: the model is asked to output a straightforward "yes" or "no" response, and all samples are then ranked by the "yes"/"no" logit values of the first generated token. You can use eval/eval_pretrain_logit.py instead of eval/eval_pretrain.py to run evaluation on the link prediction task, and then change the --task argument of eval/eval_res.py from 'lp' to 'lprank' to get the results.
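
For anyone curious about what the pointwise scoring boils down to, here is a minimal sketch (the function names and the single-token "yes"/"no" assumption are illustrative, not the exact API of eval/eval_pretrain_logit.py):

```python
import torch

def score_candidate(model, tokenizer, input_ids):
    """Score one candidate edge by the margin between the "yes" and "no"
    logits at the first generated position (illustrative, not the repo API)."""
    with torch.no_grad():
        out = model(input_ids=input_ids)
    next_token_logits = out.logits[0, -1]              # logits that predict the first generated token
    yes_id = tokenizer.convert_tokens_to_ids("yes")    # assumes "yes"/"no" map to single tokens
    no_id = tokenizer.convert_tokens_to_ids("no")
    return (next_token_logits[yes_id] - next_token_logits[no_id]).item()

def hits_at_k(scores, labels, k=100):
    """Fraction of positive edges scored above the k-th highest negative edge (OGB-style Hits@K)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if len(neg) < k:
        return 1.0
    threshold = neg[k - 1]
    return sum(s > threshold for s in pos) / max(len(pos), 1)

# AUC can be computed from the same scores, e.g. with
# sklearn.metrics.roc_auc_score(labels, scores).
```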

I did a quick test on Arxiv; the results are below. LLaGA still shows strong performance under these ranking metrics.

| Model | AUC | Hits@100 |
|---|---|---|
| GCN | 97.41 | 17.31 |
| GraphSage | 96.95 | 19.81 |
| LLaGA-ND | 97.56 | 37.65 |
| LLaGA-HO | 98.51 | 45.00 |

[1] A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models