deeplearning-wisc / haloscope

source code for NeurIPS'24 paper "HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection"

Reproduction Problem in TruthfulQA #1

Open Tan-Hexiang opened 3 weeks ago

Tan-Hexiang commented 3 weeks ago

When I try to reproduce the results following the instructions in the README, I get the following result on TruthfulQA for Llama-2-7b. The AUROC is 60.36, which is far from the 78.64 in Table 1. The full output is as follows:

direct-projection
  FPR95 AUROC AUPR
& 94.92 & 60.36 & 51.37
thres:  0.02564102564102564 best result:  [69.12137151763102] best_layer:  2
thres:  0.05128205128205128 best result:  [68.18624586012078] best_layer:  2
thres:  0.07692307692307693 best result:  [66.66666666666667] best_layer:  2
thres:  0.10256410256410256 best result:  [65.40035067212156] best_layer:  2
thres:  0.1282051282051282 best result:  [64.98149230469511] best_layer:  2
thres:  0.15384615384615385 best result:  [64.71848821352036] best_layer:  2
thres:  0.1794871794871795 best result:  [62.26378336255601] best_layer:  2
thres:  0.20512820512820512 best result:  [62.75082797584257] best_layer:  5
thres:  0.23076923076923075 best result:  [62.46834210013637] best_layer:  4
thres:  0.2564102564102564 best result:  [62.57549191505941] best_layer:  5
thres:  0.28205128205128205 best result:  [62.02026105591272] best_layer:  7
thres:  0.3076923076923077 best result:  [59.40970192869667] best_layer:  9
thres:  0.3333333333333333 best result:  [59.65322423533996] best_layer:  22
thres:  0.358974358974359 best result:  [58.766803039158376] best_layer:  0
thres:  0.3846153846153846 best result:  [58.9226573154101] best_layer:  0
thres:  0.41025641025641024 best result:  [62.21020845509449] best_layer:  29
thres:  0.4358974358974359 best result:  [63.091759205143184] best_layer:  28
thres:  0.4615384615384615 best result:  [63.559322033898304] best_layer:  28
thres:  0.48717948717948717 best result:  [62.44398986947205] best_layer:  30
thres:  0.5128205128205128 best result:  [63.491135788038186] best_layer:  29
thres:  0.5384615384615384 best result:  [62.14689265536724] best_layer:  28
thres:  0.5641025641025641 best result:  [61.659848042080654] best_layer:  29
thres:  0.5897435897435898 best result:  [61.96181570231833] best_layer:  28
thres:  0.6153846153846154 best result:  [61.12409896746542] best_layer:  28
thres:  0.641025641025641 best result:  [60.792908630430546] best_layer:  0
thres:  0.6666666666666666 best result:  [60.86109487629068] best_layer:  30
thres:  0.6923076923076923 best result:  [60.19871420222091] best_layer:  29
thres:  0.717948717948718 best result:  [60.2863822326125] best_layer:  30
thres:  0.7435897435897436 best result:  [60.66627703097603] best_layer:  29
thres:  0.7692307692307692 best result:  [63.14046366647186] best_layer:  30
thres:  0.7948717948717948 best result:  [61.29943502824858] best_layer:  27
thres:  0.8205128205128205 best result:  [61.36762127410871] best_layer:  29
thres:  0.8461538461538461 best result:  [61.133839859731154] best_layer:  30
thres:  0.8717948717948718 best result:  [60.62731346191311] best_layer:  23
thres:  0.8974358974358974 best result:  [61.51373465809469] best_layer:  27
thres:  0.923076923076923 best result:  [61.971556594584065] best_layer:  31
thres:  0.9487179487179487 best result:  [62.80927332943698] best_layer:  30
thres:  0.9743589743589743 best result:  [60.58834989285019] best_layer:  17

The command I used:

python3 hal_det_llama.py --dataset_name tqa --model_name llama2_chat_7B --most_likely 1 --num_gene 1 --gene 1
python3 hal_det_llama.py --dataset_name tqa --model_name llama2_chat_7B --most_likely 1 --use_rouge 0 --generate_gt 1
python3 hal_det_llama.py --dataset_name tqa --model_name llama2_chat_7B --most_likely 0 --num_gene 10 --gene 1
python3 hal_det_llama.py --dataset_name tqa --model_name llama2_chat_7B --most_likely 0 --use_rouge 0 --generate_gt 1
CUDA_VISIBLE_DEVICES=1 python3 hal_det_llama.py --dataset_name tqa --model_name llama2_chat_7B --use_rouge 0 --most_likely 1 --weighted_svd 1 --feat_loc_svd 3

Could you please help me reproduce the result in Table 1? @d12306

QingyangZhang commented 3 weeks ago

Thanks for your impressive work and the effort to make the repo public. When I tried to reproduce the results, I got the same results as @Tan-Hexiang. According to my understanding, HaloScope needs to tune the hyper-parameters (i.e., thres_wild and layer, line 606 in hal_det_llama.py) on the test dataset. Thus the AUROC should be 69.12, which is still far from the 78.64 in Table 1. Besides, tuning on the test dataset seems to involve information leakage from the test set. I am not sure whether my understanding is wrong.
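
For concreteness, here is a minimal sketch (not the repo's actual code; the array names scores/labels, the split size, and the function name are made up) of what picking the layer/threshold on a held-out validation split instead of the test set could look like:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_layer_on_validation(scores, labels, n_val=100, seed=0):
    """Pick the best layer on a held-out validation split, then report test AUROC.

    scores: array of shape (num_layers, num_samples), hallucination scores per layer.
    labels: array of shape (num_samples,), 1 = truthful, 0 = hallucinated.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    val_idx, test_idx = idx[:n_val], idx[n_val:]

    # choose the layer that maximizes AUROC on the validation split only
    val_aurocs = [roc_auc_score(labels[val_idx], layer_scores[val_idx])
                  for layer_scores in scores]
    best_layer = int(np.argmax(val_aurocs))

    # evaluate that single choice once on the remaining (test) samples
    test_auroc = roc_auc_score(labels[test_idx], scores[best_layer][test_idx])
    return best_layer, test_auroc
```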

Tan-Hexiang commented 3 weeks ago

@QingyangZhang In my understanding, treating thres_wild as a hyperparameter to tune is unreasonable. thres_wild relates to the judgment of response correctness. From my perspective, we should choose a thres_wild that has a high agreement ratio with human correctness judgments. Setting thres_wild to a very high value (e.g., 0.999) can allow a classifier that always outputs 0 to achieve good performance.
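
To make the agreement check concrete, a minimal sketch (the names bleurt_scores and human_correct are hypothetical, not variables from the repo):

```python
import numpy as np

def agreement_ratio(bleurt_scores, human_correct, thres_wild):
    """Fraction of samples where the threshold-based correctness judgment
    matches the human judgment.

    bleurt_scores: similarity between a generation and the reference answer.
    human_correct: 1 if a human judges the generation correct, else 0.
    """
    judged_correct = (np.asarray(bleurt_scores) > thres_wild).astype(int)
    return float((judged_correct == np.asarray(human_correct)).mean())

# Caveat from the point above: if most generations are wrong, an extreme
# threshold that judges everything incorrect can still score a high raw
# agreement, so the ratio should be checked on correct and incorrect
# samples separately (or class-balanced) before picking thres_wild.
```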

@d12306 Thank you very much for openly sharing your code. Is the result of 78.64 obtained using the same parameters as in the README, or do other parameters need to be adjusted, such as weighted_svd and feat_loc_svd? This would be very helpful to me.

d12306 commented 3 weeks ago

Hi @QingyangZhang @Tan-Hexiang, I reproduced it with my code, and the result is quite similar to the results in the paper.

[screenshot: reproduced results]

Could you try different seeds? Maybe also set the maximum search range for k to 9; I am not sure why you are getting a much lower number. I would also suggest debugging whether using the ground-truth labels on the unlabeled data can give a high AUROC, and then trying different seeds and different datasets. Let me know if these do not work by following up in this thread. (Sorry, I might be slow to respond since I have been very busy recently.)
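
If it helps with debugging, a minimal sketch of the seed control and the oracle-label sanity check suggested above (set_seed is a hypothetical helper, not a function from the repo):

```python
import random
import numpy as np
import torch

def set_seed(seed: int):
    """Fix the RNGs that affect sampling and model runs so a run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Suggested debugging order:
# 1. set_seed(0), rerun the membership estimation, and note the AUROC.
# 2. As an upper-bound check, replace the pseudo-labels of the unlabeled wild data
#    with their ground-truth labels and retrain the truthfulness classifier; if even
#    this oracle setup gives a low AUROC, the problem is upstream of the threshold search.
# 3. Repeat over a few seeds and datasets to see how much the numbers move.
```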

In terms of using the test set, it is to show the trend on the test set across all the thresholds; you can definitely choose the threshold on a validation set.

d12306 commented 3 weeks ago

Hi @QingyangZhang @Tan-Hexiang, I found that if I set the seed to 41, the AUROC can be higher than what I reported.

  FPR95 AUROC AUPR
& 96.06 & 74.31 & 66.52
thres: 0.02564102564102564 best result: [74.70724813244499] best_layer: 1
thres: 0.05128205128205128 best result: [73.7330910559257] best_layer: 16
thres: 0.07692307692307693 best result: [74.82333939026852] best_layer: 14
thres: 0.10256410256410256 best result: [75.78235412881081] best_layer: 3
thres: 0.1282051282051282 best result: [76.0347264284272] best_layer: 3
thres: 0.15384615384615385 best result: [76.48899656773672] best_layer: 2
thres: 0.1794871794871795 best result: [76.19624470018171] best_layer: 2
thres: 0.20512820512820512 best result: [75.33817888148596] best_layer: 2
thres: 0.23076923076923075 best result: [75.8328285887341] best_layer: 5
thres: 0.2564102564102564 best result: [76.38804764789016] best_layer: 5
thres: 0.28205128205128205 best result: [77.59943468604887] best_layer: 5
thres: 0.3076923076923077 best result: [78.07389460932768] best_layer: 4
thres: 0.3333333333333333 best result: [78.60892388451445] best_layer: 4
thres: 0.358974358974359 best result: [78.69977791237633] best_layer: 4
thres: 0.3846153846153846 best result: [79.29537653947104] best_layer: 4
thres: 0.41025641025641024 best result: [80.10296789824349] best_layer: 4
thres: 0.4358974358974359 best result: [80.17363214213609] best_layer: 4
thres: 0.4615384615384615 best result: [80.44619422572178] best_layer: 4
thres: 0.48717948717948717 best result: [80.78942055320009] best_layer: 4
thres: 0.5128205128205128 best result: [80.7187563093075] best_layer: 4
thres: 0.5384615384615384 best result: [80.59761760549162] best_layer: 4
thres: 0.5641025641025641 best result: [80.96103371693923] best_layer: 4
thres: 0.5897435897435898 best result: [80.88027458106197] best_layer: 4
thres: 0.6153846153846154 best result: [80.58752271350697] best_layer: 4
thres: 0.641025641025641 best result: [80.7389460932768] best_layer: 4
thres: 0.6666666666666666 best result: [80.76923076923077] best_layer: 4
thres: 0.6923076923076923 best result: [81.66767615586514] best_layer: 20
thres: 0.717948717948718 best result: [81.0115081768625] best_layer: 28
thres: 0.7435897435897436 best result: [80.50676357762971] best_layer: 2
thres: 0.7692307692307692 best result: [80.51685846961438] best_layer: 2
thres: 0.7948717948717948 best result: [79.97173430244295] best_layer: 29
thres: 0.8205128205128205 best result: [80.12315768221279] best_layer: 3
thres: 0.8461538461538461 best result: [79.42661013527155] best_layer: 3
thres: 0.8717948717948718 best result: [78.85120129214617] best_layer: 29

Tan-Hexiang commented 3 weeks ago

@d12306 Thank you very much for your reply; I truly appreciate the time and effort you took to respond. I have just tried seed 41 and obtained a result on TruthfulQA that is similar to the one mentioned in the paper.

However, I still have several other questions. If you could assist me in resolving some of these, I would be extremely grateful.

QingyangZhang commented 3 weeks ago

Hi, all. I reproduced the results on TruthfulQA and TriviaQA and got results similar to those reported in the paper. Thanks a lot for the timely reply from @d12306. @Tan-Hexiang I guess the performance of HaloScope could be highly influenced by the choice of layers to be probed. In the original implementation, only 100 samples are used for validation, so the performance is unstable on TruthfulQA (a relatively small dataset with 800+ samples). On TriviaQA, the performance I got was also much worse than that reported in the paper at first. However, after using 200 samples for validation, I get much better results. @d12306 Thanks again for your reply. This is a really awesome and impressive work to me. : )
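
In terms of the validation-split sketch earlier in this thread, this change amounts to enlarging the hypothetical n_val parameter, e.g.:

```python
# with ~800 TruthfulQA samples, a 100-sample validation split is noisy;
# doubling it trades a little test data for a more stable layer/threshold choice
best_layer, test_auroc = select_layer_on_validation(scores, labels, n_val=200)
```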

Tan-Hexiang commented 3 weeks ago


Thank you very much for the reminder. This is likely the reason, and I will try increasing the number of validation samples. I am very happy to discuss and reproduce this interesting work with you.