Open Churrimorey opened 1 month ago
You are right. The released HEx-PHI dataset is slightly different from the one we are using. We do not have permission to share the original dataset, but we can describe its format. I hope this helps.
The data for each category is stored in one jsonl file, named `finetuned_5_epoch_100_shot_1_eval_<category_name>_gpt_4_judge.jsonl`.
Each row in the jsonl file is a JSON string that can be loaded with `json.loads`:
{"system": "", "user": "", "model": "", "duo_score": 5, "duo_reason": ""}
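For anyone checking their own files against this format, here is a minimal sketch of a loader. It assumes only what is described above: one JSON object per line with the five keys shown. The helper names `judge_file_name` and `load_judge_rows` are hypothetical and are not part of the repository.

```python
import json

# Field names taken from the example row quoted in this thread.
EXPECTED_KEYS = {"system", "user", "model", "duo_score", "duo_reason"}


def judge_file_name(category_name):
    """Build the per-category file name following the pattern quoted above."""
    return f"finetuned_5_epoch_100_shot_1_eval_{category_name}_gpt_4_judge.jsonl"


def load_judge_rows(path):
    """Read one JSON object per non-empty line and check that its keys are present."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            row = json.loads(line)
            missing = EXPECTED_KEYS - set(row)
            if missing:
                raise ValueError(f"{path}:{line_no} is missing keys: {sorted(missing)}")
            rows.append(row)
    return rows
```

Calling `load_judge_rows(judge_file_name("<category_name>"))` with a real category name substituted in should fail early, with a line number, if a file does not match the layout described here.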
The issue can be solved by changing the file names.
P.S. I just noticed that they have removed CSAM from their dataset.
When I ran `data_gen.py`, I got an error. I guess there may be an additional preprocessing step applied to the original HEx-PHI dataset. Can you share the processed dataset or the preprocessing code?