Open awasthiabhijeet opened 3 weeks ago
Thank you for your attention and for pointing out these issues. We are in the process of optimising the whole evaluation process, and your suggestions are very useful. Can you try running the following SQL command to see the status of your solutions in the database, and check whether any results are not 0? I can't tell what the problem is without specific status information. It seems that someone else has encountered this problem before, but I haven't found out why, so I look forward to your reply. Thank you again.
SELECT s.solution_id, s.problem_id, s.result
FROM solution s
JOIN problem p ON s.problem_id = p.problem_id
WHERE s.model_id = 'your_model_id_here' AND s.result != 0;
SELECT s.solution_id, s.problem_id, s.result
FROM solution s
JOIN problem p ON s.problem_id = p.problem_id
WHERE s.model_id = 60 AND s.result!=0;
Here model_id should be an int, not a string; the placeholder was only meant to show that it needs to be replaced. Sorry for the misunderstanding!
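A count per result code can be easier to read than one row per solution. The following is a minimal sketch, not the project's own tooling: it uses an in-memory SQLite database purely so the example is self-contained (the judge's real database is likely a different engine), and the table layout is an assumption based only on the columns named in the query above. The helper names count_results and make_demo_db are hypothetical.

```python
import sqlite3


def count_results(conn, model_id):
    """Return {result_code: number_of_solutions} for one model_id.

    model_id is passed as an int parameter, not as a quoted string.
    """
    rows = conn.execute(
        "SELECT result, COUNT(*) FROM solution "
        "WHERE model_id = ? GROUP BY result",
        (model_id,),
    ).fetchall()
    return dict(rows)


def make_demo_db():
    """Build a throwaway database with an assumed `solution` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE solution ("
        "solution_id INTEGER PRIMARY KEY, "
        "problem_id INTEGER, model_id INTEGER, result INTEGER)"
    )
    conn.executemany(
        "INSERT INTO solution (problem_id, model_id, result) VALUES (?, ?, ?)",
        [(1, 60, 2), (2, 60, 2), (3, 60, 0), (1, 61, 0)],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    print(count_results(demo, 60))
```

If every solution for a model shares one non-zero code, the returned dictionary has a single key, which makes a systematic failure easy to spot.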
Thanks @mkj3085003.
I ran the following SQL. My model_id is 60 (int). All the results seem to be 2.
SELECT s.solution_id, s.problem_id, s.result
FROM solution s
JOIN problem p ON s.problem_id = p.problem_id
WHERE s.model_id = 60 AND s.result!=0;
Ok, I will check it out and get back to you as soon as possible, thanks!
Sorry, I found it was due to missing input data, probably because the data folder failed to upload as a large file using LFS and I forgot about it. I will sort it out and upload it.
Hi @mkj3085003: do you mean copying the hf dataset inside the data/ folder?
We already did that before running evaluation scripts.
No, it's the processed inputs and outputs, which are not quite the same as the Hugging Face dataset; they are the result of processing it for the OJ. I'll upload it now, sorry for the delay!
I've uploaded data.tar.gz using git lfs. You can download it and unzip it to judge/data, then create a new log folder under judge (judge/log). I've updated run_judge.sh to add this mkdir command, so you can also pull the new run_judge.sh. When all the judges are finished, you can run bash stop.sh (it will stop the judging and clean up the run folder, client.pid, and other files temporarily generated by the judge).
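The setup described above can be sketched in Python as follows. This is an illustration under assumptions, not the repository's script: it assumes the archive's contents belong directly under judge/data (if the archive already contains a top-level data/ folder, extract into judge/ instead), and prepare_judge_dirs is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def prepare_judge_dirs(archive, judge_dir):
    """Extract `archive` into <judge_dir>/data and create <judge_dir>/log.

    Assumption: the archive contents go directly under the data folder.
    Returns the (data_dir, log_dir) paths.
    """
    judge_dir = Path(judge_dir)
    data_dir = judge_dir / "data"
    log_dir = judge_dir / "log"
    # exist_ok makes the call safe to repeat on an existing checkout.
    data_dir.mkdir(parents=True, exist_ok=True)
    log_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(data_dir)
    return data_dir, log_dir
```

For example, prepare_judge_dirs("data.tar.gz", "judge") would be run from the directory that contains both the downloaded archive and the judge folder.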
After the judging process, you can skip step 5 and calculate the metrics first, as computing the Polish metric may take a long time, possibly up to a week. However, subsequent runs won't require recalculating these limits. The scores for "code debug", "code translate", and "code switch" can be computed directly. Please check whether you can run the judge correctly now; I look forward to your reply.
Hi @mkj3085003, I'm getting the following error: Error downloading object: evaluation/judge/data.tar.gz (8ad3efd): Smudge error: Error downloading evaluation/judge/data.tar.gz (8ad3efd4e2a7f1a968a39be20d2c5d90a9b1fe528c4fb130a981b2ec8e3f5235): batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access. Is there any other way to access the data? Thanks!
Okay, I will upload the data to Google Drive and share the link, please wait a moment!
Thank you for open-sourcing this evaluation benchmark.
I am trying to replicate the inference and evaluation steps as suggested in the repository. Inference scripts work well for me.
However, I am facing the following problems with the evaluation.
I had to copy the greedy_result folder obtained during inference to evaluation/judge/solution_folder, because I initially got the following error.
Next, I get the following error. It seems vllm_inference.py adds an additional header line in the jsonl file that needs to be removed.
After removing the header lines from the jsonl files, I then get the following error.
Fixing the above error required modifying line 644 in add_template.py (code1=inp["code"][0]).
After this change, python3 add_template.py worked for me. python3 submit_solution.py also works without any warning/error.
Then, I ran the judge: nohup bash run_judge.sh > runlog.out 2>&1. However, runlog.out remains empty even after an hour of running the script.
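The header-line workaround mentioned in the steps above could be scripted roughly as follows. This is a sketch under assumptions: it supposes the extra header is exactly the first line of each jsonl file, which the report implies but does not show, and strip_first_line is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def strip_first_line(src, dst):
    """Copy `src` to `dst` without its first line.

    Assumption: the unwanted header is exactly the first line.
    Returns the dropped line, or None if the file is empty.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines(keepends=True)
    if not lines:
        Path(dst).write_text("", encoding="utf-8")
        return None
    header, records = lines[0], lines[1:]
    for line in records:
        if line.strip():
            # Fail early if a remaining line is not valid JSON.
            json.loads(line)
    Path(dst).write_text("".join(records), encoding="utf-8")
    return header.rstrip("\r\n")
```

Writing to a separate destination file keeps the original inference output intact, so the dropped line can be inspected before anything is overwritten.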
Here are the outputs of some of the SQL queries I ran
ps -ef | grep judge gives the following output.
Overall, I think evaluation is not currently working out of the box. It would be very helpful if the evaluation process ran without errors or additional modifications.
Regards, Abhijeet