CodeEditorBench / CodeEditorBench

Issues on the evaluation side (add_template.py) #2

Open TusharAggarwalMSR opened 1 month ago

TusharAggarwalMSR commented 1 month ago
  1. model_name should be replaced by model on line 621.
  2. code1=inp["code"] should be replaced by code1=inp["code"][0] on line 644, since the code is saved as a list inside the jsonl file (a short illustration follows below).
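
To make the second point concrete, here is a minimal illustration (not taken from the repository) of why the [0] index is needed; the jsonl record shape and any field names other than "code" are assumptions based on the description above.

    import json

    # Hypothetical jsonl record, inferred from the report: the generated code
    # is stored as a single-element list under the "code" key.
    line = '{"problem_id": 1, "code": ["def solve():\\n    return 42"]}'
    inp = json.loads(line)

    # Before: code1 = inp["code"]     -> a list, not a string
    # After:  code1 = inp["code"][0]  -> the code string itself
    code1 = inp["code"][0]
    print(type(code1).__name__)  # str
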
mkj3085003 commented 1 month ago

I've addressed the small issue in the add_template.py script at line 711 and made the necessary updates. Please feel free to try out the newly updated code. Thank you for bringing this to our attention, and don't hesitate to let us know if you notice any new issues during runtime. Your feedback is greatly appreciated!

AggarwalTushar commented 1 month ago

@mkj3085003 I couldn't see the suggested changes in the newly updated code; I'm still facing these issues.

mkj3085003 commented 1 month ago

Could you please share a screenshot showing the detailed error message?

mkj3085003 commented 1 month ago

I've verified the evaluation process from scratch, and the add_template script executes correctly; the results shown in the image below confirm this. Could you pull the image and code from scratch and try again? Or share more detail on the issue and I can help you out.

AggarwalTushar commented 1 month ago

As you suggested, I am able to run the add_template file after recreating the image. I was trying to run the evaluation for code_debug and followed the steps in the readme file, but the count of unsolved solutions is not decreasing over time; I have waited a few hours. I am attaching a screenshot for your reference.

[Screenshot: compute_metrics.py open in Visual Studio Code over SSH]

mkj3085003 commented 1 month ago
  1. First, check whether the solutions for this model are recorded with result = 0 (i.e., still pending):

    SELECT s.solution_id, s.problem_id, s.result 
    FROM solution s 
    JOIN problem p ON s.problem_id = p.problem_id 
    WHERE s.model_id = "your_model_id_here" AND s.result=0;
  2. Then check whether the judge process is running:

    ps aux | grep judged

    If it is not running, start the judge process with:

     nohup bash run_judge.sh > runlog.out 2>&1 &

  3. After starting it, wait a few minutes and check the results again:

    SELECT s.solution_id, s.problem_id, s.result 
    FROM solution s 
    JOIN problem p ON s.problem_id = p.problem_id 
    WHERE s.model_id = "your_model_id_here" AND s.result!=0;

    For example, if some results are now non-zero, the solutions are being judged.

    You can also refer to the database schema and the example SQL commands in the readme to design your own queries.

    You can try this and see whether it works correctly. If you only evaluate code_debug, you don't need to recalculate the polish time limit (that step is only needed for code_polish and takes a long time to compute); once judging finishes you can calculate the metrics directly. A small polling sketch follows this list.
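
For convenience, here is a minimal polling sketch (not part of the repository) that repeats the checks above until no solutions with result = 0 remain. It assumes a MySQL-compatible database reachable with the pymysql package; the connection details, database name, and model_id are placeholders, and the join with the problem table is omitted since only solution columns are needed.

    import time
    import pymysql

    MODEL_ID = "your_model_id_here"  # placeholder, as in the queries above

    # Placeholder connection details; adjust them to your own database setup.
    # autocommit=True so each SELECT sees the judge's latest writes.
    conn = pymysql.connect(host="127.0.0.1", user="user", password="password",
                           database="codeeditorbench", autocommit=True)

    # Count pending (result = 0) and judged (result != 0) solutions for one model.
    query = ("SELECT SUM(result = 0), SUM(result != 0) "
             "FROM solution WHERE model_id = %s")

    try:
        while True:
            with conn.cursor() as cur:
                cur.execute(query, (MODEL_ID,))
                pending, judged = cur.fetchone()
            print(f"pending: {pending}, judged: {judged}")
            if not pending:      # nothing left with result = 0; judging is done
                break
            time.sleep(60)       # poll once a minute
    finally:
        conn.close()
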

mkj3085003 commented 3 weeks ago

I would like to confirm whether your issue has been resolved. We are currently considering optimizing the entire evaluation process, and your suggestions would be very helpful to us. Looking forward to hearing from you.

AggarwalTushar commented 3 weeks ago

> I would like to confirm whether your issue has been resolved. We are currently considering optimizing the entire evaluation process, and your suggestions would be very helpful to us. Looking forward to hearing from you.

I am facing the same issues as in #6.