AlexShypula opened 2 months ago
In the evaluation corresponding to the above figure, I set the number of workers to 2. I am confused: even though I increased the number of workers from 40 to 100, the overall gem5 execution time did not decrease. From the log, it seems that each test was executed the same number of times as the number of workers minus 2, which may be why increasing the number of workers did not improve efficiency. I am also confused as to why the number of CPUs used should be 2 less than the number of workers.
gem5 is a CPU-bound task, not I/O-bound, so increasing the number of workers above the number of physical or logical CPUs on your machine will likely not improve performance. I'm not sure if that's what's going on, but you may want to check: if you run `htop`, see whether the server is already at 100% utilization with 40 workers. The `use_logical_cpus` argument will also manually set the upper limit of CPUs to the number of logical CPUs on the server minus 2; here is the logic for that. The initial design choice was to prevent setting the number of workers to a very high number like 200+, which would slow down gem5 execution and could substantially increase the number of timeouts, distorting the experimental results.
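A minimal sketch of that capping logic (the function name and signature are illustrative, not the repository's actual code):

```python
import os

def effective_workers(requested: int) -> int:
    """Illustrative cap: never use more than (logical CPUs - 2) workers,
    leaving headroom for the parent process and other server tasks."""
    cpu_cap = max(1, (os.cpu_count() or 1) - 2)
    return min(requested, cpu_cap)
```

Under this scheme, requesting 100 workers on a 40-CPU server still yields only 38 gem5 processes, which matches the behavior described above.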
One issue I didn't realize is that these arguments are hard-coded; they should be passed in via the config, and the script should be modified to reflect that.
Also: taking a close look at this log, the temporary directory IDs are different, so it seems the runs are not from the same binary. When evaluating, there are likely many generations for each src program; in our paper this was usually 8. So for each src program, `input.i.txt` will be executed 8 times, for all `i` corresponding to the number of input test cases. The fact that the two programs/generations seem completely in sync may be because the 2 programs here have nearly identical execution characteristics.
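As a back-of-the-envelope illustration of how that multiplies into total gem5 invocations (the function and the example numbers are hypothetical; 8 generations per src program is the paper's usual setting):

```python
def total_gem5_runs(num_src_programs: int,
                    generations_per_src: int,
                    num_test_cases: int) -> int:
    """Each generation of each src program runs once per input test
    case, so the total simulation count multiplies quickly."""
    return num_src_programs * generations_per_src * num_test_cases

# e.g. 1000 src programs x 8 generations x 10 test cases = 80000 gem5 runs
```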
The reason we also subtract 2 is to leave some headroom on the server for other tasks, like the parent process or other processes running on the server itself, e.g. VSCode.
Thanks for your reply. I found that the reason there are fewer gem5.opt processes in the test is that there are fewer input codes, and the number of correct codes that can be verified is insufficient. When I set the `num_problems_to_evaluate` parameter in the yaml file to -1, the number of gem5.opt processes matches the number of CPUs mentioned above.
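For reference, the setting discussed would look like this in the yaml file (only the key name comes from this thread; its placement in the file is an assumption):

```yaml
# -1 = evaluate all problems rather than a truncated subset
num_problems_to_evaluate: -1
```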
Given the interest in trying to begin new experiments, it would be helpful to have a faster version of PIE for evaluation. Running all test cases for all programs in the test set with a 120-second per-test-case timeout can take many days, if not longer, to finish.
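One common pattern for keeping the workers saturated while enforcing the per-test-case timeout, sketched under the assumption that each test case can be launched as an independent subprocess (the commands here are placeholders, not PIE's actual gem5 invocation):

```python
import concurrent.futures as cf
import subprocess

def run_one(cmd: list[str], timeout_s: float = 120.0) -> str:
    """Run one simulation command, treating a hang as a timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        return "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        return "timeout"

def run_all(commands: list[list[str]], workers: int) -> list[str]:
    """Fan commands out across a bounded pool so a slow or timed-out
    test case only stalls one worker, not the whole batch."""
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

Threads suffice here because each worker just blocks on a subprocess; the heavy CPU work happens in the child gem5 processes, so the pool size should still be capped near the CPU count as discussed above.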
An ideal fix would be something like the following