[logging] The program is not responding during multi-gpu evaluation.
The program froze and did not respond during multi-gpu evaluation.
After checking the code, I noticed that multi-gpu evaluation creates a "rank0_metric_eval_done.txt" file under the result output-dir (introduced in [New Updates] LLaVA OneVision Release; MVBench, InternVL2, IXC2.5 Interleave-Bench integration. [#182]). However, each process creates the logging dir independently, and process 0 is often slower, so two folders can be created:
The logging dir name is timestamped down to the minute, and when the creations fall into two different minutes the "txt" files are saved into different dirs. Rank 0 then enters a while loop waiting for the "txt" files from all GPUs to appear in its own dir, which never happens, so the loop never exits and the program hangs.
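A minimal sketch of the failure mode (illustrative names only, not the actual lmms-eval code): each rank builds its own timestamped log dir, so if rank 0 crosses a minute boundary later than the other ranks, its polling loop watches a directory the other ranks never write into.

```python
import os
import time
from datetime import datetime

def make_log_dir(base="./logs"):
    # Minute-level timestamp: ranks started in different minutes
    # resolve different directories.
    stamp = datetime.now().strftime("%Y%m%d_%H%M")
    path = os.path.join(base, stamp)
    os.makedirs(path, exist_ok=True)
    return path

def worker(rank, world_size):
    log_dir = make_log_dir()  # may differ across ranks
    done_file = os.path.join(log_dir, f"rank{rank}_metric_eval_done.txt")
    open(done_file, "w").close()  # signal "this rank is done"

    if rank == 0:
        # Rank 0 waits for every rank's marker in *its own* log_dir.
        # If the other ranks wrote to a different timestamped dir,
        # this condition is never satisfied -> the observed hang.
        while len([f for f in os.listdir(log_dir)
                   if f.endswith("_metric_eval_done.txt")]) < world_size:
            time.sleep(1)
```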
FIX: My approach is simple. I just removed the minutes from the timestamp used when creating the logging dir:
(lmms-eval/lmms_eval/utils.py)
There may be more appropriate ways to fix this.
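A sketch of the change, assuming the dir name comes from a strftime call like the one below (adjust to the actual format string in your checkout of lmms_eval/utils.py):

```python
from datetime import datetime

# before (minute-level stamp -> ranks can disagree):
# stamp = datetime.now().strftime("%Y%m%d_%H%M")

# after (coarser stamp, shared by all ranks launched in the same hour):
stamp = datetime.now().strftime("%Y%m%d_%H")
```

A more robust alternative might be to have rank 0 create the dir and broadcast its name to the other ranks (e.g. via torch.distributed.broadcast_object_list), or to synchronize the processes before the timestamp is taken, so all ranks always agree on a single output dir.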