Open orrzohar opened 2 months ago
Multi-node evaluation is possible, but we are still experimenting and hope to support it soon.
@choiszt has already run a trial experiment and may explain it in detail.
Hi Orr,
Yes, multi-node evaluation is possible, but there is a small modification needed to ensure the correct rank is used when working with multi-node setups.
To enable proper multi-node evaluation, you should update the `_rank` assignment in the `__init__` function of the lmms-eval model, e.g. in `lmms_eval/models/llava_onevision.py`. Currently, the code uses `self._rank = self.accelerator.local_process_index`, which works fine for single-node setups but does not properly handle multi-node cases.
You should change it to `self._rank = self.accelerator.process_index`
and use appropriate SLURM/torchrun commands to launch processes across multiple nodes.
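For reference, a minimal sketch of where the change goes (the surrounding `__init__` code here is illustrative, not the exact lmms-eval source; only the `_rank` line is the actual suggestion):

```python
# Simplified sketch of the rank setup inside a lmms-eval model's __init__
# (e.g. lmms_eval/models/llava_onevision.py).
from accelerate import Accelerator

class LlavaOneVisionSketch:  # stand-in for the real model class
    def __init__(self, **kwargs):
        self.accelerator = Accelerator()
        # Before: local_process_index restarts at 0 on every node, so ranks
        # collide across nodes.
        # self._rank = self.accelerator.local_process_index
        # After: process_index is the global rank across all nodes.
        self._rank = self.accelerator.process_index
```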
Let me know if you have any questions or need further assistance!
Best,
Hi @choiszt, Thank you for your help. I was able to run multi-node evaluation with your proposed changes. Best, Orr
@choiszt One thing I did notice: sometimes the multi-node run hangs, even though I can see that the evaluation completed and printed all the eval results. Any idea why? Best, Orr
Hi all, Updating this: the cause was that different nodes sometimes get a different log suffix (via the datetime): https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/e1f9adefcc4b92a39edb2450fbadf659a84364fb/lmms_eval/__main__.py#L467
This causes the files that report the job is finished to sit in different directories, ultimately breaking the end-of-run collection. I suggest you move the definition of this parameter under the `is_main_process=True` branch. To work around it, I added another command-line input that overwrites the datetime.
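Something along these lines (a rough sketch; the `--datetime_str` flag name is illustrative, not an existing lmms-eval option):

```python
# Sketch of overriding the auto-generated datetime suffix from the command line,
# so every node can be launched with the same value and write to the same log dir.
import argparse

from lmms_eval import utils  # get_datetime_str, as referenced in __main__.py

parser = argparse.ArgumentParser()
parser.add_argument("--timezone", default="UTC")  # illustrative default
parser.add_argument("--datetime_str", default=None,
                    help="If set, overrides the generated datetime suffix; pass the same value to every node.")
args = parser.parse_args()

# Fall back to the original behaviour when the override is not given.
datetime_str = args.datetime_str or utils.get_datetime_str(timezone=args.timezone)
```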
Hi Orr, Thank you for the update, and we appreciate your investigation into the issue!🚀
Just to clarify, it seems like the evaluation scores are being generated correctly, and the issue is with the storage of the reports due to the datetime handling, correct? Please let us know if there are any other aspects we should address.
Best, Shuai
Hi Shuai,
Yes. What happens is that when torchrun is launched near the 'minute' mark, two nodes call `utils.get_datetime_str(timezone=args.timezone)`, but one gets XXXX{rest of str} and the other gets XXX(X+1){rest of str}.
lmms-eval produces the `rank{}_metric_eval_done.txt` files, but when the above happens, some of them land in one directory and some in the other. This is the source of the hanging.
My suggestion is to call the datetime_str generator in `cli_evaluate` rather than in `cli_evaluate_single`. You can use the accelerator object to communicate the value obtained on the main process to the rest of the processes, and this should solve the issue.
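Roughly like this (a sketch, not the exact lmms-eval code; it assumes `get_datetime_str` is importable from `lmms_eval.utils` as in the linked `__main__.py`, and uses accelerate's `broadcast_object_list`):

```python
# Sketch: generate the datetime suffix once on the main process (in cli_evaluate,
# before cli_evaluate_single runs) and broadcast it to all ranks, so every
# rank{}_metric_eval_done.txt file lands in the same directory.
from accelerate import Accelerator
from accelerate.utils import broadcast_object_list

from lmms_eval import utils


def get_shared_datetime_str(timezone: str) -> str:
    accelerator = Accelerator()
    # Only the main process decides the suffix; every other rank receives it.
    payload = [utils.get_datetime_str(timezone=timezone) if accelerator.is_main_process else None]
    broadcast_object_list(payload, from_process=0)
    return payload[0]
```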
When I do this on my codebase, multi-node works just fine (with the changes you suggested earlier).
Best, Orr
Hi,
Is multi-node evaluation possible? If yes, how?
Best, Orr