Open orrzohar opened 2 months ago
Multi-node evaluation is possible, but we are still experimenting and hope to support it soon.
@choiszt has already run a trial experiment and may explain it in detail.
Hi Orr,
Yes, multi-node evaluation is possible, but there is a small modification needed to ensure the correct rank is used when working with multi-node setups.
To enable proper multi-node evaluation, you should update the `_rank` assignment in the `__init__` function of the lmms-eval model, e.g. in `lmms_eval/models/llava_onevision.py`. Currently, the code uses `self._rank = self.accelerator.local_process_index`, which works fine for single-node setups but does not properly handle multi-node cases.
You should change it to `self._rank = self.accelerator.process_index`
and use appropriate SLURM/torchrun commands to launch processes across multiple nodes.
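For reference, a minimal sketch of where the change goes (the surrounding `__init__` code here is illustrative, not the exact lmms-eval source; only the `_rank` line is the actual suggestion):

```python
# Simplified sketch of the rank setup inside a lmms-eval model's __init__
# (e.g. lmms_eval/models/llava_onevision.py).
from accelerate import Accelerator

class LlavaOneVisionSketch:  # stand-in for the real model class
    def __init__(self, **kwargs):
        self.accelerator = Accelerator()
        # Before: local_process_index restarts at 0 on every node, so ranks
        # collide across nodes.
        # self._rank = self.accelerator.local_process_index
        # After: process_index is the global rank across all nodes.
        self._rank = self.accelerator.process_index
```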
Let me know if you have any questions or need further assistance!
Best,
Hi @choiszt, Thank you for your help. I was able to run multi-node evaluation with your proposed changes. Best, Orr
@choiszt One thing I did notice: sometimes the multi-node run hangs, even though I can see that the evaluation completed and printed all the eval results. Any idea why? Best, Orr
Hi all, Updating this: the cause was that different nodes sometimes get a different log suffix (via the datetime): https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/e1f9adefcc4b92a39edb2450fbadf659a84364fb/lmms_eval/__main__.py#L467
This causes the files that report the job is finished to sit in different directories, ultimately breaking the end-of-run collection. I suggest you move the definition of this parameter under the `is_main_process=True` branch. To work around it, I added another command-line input that overwrites the datetime.
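Something along these lines (a rough sketch; the `--datetime_str` flag name is illustrative, not an existing lmms-eval option):

```python
# Sketch of overriding the auto-generated datetime suffix from the command line,
# so every node can be launched with the same value and write to the same log dir.
import argparse

from lmms_eval import utils  # get_datetime_str, as referenced in __main__.py

parser = argparse.ArgumentParser()
parser.add_argument("--timezone", default="UTC")  # illustrative default
parser.add_argument("--datetime_str", default=None,
                    help="If set, overrides the generated datetime suffix; pass the same value to every node.")
args = parser.parse_args()

# Fall back to the original behaviour when the override is not given.
datetime_str = args.datetime_str or utils.get_datetime_str(timezone=args.timezone)
```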
Hi Orr, Thank you for the update, and we appreciate your investigation into the issue!🚀
Just to clarify, it seems like the evaluation scores are being generated correctly, and the issue is with the storage of the reports due to the datetime handling, correct? Please let us know if there are any other aspects we should address.
Best, Shuai
Hi Shuai,
Yes. What happens is that when torchrun is launched near the 'minute' mark, two nodes call `utils.get_datetime_str(timezone=args.timezone)`, but one gets XXXX{rest of str} and the other gets XXX(X+1){rest of str}.
lmms-eval produces the `rank{}_metric_eval_done.txt` files, but when the above happens, some of them land in one directory and some in the other. This is the source of the hanging.
My suggestion is to call the datetime_str generator in `cli_evaluate` rather than in `cli_evaluate_single`. You can use the accelerator object to communicate the value obtained on the main process to the rest of the processes, and this should solve the issue.
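Roughly like this (a sketch, not the exact lmms-eval code; it assumes `get_datetime_str` is importable from `lmms_eval.utils` as in the linked `__main__.py`, and uses accelerate's `broadcast_object_list`):

```python
# Sketch: generate the datetime suffix once on the main process (in cli_evaluate,
# before cli_evaluate_single runs) and broadcast it to all ranks, so every
# rank{}_metric_eval_done.txt file lands in the same directory.
from accelerate import Accelerator
from accelerate.utils import broadcast_object_list

from lmms_eval import utils


def get_shared_datetime_str(timezone: str) -> str:
    accelerator = Accelerator()
    # Only the main process decides the suffix; every other rank receives it.
    payload = [utils.get_datetime_str(timezone=timezone) if accelerator.is_main_process else None]
    broadcast_object_list(payload, from_process=0)
    return payload[0]
```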
When I do this on my codebase, multi-node works just fine (with the changes you suggested earlier).
Best, Orr
Hi,
Is multi-node evaluation possible? If yes, how?
Best, Orr