NVIDIA / NVFlare

NVIDIA Federated Learning Application Runtime Environment
https://nvidia.github.io/NVFlare/
Apache License 2.0
592 stars, 165 forks

[BUG] RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists. #2698

Closed KumoLiu closed 1 month ago

KumoLiu commented 1 month ago
2024-07-14 10:02:17,649 - INFO - Load site-1 weights...
2024-07-14 10:02:17,652 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,654 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,661 - Communicator - INFO - Received from secure_project server. getTask: train size: 19.3MB (19280090 Bytes) time: 0.301346 seconds
2024-07-14 10:02:17,661 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781]: got task assignment: name=train, id=00b8bb4c-1fbd-421d-81b3-19472481fd48
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: invoking task executor ClientAlgoExecutor
2024-07-14 10:02:17,662 - INFO - Start site-1 evaluating...
2024-07-14 10:02:17,662 - ClientAlgoExecutor - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: Client trainer got task: train
2024-07-14 10:02:17,662 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,662 - INFO - Load site-2 weights...
2024-07-14 10:02:17,664 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,665 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,672 - INFO - Start site-2 evaluating...
2024-07-14 10:02:17,672 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,743 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,744 - ERROR - Exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: client_algo execute exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 114, in execute
    return self.train(shareable, fl_ctx, abort_signal)
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 132, in train
    test_report = self.client_algo.evaluate(exchangeobj_from_shareable(shareable))
  File "/usr/local/lib/python3.10/dist-packages/monai/fl/client/monai_algo.py", line 664, in evaluate
    self.evaluator.run(self.trainer.state.epoch + 1)
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/evaluator.py", line 150, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/workflow.py", line 283, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 636, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.

It looks like there is an issue when running the MONAI real-world example. When site-2 starts evaluating, the monai_nvflare experiment already exists (site-1 has just created it), so the create_experiment call fails. This case should be handled.

KumoLiu commented 1 month ago

cc @YuanTingHsieh @holgerroth

KumoLiu commented 1 month ago

I tried adding a try-except block to the _set_experiment method of the MLflow handler, but I'm not sure whether it achieves the desired behavior.

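    # Module-level imports this method relies on:
    #   import time
    #   from mlflow.exceptions import MlflowException

    def _set_experiment(self):
        experiment = self.experiment
        if not experiment:
            for _ in range(3):
                try:
                    experiment = self.client.get_experiment_by_name(self.experiment_name)
                    if not experiment:
                        # Another site may create the experiment between the lookup
                        # above and this call, which raises RESOURCE_ALREADY_EXISTS.
                        experiment_id = self.client.create_experiment(self.experiment_name)
                        experiment = self.client.get_experiment(experiment_id)
                    break
                except MlflowException as e:
                    if "RESOURCE_ALREADY_EXISTS" not in str(e):
                        raise
                    # Lost the race: wait, then retry the lookup.
                    # Assumes self.retry_delay (seconds) is defined on the handler.
                    time.sleep(self.retry_delay)

YuanTingHsieh commented 1 month ago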

@KumoLiu what if we add a line asking people to create this experiment first?

Like a one-line snippet using MLflow to create that experiment?
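
For illustration, a pre-creation step along these lines could look roughly like the sketch below. It is only a sketch under assumptions: `ensure_experiment` is a hypothetical helper name, and `client` is assumed to behave like `mlflow.tracking.MlflowClient` (it only needs `get_experiment_by_name` and `create_experiment`). The tracking URI in the usage comment is a placeholder.

```python
def ensure_experiment(client, name):
    """Return the id of experiment `name`, creating it if it does not exist yet.

    `client` is assumed to behave like mlflow.tracking.MlflowClient, i.e. to
    offer get_experiment_by_name(name) and create_experiment(name).
    """
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        # Already there: nothing to create, just report its id.
        return existing.experiment_id
    return client.create_experiment(name)


# Intended use, run once by one person before submitting the job
# (the URI below is a placeholder, not a value from this thread):
#
#   from mlflow.tracking import MlflowClient
#   ensure_experiment(MlflowClient(tracking_uri="http://localhost:5000"), "monai_nvflare")
```

Because this runs once, before any site starts, the lookup-then-create sequence is not exposed to the concurrent creation seen in the log above.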

KumoLiu commented 1 month ago

> @KumoLiu what if we add a line asking people to create this experiment first?
>
> Like a one-line snippet using MLflow to create that experiment?

Hi @YuanTingHsieh, thanks for the suggestion! The MLFlowHandler is included inside the bundle, so the experiment is created by the handler itself. The issue here is that when two sites try to create the experiment at the same time, the second call throws this error. One possible solution is to catch the error when creating the experiment. What do you think?
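
To make the race concrete, here is a small self-contained simulation. It does not use MLflow: `FakeTrackingStore`, `AlreadyExists`, `get_or_create`, `run_two_sites` and the `after_lookup` hook are all hypothetical names made up for this sketch. Two simulated sites both miss the lookup before either one creates the experiment, and the loser of the race catches the error and fetches the existing experiment instead of failing.

```python
import threading


class AlreadyExists(Exception):
    """Stand-in for the RESOURCE_ALREADY_EXISTS error from the tracking server."""


class FakeTrackingStore:
    """Hypothetical in-memory stand-in for a tracking server (not an MLflow class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ids = {}

    def get_experiment_by_name(self, name):
        with self._lock:
            return self._ids.get(name)

    def create_experiment(self, name):
        with self._lock:
            if name in self._ids:
                raise AlreadyExists(f"Experiment '{name}' already exists.")
            self._ids[name] = str(len(self._ids) + 1)
            return self._ids[name]


def get_or_create(store, name, after_lookup=lambda: None):
    """Look the experiment up, create it if missing, and tolerate losing the race."""
    experiment_id = store.get_experiment_by_name(name)
    after_lookup()  # demo hook: lets the simulation force both sites past the lookup
    if experiment_id is not None:
        return experiment_id
    try:
        return store.create_experiment(name)
    except AlreadyExists:
        # Another site created it in the meantime, so use that one.
        return store.get_experiment_by_name(name)


def run_two_sites(name="monai_nvflare"):
    """Run two simulated sites that both miss the lookup before either creates."""
    store = FakeTrackingStore()
    barrier = threading.Barrier(2)  # both sites wait here after their lookup
    results = {}

    def site(site_name):
        results[site_name] = get_or_create(store, name, after_lookup=barrier.wait)

    threads = [threading.Thread(target=site, args=(s,)) for s in ("site-1", "site-2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_two_sites()` returns the same experiment id for both sites even though one `create_experiment` call fails. In the handler, the `except AlreadyExists` branch would correspond to catching `MlflowException` and checking for `RESOURCE_ALREADY_EXISTS`, as in the snippet earlier in this thread.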