The duplicate key problem is real. You have two containers, but one is reporting metrics before the other even says Creating _PyTorchTrialController with OneVarPytorchTrial. The first time the second container reports metrics, you get the error: normally only the chief worker reports metrics, so when each container believes it is the only worker, both report the same steps_completed for the same trial, and the second insert violates the unique constraint.
Since I see the [rank=N] bits in the logs, and since you are using the legacy entrypoint format (that is, model_def:TrialClass), you are definitely inside of horovodrun, but somehow your PyTorchTrial isn't recognizing that there are two workers that need to coordinate.
Have you modified the determined library by chance? This strikes me as an impossible bug.
If you haven't modified it, please add the following print statements in your OneVarPytorchTrial.__init__():
print('rank', context.distributed.rank)
print('size', context.distributed.size)
print('local_rank', context.distributed.local_rank)
print('local_size', context.distributed.local_size)
print('cross_rank', context.distributed.cross_rank)
print('cross_size', context.distributed.cross_size)
and share the resulting logs.
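As one extra cross-check, beyond the prints above and only if it's easy, you could also print what Horovod itself reports in the same __init__ (this is a suggestion assuming horovod.torch is importable in your image; a mismatch between Horovod and the DistributedContext would narrow things down):

# Optional cross-check (assumes horovod.torch is available in the image):
# if Horovod is initialized in this process, this should print ranks 0/1 with size 2;
# if it is not initialized, the exception itself is useful information.
try:
    import horovod.torch as hvd
    print('hvd rank/size:', hvd.rank(), hvd.size())
except Exception as e:
    print('horovod not initialized or unavailable:', e)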
Thank you for your reply. I didn't modify the determined library; I added the print statements to __init__ as you requested. Here are the resulting logs:
(base) PS A:\mage-main> det e create const.yaml . -f
Preparing files to send to master... 4.6MB and 50 files
Created experiment 268
Waiting for first trial to begin...
Following first trial with ID 268
[2023-09-21T00:34:25.295069Z] || INFO: Scheduling Trial 268 (Experiment 268) (id: 0960a169-dc77-4d0c-bd28-041924182b4b)
[2023-09-21T00:34:25.589225Z] || INFO: Trial 268 (Experiment 268) was assigned to an agent
[2023-09-21T00:34:25.594080Z] 37d23717 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.616409Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.666777Z] 63e6f494 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.683049Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.694207Z] 63e6f494 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.703628Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.712071Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.718890Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.716124Z] 37d23717 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.725118Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.731812Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.739570Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.796246Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.856905Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.926394Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.985663Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.084340Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.167414Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.462409Z] 37d23717 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:26.499189Z] 37d23717 ||
[2023-09-21T00:34:26.499346Z] 37d23717 || ==========
[2023-09-21T00:34:26.499381Z] 37d23717 || == CUDA ==
[2023-09-21T00:34:26.499578Z] 37d23717 || ==========
[2023-09-21T00:34:26.504997Z] 37d23717 ||
[2023-09-21T00:34:26.505109Z] 37d23717 || CUDA Version 11.3.1
[2023-09-21T00:34:26.507156Z] 37d23717 ||
[2023-09-21T00:34:26.507163Z] 37d23717 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:26.509425Z] 37d23717 ||
[2023-09-21T00:34:26.509431Z] 37d23717 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:26.509436Z] 37d23717 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:26.509441Z] 37d23717 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:26.509447Z] 37d23717 ||
[2023-09-21T00:34:26.509452Z] 37d23717 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:26.528102Z] 37d23717 ||
[2023-09-21T00:34:27.156123Z] 63e6f494 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:27.160120Z] 63e6f494 ||
[2023-09-21T00:34:27.160169Z] 63e6f494 || ==========
[2023-09-21T00:34:27.160174Z] 63e6f494 || == CUDA ==
[2023-09-21T00:34:27.160632Z] 63e6f494 || ==========
[2023-09-21T00:34:27.164706Z] 63e6f494 ||
[2023-09-21T00:34:27.164762Z] 63e6f494 || CUDA Version 11.3.1
[2023-09-21T00:34:27.166361Z] 63e6f494 ||
[2023-09-21T00:34:27.166369Z] 63e6f494 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:27.167989Z] 63e6f494 ||
[2023-09-21T00:34:27.167991Z] 63e6f494 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:27.168005Z] 63e6f494 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:27.168008Z] 63e6f494 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:27.168010Z] 63e6f494 ||
[2023-09-21T00:34:27.168012Z] 63e6f494 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:27.182121Z] 63e6f494 ||
[2023-09-21T00:34:29.553915Z] 37d23717 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:34:30.365417Z] 63e6f494 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:35:27.793792Z] 37d23717 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:27.794276Z] 37d23717 || INFO: [43] root: Running task container on agent_id=determined-agent-0, hostname=wsn640-1 with visible GPUs ['GPU-6f1d5eb5-53d7-ebaa-6e21-113d1d122ce0']
[2023-09-21T00:35:27.856594Z] 37d23717 || + test -f startup-hook.sh
[2023-09-21T00:35:27.856606Z] 37d23717 || + set +x
[2023-09-21T00:35:32.646705Z] 63e6f494 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:32.647235Z] 63e6f494 || INFO: [43] root: Running task container on agent_id=wsn640-2, hostname=wsn640-2 with visible GPUs ['GPU-e5946e97-eb13-29bb-d4da-01dd9101bd2d']
[2023-09-21T00:35:33.777840Z] 63e6f494 || + test -f startup-hook.sh
[2023-09-21T00:35:33.777863Z] 63e6f494 || + set +x
h1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.221918Z] 63e6f494 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.222941Z] 63e6f494 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:35:34.215061Z] 37d23717 || INFO: [49] root: New trial runner in (container 37d23717-f4a4-436c-9dff-8e16b9b0d233) on agent determined-agent-0: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/root/.local/share/determined", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": "rb-pytorch-onevar", "entrypoint": "model_def:OneVarPytorchTrial", "environment": {"image": {"cpu": "determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca", "cuda": "zxy1998/wsn640:pigMage-pytoch1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.215073Z] 37d23717 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.215615Z] 37d23717 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:39:58.992200Z] 63e6f494 || Warning: Permanently added '[192.168.123.83]:12350' (RSA) to the list of known hosts.
[2023-09-21T00:39:58.992215Z] 63e6f494 ||
[2023-09-21T00:40:06.101146Z] 37d23717 [rank=1] || rank 0
[2023-09-21T00:40:06.101157Z] 37d23717 [rank=1] || size 1
[2023-09-21T00:40:06.101162Z] 37d23717 [rank=1] || local_rank 0
[2023-09-21T00:40:06.101166Z] 37d23717 [rank=1] || local_size 1
[2023-09-21T00:40:06.101171Z] 37d23717 [rank=1] || cross_rank 0
[2023-09-21T00:40:06.101176Z] 37d23717 [rank=1] || cross_size 1
[2023-09-21T00:40:06.101315Z] 37d23717 [rank=1] || INFO: [94] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:06.249983Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.558826Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:06.687340Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.700000Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-21T00:40:06.818668Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.832929Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=300, metrics={'loss': 0.6084386706352234, 'custom_metric': 2.5})
[2023-09-21T00:40:06.949258Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.963984Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=400, metrics={'loss': 0.4980113208293915, 'custom_metric': 2.5})
[2023-09-21T00:40:07.086359Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.098676Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=500, metrics={'loss': 0.4076557159423828, 'custom_metric': 2.5})
[2023-09-21T00:40:07.221265Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.235117Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=600, metrics={'loss': 0.33367738127708435, 'custom_metric': 2.5})
[2023-09-21T00:40:07.237316Z] 63e6f494 [rank=0] || rank 0
[2023-09-21T00:40:07.237324Z] 63e6f494 [rank=0] || size 1
[2023-09-21T00:40:07.237332Z] 63e6f494 [rank=0] || local_rank 0
[2023-09-21T00:40:07.237337Z] 63e6f494 [rank=0] || local_size 1
[2023-09-21T00:40:07.237342Z] 63e6f494 [rank=0] || cross_rank 0
[2023-09-21T00:40:07.237346Z] 63e6f494 [rank=0] || cross_size 1
[2023-09-21T00:40:07.237501Z] 63e6f494 [rank=0] || INFO: [197] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:07.369833Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.384608Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=700, metrics={'loss': 0.2731218636035919, 'custom_metric': 2.5})
[2023-09-21T00:40:07.393932Z] 63e6f494 [rank=0] || got 400 outputs
[2023-09-21T00:40:07.504618Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.519455Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=800, metrics={'loss': 0.22355258464813232, 'custom_metric': 2.5})
[2023-09-21T00:40:07.640500Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.655329Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=900, metrics={'loss': 0.18301747739315033, 'custom_metric': 2.5})
[2023-09-21T00:40:07.685031Z] 63e6f494 [rank=0] || INFO: [197] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:07.705741Z] 63e6f494 [rank=0] || Traceback (most recent call last):
[2023-09-21T00:40:07.705745Z] 63e6f494 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-09-21T00:40:07.705943Z] 63e6f494 [rank=0] || return _run_code(code, main_globals, None,
[2023-09-21T00:40:07.705946Z] 63e6f494 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2023-09-21T00:40:07.706072Z] 63e6f494 [rank=0] || exec(code, run_globals)
[2023-09-21T00:40:07.706075Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 208, in <module>
[2023-09-21T00:40:07.706248Z] 63e6f494 [rank=0] || sys.exit(main(args.train_entrypoint))
[2023-09-21T00:40:07.706250Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 43, in main
[2023-09-21T00:40:07.706354Z] 63e6f494 [rank=0] || return _run_pytorch_trial(trial_class, info)
[2023-09-21T00:40:07.706356Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 185, in _run_pytorch_trial
[2023-09-21T00:40:07.706505Z] 63e6f494 [rank=0] || trainer.fit(
[2023-09-21T00:40:07.706509Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_trainer.py", line 189, in fit
[2023-09-21T00:40:07.706713Z] 63e6f494 [rank=0] || trial_controller.run()
[2023-09-21T00:40:07.706717Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2023-09-21T00:40:07.707046Z] 63e6f494 [rank=0] || self._run()
[2023-09-21T00:40:07.707049Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2023-09-21T00:40:07.707354Z] 63e6f494 [rank=0] || self._train_for_op(
[2023-09-21T00:40:07.707357Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 748, in _train_for_op
[2023-09-21T00:40:07.707700Z] 63e6f494 [rank=0] || metrics = self._aggregate_training_metrics(training_metrics)
[2023-09-21T00:40:07.707702Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 346, in _aggregate_training_metrics
[2023-09-21T00:40:07.707888Z] 63e6f494 [rank=0] || self.core_context.train.report_training_metrics(
[2023-09-21T00:40:07.707891Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 95, in report_training_metrics
[2023-09-21T00:40:07.708003Z] 63e6f494 [rank=0] || self._session.post(
[2023-09-21T00:40:07.708005Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 79, in post
[2023-09-21T00:40:07.708118Z] 63e6f494 [rank=0] || return self._do_request("POST", path, params, json, data, headers, timeout, False)
[2023-09-21T00:40:07.708120Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 36, in _do_request
[2023-09-21T00:40:07.708211Z] 63e6f494 [rank=0] || return request.do_request(
[2023-09-21T00:40:07.708213Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/request.py", line 168, in do_request
[2023-09-21T00:40:07.708346Z] 63e6f494 [rank=0] || raise errors.APIException(r)
[2023-09-21T00:40:07.708388Z] 63e6f494 [rank=0] || determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
[2023-09-21T00:40:07.708391Z] 63e6f494 [rank=0] ||
[2023-09-21T00:40:07.792246Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.838419Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1000, metrics={'loss': 0.1498124748468399, 'custom_metric': 2.5})
[2023-09-21T00:40:07.907605Z] 37d23717 [rank=1] || got 96 outputs
[2023-09-21T00:40:07.911235Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1024, metrics={'loss': 0.13215720653533936, 'custom_metric': 2.5})
[2023-09-21T00:40:08.046530Z] 37d23717 [rank=1] || got 1024 outputs
[2023-09-21T00:40:08.046551Z] 37d23717 [rank=1] || INFO: [94] root: validated: 1024 records in 0.07591s (13490.0 records/s), in 256 batches (3373.0 batches/s)
[2023-09-21T00:40:08.107789Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_validation_metrics(steps_completed=1024, metrics={'loss': 0.12879968, 'custom_metric': 2.5})
[2023-09-21T00:40:08.213961Z] 37d23717 [rank=1] || INFO: [94] determined.core: Reported checkpoint to master cb890c7f-b9ef-497a-a724-46a0815204f9
[2023-09-21T00:40:09.214839Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:09.214857Z] 63e6f494 || Primary job terminated normally, but 1 process returned
[2023-09-21T00:40:09.214861Z] 63e6f494 || a non-zero exit code. Per user-direction, the job has been aborted.
[2023-09-21T00:40:09.214864Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227703Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227712Z] 63e6f494 || mpirun detected that one or more processes exited with non-zero status, thus causing
[2023-09-21T00:40:11.227715Z] 63e6f494 || the job to be terminated. The first process to do so was:
[2023-09-21T00:40:11.227718Z] 63e6f494 ||
[2023-09-21T00:40:11.227721Z] 63e6f494 || Process name: [[14234,1],0]
[2023-09-21T00:40:11.227723Z] 63e6f494 || Exit code: 1
[2023-09-21T00:40:11.227730Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:12.264690Z] 37d23717 || INFO: resources exited successfully with a zero exit code
[2023-09-21T00:40:13.400432Z] || ERROR: Trial 268 (Experiment 268) was terminated: allocation stopped after resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)
It looks like, as you say, the program does go inside horovodrun, but somehow PyTorchTrial fails to recognize that there are two workers that need to be coordinated.
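For reference, the experiment configuration embedded in the trial-runner log above corresponds roughly to a const.yaml like this (reconstructed from the JSON dump; fields not shown are defaults):

name: rb-name
description: rb-pytorch-onevar
entrypoint: model_def:OneVarPytorchTrial
environment:
  image:
    cuda: zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
hyperparameters:
  global_batch_size: 4
  i:
    type: int
    minval: 1
    maxval: 4
resources:
  slots_per_trial: 2
optimizations:
  aggregation_frequency: 2
max_restarts: 0
searcher:
  name: random
  metric: loss
  smaller_is_better: true
  max_trials: 1
  max_length:
    batches: 1024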
Hi @taroTan1997, did you intend to close this issue? It looks like you've demonstrated a clear bug in our system, so I think it should remain open. I think I'll still need to do more debugging with you, as I don't know how to reproduce this on my own.
For starters, can you describe your cluster installation? Which subcommand of det deploy are you using? When it's not so early for me, I can send you a script to run in place of your current script, to try to narrow down where the bug is occurring.
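For context, the common variants look roughly like this (illustrative; exact flags vary by version):

det deploy local cluster-up          # single machine: master + agent together
det deploy local master-up           # master only
det deploy local agent-up ...        # agent on a second machine, pointed at the master
det deploy aws up --cluster-id ...   # CloudFormation-based AWS cluster
det deploy gcp up --cluster-id ...   # Terraform-based GCP cluster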
Thank you for your reply. I tried a different Docker image, rebuilt from the official determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-mpi-0.24.0 image, and the problem seems to be solved. The problem appears to lie in the mpi module of the image I built before.
Thanks for following up. If you don't need the mpi image, then you should be good, right?
I'll get somebody to look into the mpi image anyway; this behavior doesn't seem right at all.
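If you do go the non-mpi route, switching images is just an environment.image override in the experiment config. Something like the following, where the tag is an assumption based on the image name you mentioned (the non-mpi sibling); check that it exists for your Determined version, or drop the override to use the default image:

environment:
  image:
    # example tag, not verified against your cluster's version
    cuda: determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-0.24.0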
Apparently it is expected behavior for the mpi image to not work except when running on a Slurm- or PBS Pro-backed cluster, which is only the case for some installations of the Determined Enterprise Edition (or whatever the right HPE name is; there are too many acronyms for me to remember).
So we are discussing how we can make it clearer how to avoid this pitfall. We are sorry it cost you time to figure out, but it isn't actually a bug.
@taroTan1997 I actually have a follow-up question: what path did you take to choose the mpi image? That will help me understand where best to document the pitfalls.
Describe your question
I ran my own program using an image I built myself, and on completing the test I found the following problem, which did not occur on a single GPU but appeared when trying multiple GPUs (one GPU on each of two machines):
determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
I searched previous issues but did not find a specific solution. I am using Determined version 0.22.2, deployed with det deploy. To reproduce my error, I adapted a model provided in a previous issue; the code is essentially as follows.
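A minimal sketch of the kind of one-variable PyTorchTrial used for the reproduction, written against the standard det.pytorch API (the dataset, hyperparameter names, and the fixed custom_metric value are illustrative, not the exact code from the earlier issue):

import torch
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

class OnesDataset(torch.utils.data.Dataset):
    # 1024 copies of (x=1.0, y=0.0): the optimal weight is 0, so loss shrinks steadily.
    def __len__(self):
        return 1024
    def __getitem__(self, idx):
        return torch.tensor([1.0]), torch.tensor([0.0])

class OneVarPytorchTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext):
        self.context = context
        # The debug prints requested above would go here.
        self.model = context.wrap_model(torch.nn.Linear(1, 1, bias=False))
        self.optimizer = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.001)
        )

    def build_training_data_loader(self):
        return DataLoader(OnesDataset(), batch_size=self.context.get_per_slot_batch_size())

    def build_validation_data_loader(self):
        return DataLoader(OnesDataset(), batch_size=self.context.get_per_slot_batch_size())

    def train_batch(self, batch, epoch_idx, batch_idx):
        data, label = batch
        loss = torch.nn.functional.mse_loss(self.model(data), label)
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss, "custom_metric": 2.5}

    def evaluate_batch(self, batch):
        data, label = batch
        loss = torch.nn.functional.mse_loss(self.model(data), label)
        return {"loss": loss, "custom_metric": 2.5}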
I'm a newbie and any tips and guidance would be greatly appreciated!