determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🤔[question] duplicate key value violates unique constraint "steps_trial_id_total_batches_run_id_unique" #7939

Closed tzj-scau closed 1 year ago

tzj-scau commented 1 year ago

Describe your question

I ran my own program using an image I built myself. The following problem does not occur on a single GPU, but it does when I use multiple GPUs (one GPU on each of two machines):

determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}

I searched previous issues but did not find a specific solution. I am using Determined version 0.22.2, deployed with det deploy. To reproduce the error, I used a model provided in a previous issue. The configuration and code that reproduce it are as follows:

name: rb-name
description: rb-pytorch-onevar
entrypoint: model_def:OneVarPytorchTrial
hyperparameters:
  global_batch_size: 4
  i:
    type: int
    minval: 1
    maxval: 4
searcher:
   name: random
   max_trials: 1
   metric: loss
   max_length:
     batches: 1024
   smaller_is_better: true

scheduling_unit: 100

max_restarts: 0

resources:
  slots_per_trial: 2

optimizations:
  average_training_metrics: false
  aggregation_frequency: 2

environment:
  image:
    # gpu: zxy1998/wsn640:pigMage-0.22.2
    # gpu: zxy1998/wsn640:pigMage-pytoch1.10-mpi
    gpu: zxy1998/wsn640:pigMage-pytoch1.10.2-ompi

from typing import Any, Dict, Sequence, Tuple, Union, cast, List, Callable

import torch
from torch import nn

from determined import pytorch

class OnesDataset(torch.utils.data.Dataset):
    def __len__(self) -> int:
        return 1024

    def __getitem__(self, index: int) -> torch.Tensor:
        return torch.Tensor([1.0])

def custom_reducer(outputs: List[Any]) -> Any:
    outputs = [o for output in outputs for o in output]
    print(f'got {len(outputs)} outputs')
    return 2.5

class OneVarPytorchTrial(pytorch.PyTorchTrial):
    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        # context.env.experiment_config["records_per_epoch"] = 314
        self.context = context

        self.model = context.wrap_model(nn.Linear(1, 1, False))
        # initialize weights to 0
        self.model.weight.data.fill_(0)
        self.opt = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.001), backward_passes_per_step=2
        )
        self.fnreducer = self.context.wrap_reducer(custom_reducer, "custom_metric")

    def train_batch(
        self, batch: pytorch.TorchData, epoch_idx: int, batch_idx: int
    ) -> Dict[str, torch.Tensor]:
        loss = torch.nn.MSELoss()(self.model(batch), batch)
        self.context.backward(loss)
        self.context.step_optimizer(self.opt)
        self.fnreducer.update(list(batch))
        return {"loss": loss}

    def evaluate_batch(self, batch: pytorch.TorchData, batch_idx: int) -> Dict[str, Any]:
        data = labels = batch
        loss = torch.nn.MSELoss()(self.model(data), labels)
        self.fnreducer.update(list(batch))
        return {"loss": loss}

    def build_training_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(
            OnesDataset(), batch_size=self.context.get_per_slot_batch_size()
        )

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(
            OnesDataset(), batch_size=self.context.get_per_slot_batch_size()
        )

(base) PS A:\mage-main> det e create const.yaml . -f
Preparing files to send to master... 4.6MB and 50 files
Created experiment 266
Waiting for first trial to begin...
Following first trial with ID 266
[2023-09-20T12:12:24.146084Z]          || INFO: Scheduling Trial 266 (Experiment 266) (id: 13f0d348-8d7a-4aa3-ac08-cfeb50386346)
[2023-09-20T12:20:05.275485Z]          || INFO: Trial 266 (Experiment 266) was assigned to an agent
[2023-09-20T12:20:05.279254Z] 5abd5934 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-20T12:20:05.297539Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.382957Z] 5abd5934 || INFO: copying files to container: /run/determined
[2023-09-20T12:20:05.422194Z] 86d48338 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-20T12:20:05.437035Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.442678Z] 86d48338 || INFO: copying files to container: /run/determined
[2023-09-20T12:20:05.449843Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.457435Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.463729Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.469052Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.470159Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.476239Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.483279Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.558921Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.622343Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.715519Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.777951Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.862410Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:06.213706Z] 5abd5934 || INFO: Resources for Trial 266 (Experiment 266) have started
[2023-09-20T12:20:06.213999Z] 5abd5934 ||
[2023-09-20T12:20:06.214117Z] 5abd5934 || ==========
[2023-09-20T12:20:06.214171Z] 5abd5934 || == CUDA ==
[2023-09-20T12:20:06.214276Z] 5abd5934 || ==========
[2023-09-20T12:20:06.220007Z] 5abd5934 ||
[2023-09-20T12:20:06.220047Z] 5abd5934 || CUDA Version 11.3.1
[2023-09-20T12:20:06.221602Z] 5abd5934 ||
[2023-09-20T12:20:06.221607Z] 5abd5934 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-20T12:20:06.223653Z] 5abd5934 ||
[2023-09-20T12:20:06.223659Z] 5abd5934 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-20T12:20:06.223664Z] 5abd5934 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-20T12:20:06.223669Z] 5abd5934 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-20T12:20:06.223674Z] 5abd5934 ||
[2023-09-20T12:20:06.223679Z] 5abd5934 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-20T12:20:06.243968Z] 5abd5934 ||
[2023-09-20T12:20:06.924063Z] 86d48338 || 
[2023-09-20T12:20:06.924330Z] 86d48338 || ==========
[2023-09-20T12:20:06.924338Z] 86d48338 || == CUDA ==
[2023-09-20T12:20:06.924528Z] 86d48338 || ==========
[2023-09-20T12:20:06.930949Z] 86d48338 || INFO: Resources for Trial 266 (Experiment 266) have started
[2023-09-20T12:20:06.930083Z] 86d48338 ||
[2023-09-20T12:20:06.930087Z] 86d48338 || CUDA Version 11.3.1
[2023-09-20T12:20:06.932329Z] 86d48338 ||
[2023-09-20T12:20:06.932333Z] 86d48338 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-20T12:20:06.934178Z] 86d48338 || 
[2023-09-20T12:20:06.934183Z] 86d48338 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-20T12:20:06.934186Z] 86d48338 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-20T12:20:06.934188Z] 86d48338 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-20T12:20:06.934190Z] 86d48338 || 
[2023-09-20T12:20:06.934192Z] 86d48338 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-20T12:20:06.949949Z] 86d48338 || 
[2023-09-20T12:20:09.355306Z] 5abd5934 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-20T12:20:10.112857Z] 86d48338 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-20T12:21:07.867141Z] 5abd5934 || INFO: [43] root: detected 1 gpus
[2023-09-20T12:21:07.867648Z] 5abd5934 || INFO: [43] root: Running task container on agent_id=determined-agent-0, hostname=wsn640-1 with visible GPUs ['GPU-6f1d5eb5-53d7-ebaa-6e21-113d1d122ce0']
[2023-09-20T12:21:07.917117Z] 5abd5934 || + test -f startup-hook.sh
[2023-09-20T12:21:07.917135Z] 5abd5934 || + set +x
[2023-09-20T12:21:11.965910Z] 86d48338 || INFO: [43] root: detected 1 gpus
[2023-09-20T12:21:11.966612Z] 86d48338 || INFO: [43] root: Running task container on agent_id=wsn640-2, hostname=wsn640-2 with visible GPUs ['GPU-e5946e97-eb13-29bb-d4da-01dd9101bd2d']
[2023-09-20T12:21:13.145995Z] 86d48338 || + test -f startup-hook.sh
[2023-09-20T12:21:13.146008Z] 86d48338 || + set +x
[2023-09-20T12:21:13.583150Z] 5abd5934 || INFO: [49] root: New trial runner in (container 5abd5934-6326-49ff-a857-3e73fdde181b) on agent determined-agent-0: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/root/.local/share/determined", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": "rb-pytorch-onevar", "entrypoint": "model_def:OneVarPytorchTrial", "environment": {"image": {"cpu": "determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca", "cuda": "zxy1998/wsn640:pigMage-pytoch1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695211943}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-20T12:21:13.583168Z] 5abd5934 || INFO: [49] root: Validating checkpoint storage ...
"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695211943}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-20T12:21:13.601214Z] 86d48338 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-20T12:21:13.601804Z] 86d48338 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-20T12:25:38.720517Z] 5abd5934 || Warning: Permanently added '[192.168.123.52]:12350' (RSA) to the list of known hosts.
[2023-09-20T12:25:38.720527Z] 5abd5934 || 
[2023-09-20T12:25:46.015498Z] 5abd5934 [rank=0] || INFO: [198] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-20T12:25:46.156717Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.470743Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-20T12:25:46.595768Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.607934Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-20T12:25:46.720786Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.734224Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=300, metrics={'loss': 0.6084386706352234, 'custom_metric': 2.5})
[2023-09-20T12:25:46.855203Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.867799Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=400, metrics={'loss': 0.4980113208293915, 'custom_metric': 2.5})
[2023-09-20T12:25:46.978375Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.991437Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=500, metrics={'loss': 0.4076557159423828, 'custom_metric': 2.5})
[2023-09-20T12:25:47.102772Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.116789Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=600, metrics={'loss': 0.33367738127708435, 'custom_metric': 2.5})
[2023-09-20T12:25:47.234260Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.248523Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=700, metrics={'loss': 0.2731218636035919, 'custom_metric': 2.5})
[2023-09-20T12:25:47.273350Z] 86d48338 [rank=1] || INFO: [96] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-20T12:25:47.363643Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.381739Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=800, metrics={'loss': 0.22355258464813232, 'custom_metric': 2.5})
[2023-09-20T12:25:47.472577Z] 86d48338 [rank=1] || got 400 outputs
[2023-09-20T12:25:47.496099Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.511928Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=900, metrics={'loss': 0.18301747739315033, 'custom_metric': 2.5})
[2023-09-20T12:25:47.624951Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.637577Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=1000, metrics={'loss': 0.1498124748468399, 'custom_metric': 2.5})
[2023-09-20T12:25:47.706644Z] 5abd5934 [rank=0] || got 96 outputs
[2023-09-20T12:25:47.726261Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=1024, metrics={'loss': 0.13215720653533936, 'custom_metric': 2.5})
[2023-09-20T12:25:47.782496Z] 86d48338 [rank=1] || INFO: [96] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-20T12:25:47.821201Z] 86d48338 [rank=1] || Traceback (most recent call last):
[2023-09-20T12:25:47.821214Z] 86d48338 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-09-20T12:25:47.821219Z] 86d48338 [rank=1] ||     return _run_code(code, main_globals, None,
[2023-09-20T12:25:47.821224Z] 86d48338 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2023-09-20T12:25:47.821244Z] 86d48338 [rank=1] ||     exec(code, run_globals)      
[2023-09-20T12:25:47.821256Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 208, in <module>
[2023-09-20T12:25:47.821340Z] 86d48338 [rank=1] ||     sys.exit(main(args.train_entrypoint))
[2023-09-20T12:25:47.821344Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 43, in main     
[2023-09-20T12:25:47.821371Z] 86d48338 [rank=1] ||     return _run_pytorch_trial(trial_class, info)
[2023-09-20T12:25:47.821375Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 185, in _run_pytorch_trial
[2023-09-20T12:25:47.821441Z] 86d48338 [rank=1] ||     trainer.fit(
[2023-09-20T12:25:47.821444Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_trainer.py", line 189, in fit 
[2023-09-20T12:25:47.821549Z] 86d48338 [rank=1] ||     trial_controller.run()       
[2023-09-20T12:25:47.821555Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2023-09-20T12:25:47.821676Z] 86d48338 [rank=1] ||     self._run()
[2023-09-20T12:25:47.821685Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2023-09-20T12:25:47.821812Z] 86d48338 [rank=1] ||     self._train_for_op(
[2023-09-20T12:25:47.821814Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 748, in _train_for_op
[2023-09-20T12:25:47.821964Z] 86d48338 [rank=1] ||     metrics = self._aggregate_training_metrics(training_metrics)
[2023-09-20T12:25:47.821967Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 346, in _aggregate_training_metrics
[2023-09-20T12:25:47.822056Z] 86d48338 [rank=1] ||     self.core_context.train.report_training_metrics(
[2023-09-20T12:25:47.822059Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/
[2023-09-20T12:25:47.822241Z] 86d48338 [rank=1] || determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
[2023-09-20T12:25:47.822250Z] 86d48338 [rank=1] ||
[2023-09-20T12:25:47.865924Z] 5abd5934 [rank=0] || got 1024 outputs
[2023-09-20T12:25:47.865936Z] 5abd5934 [rank=0] || INFO: [198] root: validated: 1024 records in 0.07906s (12952.0 records/s), in 256 batches (3238.0 batches/s)
[2023-09-20T12:25:47.895921Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_validation_metrics(steps_completed=1024, metrics={'loss': 0.12879968, 'custom_metric': 2.5})
[2023-09-20T12:25:47.999638Z] 5abd5934 [rank=0] || INFO: [198] determined.core: Reported checkpoint to master 7aae78b5-019c-4a35-a189-28020fc196d9
[2023-09-20T12:25:48.844431Z] 5abd5934 || --------------------------------------------------------------------------  
[2023-09-20T12:25:48.844440Z] 5abd5934 || Primary job  terminated normally, but 1 process returned
[2023-09-20T12:25:48.844446Z] 5abd5934 || a non-zero exit code. Per user-direction, the job has been aborted.
[2023-09-20T12:25:48.844449Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.237399Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.237405Z] 5abd5934 || mpirun detected that one or more processes exited with non-zero status, thus causing
[2023-09-20T12:25:52.237409Z] 5abd5934 || the job to be terminated. The first process to do so was:
[2023-09-20T12:25:52.237414Z] 5abd5934 ||
[2023-09-20T12:25:52.237420Z] 5abd5934 ||   Process name: [[2293,1],1]
[2023-09-20T12:25:52.237426Z] 5abd5934 ||   Exit code:    1
[2023-09-20T12:25:52.237431Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.471027Z]          || INFO: forcibly killing allocation's remaining resources (reason: resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1))
[2023-09-20T12:25:52.637204Z]          || DEBUG: Trial 266 (Experiment 266) was terminated: allocation stopped after resources failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 266

I'm a newbie and any tips and guidance would be greatly appreciated!

rb-determined-ai commented 1 year ago

The duplicate key problem is real. You have two containers, but one is reporting metrics before the other even says Creating _PyTorchTrialController with OneVarPytorchTrial.

The first time the second container reports metrics, you get the error.
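
To illustrate the failure mode, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the PostgreSQL database behind the master. The table layout is an assumption: the three key columns are guessed from the constraint name steps_trial_id_total_batches_run_id_unique, and the helper insert_metrics and the run_id value are hypothetical. It only shows that a second insert for the same trial at the same batch count is rejected:

```python
import sqlite3

# Hypothetical, simplified stand-in for the metrics table. The column names are
# guessed from the constraint name in the error message, not from the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE steps (
        trial_id INTEGER NOT NULL,
        total_batches INTEGER NOT NULL,
        run_id INTEGER NOT NULL,
        metrics TEXT,
        UNIQUE (trial_id, total_batches, run_id)
    )
    """
)


def insert_metrics(trial_id: int, total_batches: int, run_id: int, metrics: str) -> bool:
    """Insert one metrics row; return False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (trial_id, total_batches, run_id, metrics),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The first worker reports batch 100 for trial 266 and succeeds.
first = insert_metrics(266, 100, 1, "loss=0.908")
# A second worker reporting the same trial and batch count collides with it.
second = insert_metrics(266, 100, 1, "loss=0.908")
print(first, second)  # prints: True False
```

In the log above, rank 0 had already reported steps_completed=100 by the time rank 1 tried to report the same value, which matches this pattern.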

Since I see the [rank=N] bits in the logs, and since you are using the legacy entrypoint format (that is, model_def:TrialClass), you are definitely running inside horovodrun, but somehow your PyTorchTrial isn't recognizing that there are two workers that need to coordinate.

Have you modified the determined library by chance? This strikes me as an impossible bug.

If you haven't modified it, please add these print statements to your OneVarPytorchTrial.__init__():

print('rank', context.distributed.rank)
print('size', context.distributed.size)
print('local_rank', context.distributed.local_rank)
print('local_size', context.distributed.local_size)
print('cross_rank', context.distributed.cross_rank)
print('cross_size', context.distributed.cross_size)

and share the resulting logs.
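
If it is easier to read, the same values could be collected into a single log line with a small helper like the sketch below. The names describe_distributed and sees_all_workers are hypothetical, and the SimpleNamespace object is only a stand-in so the sketch runs without a cluster. In a real trial you would pass context.distributed, which is assumed here to expose the six attributes listed above:

```python
from types import SimpleNamespace
from typing import Any, Dict

FIELDS = ("rank", "size", "local_rank", "local_size", "cross_rank", "cross_size")


def describe_distributed(dist: Any) -> Dict[str, int]:
    """Read the six rank/size attributes from a distributed-context-like object."""
    return {name: getattr(dist, name) for name in FIELDS}


def sees_all_workers(info: Dict[str, int], slots_per_trial: int) -> bool:
    """True when the worker's view of the world size matches the requested slots."""
    return info["size"] == slots_per_trial


# Stand-in for context.distributed: a worker that believes it is alone.
fake = SimpleNamespace(rank=0, size=1, local_rank=0, local_size=1, cross_rank=0, cross_size=1)
info = describe_distributed(fake)
print(info)
print("sees all workers:", sees_all_workers(info, slots_per_trial=2))
```

Inside __init__ this would be called as print(describe_distributed(context.distributed)). With slots_per_trial: 2, a worker that reports size 1 is not seeing its peer.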

tzj-scau commented 1 year ago

Thank you for your reply. I haven't modified the determined library. I added the print statements to __init__ as you requested, and this is the resulting log:

(base) PS A:\mage-main> det e create const.yaml . -f
Preparing files to send to master... 4.6MB and 50 files  
Created experiment 268
Waiting for first trial to begin...
Following first trial with ID 268
[2023-09-21T00:34:25.295069Z]          || INFO: Scheduling Trial 268 (Experiment 268) (id: 0960a169-dc77-4d0c-bd28-041924182b4b)
[2023-09-21T00:34:25.589225Z]          || INFO: Trial 268 (Experiment 268) was assigned to an agent
[2023-09-21T00:34:25.594080Z] 37d23717 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.616409Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.666777Z] 63e6f494 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.683049Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.694207Z] 63e6f494 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.703628Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.712071Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.718890Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.716124Z] 37d23717 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.725118Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.731812Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.739570Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.796246Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.856905Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.926394Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.985663Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.084340Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.167414Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.462409Z] 37d23717 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:26.499189Z] 37d23717 ||
[2023-09-21T00:34:26.499346Z] 37d23717 || ==========
[2023-09-21T00:34:26.499381Z] 37d23717 || == CUDA ==
[2023-09-21T00:34:26.499578Z] 37d23717 || ==========
[2023-09-21T00:34:26.504997Z] 37d23717 ||
[2023-09-21T00:34:26.505109Z] 37d23717 || CUDA Version 11.3.1
[2023-09-21T00:34:26.507156Z] 37d23717 ||
[2023-09-21T00:34:26.507163Z] 37d23717 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:26.509425Z] 37d23717 ||
[2023-09-21T00:34:26.509431Z] 37d23717 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:26.509436Z] 37d23717 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:26.509441Z] 37d23717 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:26.509447Z] 37d23717 ||
[2023-09-21T00:34:26.509452Z] 37d23717 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:26.528102Z] 37d23717 ||
[2023-09-21T00:34:27.156123Z] 63e6f494 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:27.160120Z] 63e6f494 ||
[2023-09-21T00:34:27.160169Z] 63e6f494 || ==========
[2023-09-21T00:34:27.160174Z] 63e6f494 || == CUDA ==
[2023-09-21T00:34:27.160632Z] 63e6f494 || ==========
[2023-09-21T00:34:27.164706Z] 63e6f494 ||
[2023-09-21T00:34:27.164762Z] 63e6f494 || CUDA Version 11.3.1
[2023-09-21T00:34:27.166361Z] 63e6f494 ||
[2023-09-21T00:34:27.166369Z] 63e6f494 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:27.167989Z] 63e6f494 ||
[2023-09-21T00:34:27.167991Z] 63e6f494 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:27.168005Z] 63e6f494 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:27.168008Z] 63e6f494 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:27.168010Z] 63e6f494 ||
[2023-09-21T00:34:27.168012Z] 63e6f494 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:27.182121Z] 63e6f494 ||
[2023-09-21T00:34:29.553915Z] 37d23717 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:34:30.365417Z] 63e6f494 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:35:27.793792Z] 37d23717 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:27.794276Z] 37d23717 || INFO: [43] root: Running task container on agent_id=determined-agent-0, hostname=wsn640-1 with visible GPUs ['GPU-6f1d5eb5-53d7-ebaa-6e21-113d1d122ce0']
[2023-09-21T00:35:27.856594Z] 37d23717 || + test -f startup-hook.sh
[2023-09-21T00:35:27.856606Z] 37d23717 || + set +x
[2023-09-21T00:35:32.646705Z] 63e6f494 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:32.647235Z] 63e6f494 || INFO: [43] root: Running task container on agent_id=wsn640-2, hostname=wsn640-2 with visible GPUs ['GPU-e5946e97-eb13-29bb-d4da-01dd9101bd2d']
[2023-09-21T00:35:33.777840Z] 63e6f494 || + test -f startup-hook.sh
[2023-09-21T00:35:33.777863Z] 63e6f494 || + set +x
h1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.221918Z] 63e6f494 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.222941Z] 63e6f494 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:35:34.215061Z] 37d23717 || INFO: [49] root: New trial runner in (container 37d23717-f4a4-436c-9dff-8e16b9b0d233) on agent determined-agent-0: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/root/.local/share/determined", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": "rb-pytorch-onevar", "entrypoint": "model_def:OneVarPytorchTrial", "environment": {"image": {"cpu": "determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca", "cuda": "zxy1998/wsn640:pigMage-pytoch1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": 
{"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.215073Z] 37d23717 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.215615Z] 37d23717 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:39:58.992200Z] 63e6f494 || Warning: Permanently added '[192.168.123.83]:12350' (RSA) to the list of known hosts.
[2023-09-21T00:39:58.992215Z] 63e6f494 || 
[2023-09-21T00:40:06.101146Z] 37d23717 [rank=1] || rank 0
[2023-09-21T00:40:06.101157Z] 37d23717 [rank=1] || size 1
[2023-09-21T00:40:06.101162Z] 37d23717 [rank=1] || local_rank 0
[2023-09-21T00:40:06.101166Z] 37d23717 [rank=1] || local_size 1
[2023-09-21T00:40:06.101171Z] 37d23717 [rank=1] || cross_rank 0
[2023-09-21T00:40:06.101176Z] 37d23717 [rank=1] || cross_size 1
[2023-09-21T00:40:06.101315Z] 37d23717 [rank=1] || INFO: [94] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:06.249983Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.558826Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:06.687340Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.700000Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-21T00:40:06.818668Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.832929Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=300, metrics={'loss': 0.6084386706352234, 'custom_metric': 2.5})
[2023-09-21T00:40:06.949258Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.963984Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=400, metrics={'loss': 0.4980113208293915, 'custom_metric': 2.5})
[2023-09-21T00:40:07.086359Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.098676Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=500, metrics={'loss': 0.4076557159423828, 'custom_metric': 2.5})
[2023-09-21T00:40:07.221265Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.235117Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=600, metrics={'loss': 0.33367738127708435, 'custom_metric': 2.5})
[2023-09-21T00:40:07.237316Z] 63e6f494 [rank=0] || rank 0
[2023-09-21T00:40:07.237324Z] 63e6f494 [rank=0] || size 1
[2023-09-21T00:40:07.237332Z] 63e6f494 [rank=0] || local_rank 0
[2023-09-21T00:40:07.237337Z] 63e6f494 [rank=0] || local_size 1
[2023-09-21T00:40:07.237342Z] 63e6f494 [rank=0] || cross_rank 0
[2023-09-21T00:40:07.237346Z] 63e6f494 [rank=0] || cross_size 1
[2023-09-21T00:40:07.237501Z] 63e6f494 [rank=0] || INFO: [197] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:07.369833Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.384608Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=700, metrics={'loss': 0.2731218636035919, 'custom_metric': 2.5})
[2023-09-21T00:40:07.393932Z] 63e6f494 [rank=0] || got 400 outputs
[2023-09-21T00:40:07.504618Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.519455Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=800, metrics={'loss': 0.22355258464813232, 'custom_metric': 2.5})
[2023-09-21T00:40:07.640500Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.655329Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=900, metrics={'loss': 0.18301747739315033, 'custom_metric': 2.5})
[2023-09-21T00:40:07.685031Z] 63e6f494 [rank=0] || INFO: [197] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:07.705741Z] 63e6f494 [rank=0] || Traceback (most recent call last):
[2023-09-21T00:40:07.705745Z] 63e6f494 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-09-21T00:40:07.705943Z] 63e6f494 [rank=0] ||     return _run_code(code, main_globals, None,
[2023-09-21T00:40:07.705946Z] 63e6f494 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2023-09-21T00:40:07.706072Z] 63e6f494 [rank=0] ||     exec(code, run_globals)
[2023-09-21T00:40:07.706075Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 208, in <module>
[2023-09-21T00:40:07.706248Z] 63e6f494 [rank=0] ||     sys.exit(main(args.train_entrypoint))
[2023-09-21T00:40:07.706250Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 43, in main
[2023-09-21T00:40:07.706354Z] 63e6f494 [rank=0] ||     return _run_pytorch_trial(trial_class, info)
[2023-09-21T00:40:07.706356Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 185, in _run_pytorch_trial
[2023-09-21T00:40:07.706505Z] 63e6f494 [rank=0] ||     trainer.fit(
[2023-09-21T00:40:07.706509Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_trainer.py", line 189, in fit
[2023-09-21T00:40:07.706713Z] 63e6f494 [rank=0] ||     trial_controller.run()
[2023-09-21T00:40:07.706717Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2023-09-21T00:40:07.707046Z] 63e6f494 [rank=0] ||     self._run()
[2023-09-21T00:40:07.707049Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2023-09-21T00:40:07.707354Z] 63e6f494 [rank=0] ||     self._train_for_op(
[2023-09-21T00:40:07.707357Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 748, in _train_for_op
[2023-09-21T00:40:07.707700Z] 63e6f494 [rank=0] ||     metrics = self._aggregate_training_metrics(training_metrics)
[2023-09-21T00:40:07.707702Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 346, in _aggregate_training_metrics
[2023-09-21T00:40:07.707888Z] 63e6f494 [rank=0] ||     self.core_context.train.report_training_metrics(
[2023-09-21T00:40:07.707891Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 95, in report_training_metrics
[2023-09-21T00:40:07.708003Z] 63e6f494 [rank=0] ||     self._session.post(
[2023-09-21T00:40:07.708005Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 79, in post
[2023-09-21T00:40:07.708118Z] 63e6f494 [rank=0] ||     return self._do_request("POST", path, params, json, data, headers, timeout, False)
[2023-09-21T00:40:07.708120Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 36, in _do_request
[2023-09-21T00:40:07.708211Z] 63e6f494 [rank=0] ||     return request.do_request(
[2023-09-21T00:40:07.708213Z] 63e6f494 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/request.py", line 168, in do_request
[2023-09-21T00:40:07.708346Z] 63e6f494 [rank=0] ||     raise errors.APIException(r)
[2023-09-21T00:40:07.708388Z] 63e6f494 [rank=0] || determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
[2023-09-21T00:40:07.708391Z] 63e6f494 [rank=0] ||
[2023-09-21T00:40:07.792246Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.838419Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1000, metrics={'loss': 0.1498124748468399, 'custom_metric': 2.5})
[2023-09-21T00:40:07.907605Z] 37d23717 [rank=1] || got 96 outputs
[2023-09-21T00:40:07.911235Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1024, metrics={'loss': 0.13215720653533936, 'custom_metric': 2.5})
[2023-09-21T00:40:08.046530Z] 37d23717 [rank=1] || got 1024 outputs
[2023-09-21T00:40:08.046551Z] 37d23717 [rank=1] || INFO: [94] root: validated: 1024 records in 0.07591s (13490.0 records/s), in 256 batches (3373.0 batches/s)
[2023-09-21T00:40:08.107789Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_validation_metrics(steps_completed=1024, metrics={'loss': 0.12879968, 'custom_metric': 2.5})
[2023-09-21T00:40:08.213961Z] 37d23717 [rank=1] || INFO: [94] determined.core: Reported checkpoint to master cb890c7f-b9ef-497a-a724-46a0815204f9
[2023-09-21T00:40:09.214839Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:09.214857Z] 63e6f494 || Primary job  terminated normally, but 1 process returned
[2023-09-21T00:40:09.214861Z] 63e6f494 || a non-zero exit code. Per user-direction, the job has been aborted.
[2023-09-21T00:40:09.214864Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227703Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227712Z] 63e6f494 || mpirun detected that one or more processes exited with non-zero status, thus causing
[2023-09-21T00:40:11.227715Z] 63e6f494 || the job to be terminated. The first process to do so was:
[2023-09-21T00:40:11.227718Z] 63e6f494 ||
[2023-09-21T00:40:11.227721Z] 63e6f494 ||   Process name: [[14234,1],0]
[2023-09-21T00:40:11.227723Z] 63e6f494 ||   Exit code:    1
[2023-09-21T00:40:11.227730Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:12.264690Z] 37d23717 || INFO: resources exited successfully with a zero exit code
[2023-09-21T00:40:13.400432Z]          || ERROR: Trial 268 (Experiment 268) was terminated: allocation stopped after resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)

As you said, the program does run under horovodrun, but PyTorchTrial somehow fails to recognize that there are two workers to coordinate: in the log above, both processes report rank 0 and size 1.
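To make that mismatch easier to spot in a long log, here is a minimal sketch of a checker. It assumes the log format shown above, where the launcher tags each line with `[rank=N]` and the training process prints its own `rank` and `size`. The helper names (`self_reported_topology`, `find_mismatches`) are made up for this example and are not part of Determined.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   [2023-09-21T00:40:06.101146Z] 37d23717 [rank=1] || rank 0
# Lines like "local_rank 0" or "cross_size 1" are deliberately not matched.
LINE_RE = re.compile(
    r"\[rank=(?P<tag>\d+)\] \|\| (?P<key>rank|size) (?P<value>\d+)\s*$"
)


def self_reported_topology(log_lines: List[str]) -> Dict[int, Dict[str, int]]:
    """Map each launcher rank tag to the rank/size its process printed."""
    seen: Dict[int, Dict[str, int]] = {}
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            tag = int(match["tag"])
            seen.setdefault(tag, {})[match["key"]] = int(match["value"])
    return seen


def find_mismatches(log_lines: List[str]) -> List[str]:
    """Describe every process whose own view disagrees with the launcher's."""
    topology = self_reported_topology(log_lines)
    expected_size = len(topology)  # one launcher tag per worker process
    problems = []
    for tag, view in sorted(topology.items()):
        if view.get("rank") != tag:
            problems.append(f"[rank={tag}] reports rank {view.get('rank')}")
        if view.get("size") != expected_size:
            problems.append(
                f"[rank={tag}] reports size {view.get('size')}, "
                f"expected {expected_size}"
            )
    return problems


# A few lines copied from the log above.
SAMPLE = [
    "[2023-09-21T00:40:06.101146Z] 37d23717 [rank=1] || rank 0",
    "[2023-09-21T00:40:06.101157Z] 37d23717 [rank=1] || size 1",
    "[2023-09-21T00:40:06.101162Z] 37d23717 [rank=1] || local_rank 0",
    "[2023-09-21T00:40:07.237316Z] 63e6f494 [rank=0] || rank 0",
    "[2023-09-21T00:40:07.237324Z] 63e6f494 [rank=0] || size 1",
]

if __name__ == "__main__":
    for problem in find_mismatches(SAMPLE):
        print(problem)
```

On the sample lines this flags both processes: each believes the job has a single worker, and the process tagged `[rank=1]` believes it is rank 0.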

rb-determined-ai commented 1 year ago

Hi @taroTan1997, did you intend to close this issue? It looks like you've demonstrated a clear bug in our system, so I think it should remain open. I'll still need to do more debugging with you, as I don't know how to reproduce this on my own.

For starters, can you describe your cluster installation? Which subcommand of det deploy are you using? When it's not so early for me, I can send you a script to run in place of your current script, to try to narrow down where the bug is occurring.
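In the meantime, and not as the script mentioned above, here is a rough sketch of the kind of check that could narrow things down. It compares what the launcher exported with what Horovod reports inside each process. It assumes the job is started by Open MPI's `mpirun` (which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`) and that `horovod.torch` is importable in the image. The function names are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def launcher_view(env: Mapping[str, str]) -> Dict[str, Optional[int]]:
    """Rank and size as exported by Open MPI's mpirun, or None if unset."""

    def read(name: str) -> Optional[int]:
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        "rank": read("OMPI_COMM_WORLD_RANK"),
        "size": read("OMPI_COMM_WORLD_SIZE"),
    }


def horovod_view() -> Optional[Dict[str, int]]:
    """Rank and size as seen by Horovod, or None if Horovod is missing."""
    try:
        import horovod.torch as hvd
    except ImportError:
        return None
    hvd.init()
    return {"rank": hvd.rank(), "size": hvd.size()}


def disagreement(
    launcher: Mapping[str, Optional[int]], library: Mapping[str, int]
) -> bool:
    """True when the launcher exported a value that the library does not share."""
    return any(
        launcher[key] is not None and launcher[key] != library[key]
        for key in ("rank", "size")
    )


if __name__ == "__main__":
    launcher = launcher_view(os.environ)
    library = horovod_view()
    print("launcher:", launcher)
    print("horovod: ", library)
    if library is not None and disagreement(launcher, library):
        print("MISMATCH: Horovod did not pick up the MPI job layout")
```

If every process prints a launcher size of 2 but a Horovod size of 1, the problem is most likely in how Horovod and MPI were built into the image rather than in the trial code.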

tzj-scau commented 1 year ago

Thank you for your reply. I tried a different Docker image, rebuilt from the official determinedai/environments:cuda-11.3-pytorch-1.12-tf-2.11-gpu-mpi-0.24.0 image, and the problem seems to be solved. The cause appears to be the MPI module in the image I built before.

rb-determined-ai commented 1 year ago

Thanks for following up. If you don't need the mpi image, then you should be good, right?

I'll get somebody to look into the mpi image anyway; this behavior doesn't seem right at all.

rb-determined-ai commented 1 year ago

Apparently it is expected behavior for the mpi image not to work except when running on a slurm- or pbspro-backed cluster, which is only the case for some installations of the Determined Enterprise Edition (or whatever the right HPE name is; there are too many acronyms for me to remember).

So we are discussing how we can make it clearer how to avoid this pitfall, and we are sorry it cost you time figuring it out, but it isn't actually a bug.
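For anyone who lands here from the error message, here is a small illustration of why a misconfigured multi-worker launch surfaces as a duplicate-key error. When each process believes it is the only worker, each one reports metrics for the same `steps_completed`, and the second report collides with the first. The sketch uses an in-memory SQLite table rather than the master's real Postgres schema. The table name comes from the error message and the column names are guessed from the constraint name, so treat both as assumptions.

```python
import sqlite3
from typing import List


def make_table() -> sqlite3.Connection:
    """Create a stand-in for raw_steps with a similar unique constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE raw_steps (
            trial_id      INTEGER NOT NULL,
            total_batches INTEGER NOT NULL,
            run_id        INTEGER NOT NULL,
            metrics       TEXT,
            UNIQUE (trial_id, total_batches, run_id)
        )
        """
    )
    return conn


def insert_metrics(
    conn: sqlite3.Connection,
    trial_id: int,
    total_batches: int,
    run_id: int,
    metrics: str,
) -> bool:
    """Insert one metrics row; return False if the unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO raw_steps VALUES (?, ?, ?, ?)",
                (trial_id, total_batches, run_id, metrics),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def simulate_two_chiefs() -> List[bool]:
    """Two processes that both act as chief report the same step."""
    conn = make_table()
    results = []
    for _worker in ("37d23717", "63e6f494"):  # container IDs from the log
        results.append(insert_metrics(conn, 268, 100, 0, '{"loss": 0.908}'))
    conn.close()
    return results


if __name__ == "__main__":
    print(simulate_two_chiefs())
```

The first insert succeeds and the second is rejected. This matches the log, where the container that reported `steps_completed=100` second is the one that crashed.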

rb-determined-ai commented 1 year ago

@taroTan1997 I actually have a follow-up question: what path did you take to choose the mpi image? I'd like to understand where best to document the pitfalls.