🤔[question] duplicate key value violates unique constraint "steps_trial_id_total_batches_run_id_unique"

determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

Describe your question

I ran my own program using an image I built myself. When the run finished I hit the following error. It does not occur on a single GPU, only when I use multiple GPUs (one GPU on each of two machines):

determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
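The constraint name suggests that the master keeps at most one training-metrics row per trial, batch count, and run ID. A minimal sketch of that failure mode, using an in-memory SQLite table as a stand-in, is shown here. The table layout, the column names, the run ID value, and the `insert_metrics` helper are assumptions made for illustration, not Determined's actual schema or API.

```python
import sqlite3

# Toy stand-in for the metrics table. The columns are guessed from the
# constraint name and are NOT Determined's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_steps (
        trial_id      INTEGER NOT NULL,
        total_batches INTEGER NOT NULL,
        trial_run_id  INTEGER NOT NULL,
        metrics       TEXT,
        UNIQUE (trial_id, total_batches, trial_run_id)
    )
    """
)


def insert_metrics(trial_id, total_batches, run_id, metrics):
    """Hypothetical helper: insert one row, return False on a duplicate key."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO raw_steps VALUES (?, ?, ?, ?)",
                (trial_id, total_batches, run_id, metrics),
            )
        return True
    except sqlite3.IntegrityError:
        # SQLite's counterpart of PostgreSQL's SQLSTATE 23505.
        return False


# Two workers of the same trial both report metrics for batch 100.
first = insert_metrics(266, 100, 1, "loss=0.908")
second = insert_metrics(266, 100, 1, "loss=0.908")
print(first, second)
```

Under these assumptions the first insert succeeds and the second is rejected, which would match a situation where more than one process reports metrics for the same `steps_completed` value.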

I searched previous issues but did not find a specific solution. I am using Determined version 0.22.2, deployed with det deploy. To reproduce the error, I used a model provided in a previous issue; the configuration and code that trigger it are as follows:

name: rb-name
description: rb-pytorch-onevar
entrypoint: model_def:OneVarPytorchTrial
hyperparameters:
  global_batch_size: 4
  i:
    type: int
    minval: 1
    maxval: 4
searcher:
   name: random
   max_trials: 1
   metric: loss
   max_length:
     batches: 1024
   smaller_is_better: true

scheduling_unit: 100

max_restarts: 0

resources:
  slots_per_trial: 2

optimizations:
  average_training_metrics: false
  aggregation_frequency: 2

environment:
  image:
    # gpu: zxy1998/wsn640:pigMage-0.22.2
    # gpu: zxy1998/wsn640:pigMage-pytoch1.10-mpi
    gpu: zxy1998/wsn640:pigMage-pytoch1.10.2-ompi

from typing import Any, Dict, Sequence, Tuple, Union, cast, List, Callable

import torch
from torch import nn

from determined import pytorch

class OnesDataset(torch.utils.data.Dataset):
    def __len__(self) -> int:
        return 1024

    def __getitem__(self, index: int) -> torch.Tensor:
        return torch.Tensor([1.0])

def custom_reducer(outputs: List[Any]) -> Any:
    outputs = [o for output in outputs for o in output]
    print(f'got {len(outputs)} outputs')
    return 2.5

class OneVarPytorchTrial(pytorch.PyTorchTrial):
    def __init__(self, context: pytorch.PyTorchTrialContext) -> None:
        # context.env.experiment_config["records_per_epoch"] = 314
        self.context = context

        self.model = context.wrap_model(nn.Linear(1, 1, False))
        # initialize weights to 0
        self.model.weight.data.fill_(0)
        self.opt = context.wrap_optimizer(
            torch.optim.SGD(self.model.parameters(), lr=0.001), backward_passes_per_step=2
        )
        self.fnreducer = self.context.wrap_reducer(custom_reducer, "custom_metric")

    def train_batch(
        self, batch: pytorch.TorchData, epoch_idx: int, batch_idx: int
    ) -> Dict[str, torch.Tensor]:
        loss = torch.nn.MSELoss()(self.model(batch), batch)
        self.context.backward(loss)
        self.context.step_optimizer(self.opt)
        self.fnreducer.update(list(batch))
        return {"loss": loss}

    def evaluate_batch(self, batch: pytorch.TorchData, batch_idx: int) -> Dict[str, Any]:
        data = labels = batch
        loss = torch.nn.MSELoss()(self.model(data), labels)
        self.fnreducer.update(list(batch))
        return {"loss": loss}

    def build_training_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(
            OnesDataset(), batch_size=self.context.get_per_slot_batch_size()
        )

    def build_validation_data_loader(self) -> pytorch.DataLoader:
        return pytorch.DataLoader(
            OnesDataset(), batch_size=self.context.get_per_slot_batch_size()
        )

(base) PS A:\mage-main> det e create const.yaml . -f
Preparing files to send to master... 4.6MB and 50 files
Created experiment 266
Waiting for first trial to begin...
Following first trial with ID 266
[2023-09-20T12:12:24.146084Z]          || INFO: Scheduling Trial 266 (Experiment 266) (id: 13f0d348-8d7a-4aa3-ac08-cfeb50386346)
[2023-09-20T12:20:05.275485Z]          || INFO: Trial 266 (Experiment 266) was assigned to an agent
[2023-09-20T12:20:05.279254Z] 5abd5934 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-20T12:20:05.297539Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.382957Z] 5abd5934 || INFO: copying files to container: /run/determined
[2023-09-20T12:20:05.422194Z] 86d48338 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-20T12:20:05.437035Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.442678Z] 86d48338 || INFO: copying files to container: /run/determined
[2023-09-20T12:20:05.449843Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.457435Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.463729Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.469052Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.470159Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.476239Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.483279Z] 86d48338 || INFO: copying files to container: /
[2023-09-20T12:20:05.558921Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.622343Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.715519Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.777951Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:05.862410Z] 5abd5934 || INFO: copying files to container: /
[2023-09-20T12:20:06.213706Z] 5abd5934 || INFO: Resources for Trial 266 (Experiment 266) have started
[2023-09-20T12:20:06.213999Z] 5abd5934 ||
[2023-09-20T12:20:06.214117Z] 5abd5934 || ==========
[2023-09-20T12:20:06.214171Z] 5abd5934 || == CUDA ==
[2023-09-20T12:20:06.214276Z] 5abd5934 || ==========
[2023-09-20T12:20:06.220007Z] 5abd5934 ||
[2023-09-20T12:20:06.220047Z] 5abd5934 || CUDA Version 11.3.1
[2023-09-20T12:20:06.221602Z] 5abd5934 ||
[2023-09-20T12:20:06.221607Z] 5abd5934 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-20T12:20:06.223653Z] 5abd5934 ||
[2023-09-20T12:20:06.223659Z] 5abd5934 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-20T12:20:06.223664Z] 5abd5934 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-20T12:20:06.223669Z] 5abd5934 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-20T12:20:06.223674Z] 5abd5934 ||
[2023-09-20T12:20:06.223679Z] 5abd5934 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-20T12:20:06.243968Z] 5abd5934 ||
[2023-09-20T12:20:06.924063Z] 86d48338 || 
[2023-09-20T12:20:06.924330Z] 86d48338 || ==========
[2023-09-20T12:20:06.924338Z] 86d48338 || == CUDA ==
[2023-09-20T12:20:06.924528Z] 86d48338 || ==========
[2023-09-20T12:20:06.930949Z] 86d48338 || INFO: Resources for Trial 266 (Experiment 266) have started
[2023-09-20T12:20:06.930083Z] 86d48338 ||
[2023-09-20T12:20:06.930087Z] 86d48338 || CUDA Version 11.3.1
[2023-09-20T12:20:06.932329Z] 86d48338 ||
[2023-09-20T12:20:06.932333Z] 86d48338 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-20T12:20:06.934178Z] 86d48338 || 
[2023-09-20T12:20:06.934183Z] 86d48338 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-20T12:20:06.934186Z] 86d48338 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-20T12:20:06.934188Z] 86d48338 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-20T12:20:06.934190Z] 86d48338 || 
[2023-09-20T12:20:06.934192Z] 86d48338 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-20T12:20:06.949949Z] 86d48338 || 
[2023-09-20T12:20:09.355306Z] 5abd5934 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-20T12:20:10.112857Z] 86d48338 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-20T12:21:07.867141Z] 5abd5934 || INFO: [43] root: detected 1 gpus
[2023-09-20T12:21:07.867648Z] 5abd5934 || INFO: [43] root: Running task container on agent_id=determined-agent-0, hostname=wsn640-1 with visible GPUs ['GPU-6f1d5eb5-53d7-ebaa-6e21-113d1d122ce0']
[2023-09-20T12:21:07.917117Z] 5abd5934 || + test -f startup-hook.sh
[2023-09-20T12:21:07.917135Z] 5abd5934 || + set +x
[2023-09-20T12:21:11.965910Z] 86d48338 || INFO: [43] root: detected 1 gpus
[2023-09-20T12:21:11.966612Z] 86d48338 || INFO: [43] root: Running task container on agent_id=wsn640-2, hostname=wsn640-2 with visible GPUs ['GPU-e5946e97-eb13-29bb-d4da-01dd9101bd2d']
[2023-09-20T12:21:13.145995Z] 86d48338 || + test -f startup-hook.sh
[2023-09-20T12:21:13.146008Z] 86d48338 || + set +x
[2023-09-20T12:21:13.583150Z] 5abd5934 || INFO: [49] root: New trial runner in (container 5abd5934-6326-49ff-a857-3e73fdde181b) on agent determined-agent-0: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/root/.local/share/determined", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": "rb-pytorch-onevar", "entrypoint": "model_def:OneVarPytorchTrial", "environment": {"image": {"cpu": "determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca", "cuda": "zxy1998/wsn640:pigMage-pytoch1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695211943}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": 
{"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}} 
[2023-09-20T12:21:13.583168Z] 5abd5934 || INFO: [49] root: Validating checkpoint storage ...
"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695211943}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-20T12:21:13.601214Z] 86d48338 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-20T12:21:13.601804Z] 86d48338 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-20T12:25:38.720517Z] 5abd5934 || Warning: Permanently added '[192.168.123.52]:12350' (RSA) to the list of known hosts.
[2023-09-20T12:25:38.720527Z] 5abd5934 || 
[2023-09-20T12:25:46.015498Z] 5abd5934 [rank=0] || INFO: [198] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-20T12:25:46.156717Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.470743Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-20T12:25:46.595768Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.607934Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-20T12:25:46.720786Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.734224Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=300, metrics={'loss': 0.6084386706352234, 'custom_metric': 2.5})
[2023-09-20T12:25:46.855203Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.867799Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=400, metrics={'loss': 0.4980113208293915, 'custom_metric': 2.5})
[2023-09-20T12:25:46.978375Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:46.991437Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=500, metrics={'loss': 0.4076557159423828, 'custom_metric': 2.5})
[2023-09-20T12:25:47.102772Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.116789Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=600, metrics={'loss': 0.33367738127708435, 'custom_metric': 2.5})
[2023-09-20T12:25:47.234260Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.248523Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=700, metrics={'loss': 0.2731218636035919, 'custom_metric': 2.5})
[2023-09-20T12:25:47.273350Z] 86d48338 [rank=1] || INFO: [96] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-20T12:25:47.363643Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.381739Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=800, metrics={'loss': 0.22355258464813232, 'custom_metric': 2.5})
[2023-09-20T12:25:47.472577Z] 86d48338 [rank=1] || got 400 outputs
[2023-09-20T12:25:47.496099Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.511928Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=900, metrics={'loss': 0.18301747739315033, 'custom_metric': 2.5})
[2023-09-20T12:25:47.624951Z] 5abd5934 [rank=0] || got 400 outputs
[2023-09-20T12:25:47.637577Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=1000, metrics={'loss': 0.1498124748468399, 'custom_metric': 2.5})
[2023-09-20T12:25:47.706644Z] 5abd5934 [rank=0] || got 96 outputs
[2023-09-20T12:25:47.726261Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=1024, metrics={'loss': 0.13215720653533936, 'custom_metric': 2.5})
[2023-09-20T12:25:47.782496Z] 86d48338 [rank=1] || INFO: [96] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-20T12:25:47.821201Z] 86d48338 [rank=1] || Traceback (most recent call last):
[2023-09-20T12:25:47.821214Z] 86d48338 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-09-20T12:25:47.821219Z] 86d48338 [rank=1] ||     return _run_code(code, main_globals, None,
[2023-09-20T12:25:47.821224Z] 86d48338 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2023-09-20T12:25:47.821244Z] 86d48338 [rank=1] ||     exec(code, run_globals)      
[2023-09-20T12:25:47.821256Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 208, in <module>
[2023-09-20T12:25:47.821340Z] 86d48338 [rank=1] ||     sys.exit(main(args.train_entrypoint))
[2023-09-20T12:25:47.821344Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 43, in main     
[2023-09-20T12:25:47.821371Z] 86d48338 [rank=1] ||     return _run_pytorch_trial(trial_class, info)
[2023-09-20T12:25:47.821375Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 185, in _run_pytorch_trial
[2023-09-20T12:25:47.821441Z] 86d48338 [rank=1] ||     trainer.fit(
[2023-09-20T12:25:47.821444Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_trainer.py", line 189, in fit 
[2023-09-20T12:25:47.821549Z] 86d48338 [rank=1] ||     trial_controller.run()       
[2023-09-20T12:25:47.821555Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2023-09-20T12:25:47.821676Z] 86d48338 [rank=1] ||     self._run()
[2023-09-20T12:25:47.821685Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2023-09-20T12:25:47.821812Z] 86d48338 [rank=1] ||     self._train_for_op(
[2023-09-20T12:25:47.821814Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 748, in _train_for_op
[2023-09-20T12:25:47.821964Z] 86d48338 [rank=1] ||     metrics = self._aggregate_training_metrics(training_metrics)
[2023-09-20T12:25:47.821967Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 346, in _aggregate_training_metrics
[2023-09-20T12:25:47.822056Z] 86d48338 [rank=1] ||     self.core_context.train.report_training_metrics(
[2023-09-20T12:25:47.822059Z] 86d48338 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/
[2023-09-20T12:25:47.822241Z] 86d48338 [rank=1] || determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
[2023-09-20T12:25:47.822250Z] 86d48338 [rank=1] ||
[2023-09-20T12:25:47.865924Z] 5abd5934 [rank=0] || got 1024 outputs
[2023-09-20T12:25:47.865936Z] 5abd5934 [rank=0] || INFO: [198] root: validated: 1024 records in 0.07906s (12952.0 records/s), in 256 batches (3238.0 batches/s)
[2023-09-20T12:25:47.895921Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_validation_metrics(steps_completed=1024, metrics={'loss': 0.12879968, 'custom_metric': 2.5})
[2023-09-20T12:25:47.999638Z] 5abd5934 [rank=0] || INFO: [198] determined.core: Reported checkpoint to master 7aae78b5-019c-4a35-a189-28020fc196d9
[2023-09-20T12:25:48.844431Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:48.844440Z] 5abd5934 || Primary job  terminated normally, but 1 process returned
[2023-09-20T12:25:48.844446Z] 5abd5934 || a non-zero exit code. Per user-direction, the job has been aborted.
[2023-09-20T12:25:48.844449Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.237399Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.237405Z] 5abd5934 || mpirun detected that one or more processes exited with non-zero status, thus causing
[2023-09-20T12:25:52.237409Z] 5abd5934 || the job to be terminated. The first process to do so was:
[2023-09-20T12:25:52.237414Z] 5abd5934 ||
[2023-09-20T12:25:52.237420Z] 5abd5934 ||   Process name: [[2293,1],1]
[2023-09-20T12:25:52.237426Z] 5abd5934 ||   Exit code:    1
[2023-09-20T12:25:52.237431Z] 5abd5934 || --------------------------------------------------------------------------
[2023-09-20T12:25:52.471027Z]          || INFO: forcibly killing allocation's remaining resources (reason: resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1))
[2023-09-20T12:25:52.637204Z]          || DEBUG: Trial 266 (Experiment 266) was terminated: allocation stopped after resources failed with non-zero exit code: container failed with non-zero exit code: 137 (exit code 137)
Trial log stream ended. To reopen log stream, run: det trial logs -f 266
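In the log above, both containers report training metrics for `steps_completed=100`. A small sketch of how one might spot this automatically from a saved trial log is given here. The regular expression and the `find_duplicate_reports` helper are illustrative assumptions based only on the line format visible in this log, and the sample lines are copied from it.

```python
import re
from collections import defaultdict

# Three lines copied from the trial log above.
LOG = """\
[2023-09-20T12:25:46.470743Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-20T12:25:46.607934Z] 5abd5934 [rank=0] || INFO: [198] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-20T12:25:47.782496Z] 86d48338 [rank=1] || INFO: [96] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
"""

# Captures the container ID, the rank tag, and the reported step count.
PATTERN = re.compile(
    r"\] (\w+) \[rank=(\d+)\] \|\| .*report_training_metrics\(steps_completed=(\d+)"
)


def find_duplicate_reports(log_text):
    """Map each steps_completed value to the containers that reported it,
    keeping only values reported by more than one container."""
    reporters = defaultdict(set)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            container, _rank, steps = match.groups()
            reporters[int(steps)].add(container)
    return {s: sorted(c) for s, c in reporters.items() if len(c) > 1}


duplicates = find_duplicate_reports(LOG)
print(duplicates)
```

Any step count that appears in the result was reported by more than one container, which is the situation the unique constraint rejects.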

I'm a newbie and any tips and guidance would be greatly appreciated!

Checklist

[X] Did you search the docs for a solution?
[X] Did you search github issues to find if somebody asked this question before?

print('rank', context.distributed.rank)
print('size', context.distributed.size)
print('local_rank', context.distributed.local_rank)
print('local_size', context.distributed.local_size)
print('cross_rank', context.distributed.cross_rank)
print('cross_size', context.distributed.cross_size)
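With `slots_per_trial: 2`, one would normally expect the two workers to print a size of 2 and the ranks 0 and 1. A minimal sketch of that sanity check follows, assuming the printed values are collected by hand into dictionaries. The `check_distributed_reports` helper is hypothetical, and the sample values are the ones both containers print in the log that follows.

```python
def check_distributed_reports(reports, expected_size):
    """Return a list of problems found in the per-worker rank/size reports."""
    problems = []
    sizes = {r["size"] for r in reports}
    if sizes != {expected_size}:
        problems.append(
            f"expected size {expected_size}, workers reported {sorted(sizes)}"
        )
    ranks = sorted(r["rank"] for r in reports)
    expected_ranks = list(range(expected_size))
    if ranks != expected_ranks:
        problems.append(
            f"expected ranks {expected_ranks}, workers reported {ranks}"
        )
    return problems


# Values printed by the two containers in the follow-up run.
reports = [
    {"container": "37d23717", "rank": 0, "size": 1},
    {"container": "63e6f494", "rank": 0, "size": 1},
]
problems = check_distributed_reports(reports, expected_size=2)
for problem in problems:
    print(problem)
```

If both workers report rank 0 and size 1, each process appears to consider itself the only worker. That would be consistent with both of them reporting metrics for the same step, although the log alone does not show why the distributed context was set up that way.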

(base) PS A:\mage-main> det e create const.yaml . -f
Preparing files to send to master... 4.6MB and 50 files
Created experiment 268
Waiting for first trial to begin...
Following first trial with ID 268
[2023-09-21T00:34:25.295069Z]          || INFO: Scheduling Trial 268 (Experiment 268) (id: 0960a169-dc77-4d0c-bd28-041924182b4b)
[2023-09-21T00:34:25.589225Z]          || INFO: Trial 268 (Experiment 268) was assigned to an agent
[2023-09-21T00:34:25.594080Z] 37d23717 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.616409Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.666777Z] 63e6f494 || INFO: image already found, skipping pull phase: docker.io/zxy1998/wsn640:pigMage-pytoch1.10.2-ompi
[2023-09-21T00:34:25.683049Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.694207Z] 63e6f494 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.703628Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.712071Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.718890Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.716124Z] 37d23717 || INFO: copying files to container: /run/determined
[2023-09-21T00:34:25.725118Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.731812Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.739570Z] 63e6f494 || INFO: copying files to container: /
[2023-09-21T00:34:25.796246Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.856905Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.926394Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:25.985663Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.084340Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.167414Z] 37d23717 || INFO: copying files to container: /
[2023-09-21T00:34:26.462409Z] 37d23717 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:26.499189Z] 37d23717 ||
[2023-09-21T00:34:26.499346Z] 37d23717 || ==========
[2023-09-21T00:34:26.499381Z] 37d23717 || == CUDA ==
[2023-09-21T00:34:26.499578Z] 37d23717 || ==========
[2023-09-21T00:34:26.504997Z] 37d23717 ||
[2023-09-21T00:34:26.505109Z] 37d23717 || CUDA Version 11.3.1
[2023-09-21T00:34:26.507156Z] 37d23717 ||
[2023-09-21T00:34:26.507163Z] 37d23717 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:26.509425Z] 37d23717 ||
[2023-09-21T00:34:26.509431Z] 37d23717 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:26.509436Z] 37d23717 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:26.509441Z] 37d23717 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:26.509447Z] 37d23717 ||
[2023-09-21T00:34:26.509452Z] 37d23717 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:26.528102Z] 37d23717 ||
[2023-09-21T00:34:27.156123Z] 63e6f494 || INFO: Resources for Trial 268 (Experiment 268) have started
[2023-09-21T00:34:27.160120Z] 63e6f494 ||
[2023-09-21T00:34:27.160169Z] 63e6f494 || ==========
[2023-09-21T00:34:27.160174Z] 63e6f494 || == CUDA ==
[2023-09-21T00:34:27.160632Z] 63e6f494 || ==========
[2023-09-21T00:34:27.164706Z] 63e6f494 ||
[2023-09-21T00:34:27.164762Z] 63e6f494 || CUDA Version 11.3.1
[2023-09-21T00:34:27.166361Z] 63e6f494 ||
[2023-09-21T00:34:27.166369Z] 63e6f494 || Container image Copyright (c) 2016-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
[2023-09-21T00:34:27.167989Z] 63e6f494 ||
[2023-09-21T00:34:27.167991Z] 63e6f494 || This container image and its contents are governed by the NVIDIA Deep Learning Container License.
[2023-09-21T00:34:27.168005Z] 63e6f494 || By pulling and using the container, you accept the terms and conditions of this license:
[2023-09-21T00:34:27.168008Z] 63e6f494 || https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
[2023-09-21T00:34:27.168010Z] 63e6f494 ||
[2023-09-21T00:34:27.168012Z] 63e6f494 || A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
[2023-09-21T00:34:27.182121Z] 63e6f494 ||
[2023-09-21T00:34:29.553915Z] 37d23717 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:34:30.365417Z] 63e6f494 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[2023-09-21T00:35:27.793792Z] 37d23717 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:27.794276Z] 37d23717 || INFO: [43] root: Running task container on agent_id=determined-agent-0, hostname=wsn640-1 with visible GPUs ['GPU-6f1d5eb5-53d7-ebaa-6e21-113d1d122ce0']
[2023-09-21T00:35:27.856594Z] 37d23717 || + test -f startup-hook.sh
[2023-09-21T00:35:27.856606Z] 37d23717 || + set +x
[2023-09-21T00:35:32.646705Z] 63e6f494 || INFO: [43] root: detected 1 gpus
[2023-09-21T00:35:32.647235Z] 63e6f494 || INFO: [43] root: Running task container on agent_id=wsn640-2, hostname=wsn640-2 with visible GPUs ['GPU-e5946e97-eb13-29bb-d4da-01dd9101bd2d']
[2023-09-21T00:35:33.777840Z] 63e6f494 || + test -f startup-hook.sh
[2023-09-21T00:35:33.777863Z] 63e6f494 || + set +x
h1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.221918Z] 63e6f494 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.222941Z] 63e6f494 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:35:34.215061Z] 37d23717 || INFO: [49] root: New trial runner in (container 37d23717-f4a4-436c-9dff-8e16b9b0d233) on agent determined-agent-0: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/root/.local/share/determined", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": "rb-pytorch-onevar", "entrypoint": "model_def:OneVarPytorchTrial", "environment": {"image": {"cpu": "determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-6eceaca", "cuda": "zxy1998/wsn640:pigMage-pytoch1.10.2-ompi", "rocm": "determinedai/environments:rocm-5.0-pytorch-1.10-tf-2.7-rocm-6eceaca"}, "environment_variables": {"cpu": [], "cuda": [], "rocm": []}, "proxy_ports": [], "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": null, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"global_batch_size": {"type": "const", "val": 4}, "i": {"maxval": 4, "minval": 1, "type": "int"}}, "labels": [], "max_restarts": 0, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "rb-name", "optimizations": {"aggregation_frequency": 2, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "project": "", "records_per_epoch": 0, "reproducibility": {"experiment_seed": 1695256464}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "resource_pool": "default", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_concurrent_trials": 16, "max_length": {"batches": 1024}, "max_trials": 1, "metric": "loss", "name": "random", "smaller_is_better": true, "source_checkpoint_uuid": null, "source_trial_id": null}, "workspace": "", "slurm": {}, "pbs": {}}
[2023-09-21T00:35:34.215073Z] 37d23717 || INFO: [49] root: Validating checkpoint storage ...
[2023-09-21T00:35:34.215615Z] 37d23717 || INFO: [49] root: Launching: ['python3', '-m', 'determined.launch.horovod', '--autohorovod', '--trial', 'model_def:OneVarPytorchTrial']
[2023-09-21T00:39:58.992200Z] 63e6f494 || Warning: Permanently added '[192.168.123.83]:12350' (RSA) to the list of known hosts.
[2023-09-21T00:39:58.992215Z] 63e6f494 ||
[2023-09-21T00:40:06.101146Z] 37d23717 [rank=1] || rank 0
[2023-09-21T00:40:06.101157Z] 37d23717 [rank=1] || size 1
[2023-09-21T00:40:06.101162Z] 37d23717 [rank=1] || local_rank 0
[2023-09-21T00:40:06.101166Z] 37d23717 [rank=1] || local_size 1
[2023-09-21T00:40:06.101171Z] 37d23717 [rank=1] || cross_rank 0
[2023-09-21T00:40:06.101176Z] 37d23717 [rank=1] || cross_size 1
[2023-09-21T00:40:06.101315Z] 37d23717 [rank=1] || INFO: [94] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:06.249983Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.558826Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:06.687340Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.700000Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})
[2023-09-21T00:40:06.818668Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.832929Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=300, metrics={'loss': 0.6084386706352234, 'custom_metric': 2.5})
[2023-09-21T00:40:06.949258Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:06.963984Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=400, metrics={'loss': 0.4980113208293915, 'custom_metric': 2.5})
[2023-09-21T00:40:07.086359Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.098676Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=500, metrics={'loss': 0.4076557159423828, 'custom_metric': 2.5})
[2023-09-21T00:40:07.221265Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.235117Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=600, metrics={'loss': 0.33367738127708435, 'custom_metric': 2.5})
[2023-09-21T00:40:07.237316Z] 63e6f494 [rank=0] || rank 0
[2023-09-21T00:40:07.237324Z] 63e6f494 [rank=0] || size 1
[2023-09-21T00:40:07.237332Z] 63e6f494 [rank=0] || local_rank 0
[2023-09-21T00:40:07.237337Z] 63e6f494 [rank=0] || local_size 1
[2023-09-21T00:40:07.237342Z] 63e6f494 [rank=0] || cross_rank 0
[2023-09-21T00:40:07.237346Z] 63e6f494 [rank=0] || cross_size 1
[2023-09-21T00:40:07.237501Z] 63e6f494 [rank=0] || INFO: [197] root: Creating _PyTorchTrialController with OneVarPytorchTrial.
[2023-09-21T00:40:07.369833Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.384608Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=700, metrics={'loss': 0.2731218636035919, 'custom_metric': 2.5})
[2023-09-21T00:40:07.393932Z] 63e6f494 [rank=0] || got 400 outputs
[2023-09-21T00:40:07.504618Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.519455Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=800, metrics={'loss': 0.22355258464813232, 'custom_metric': 2.5})
[2023-09-21T00:40:07.640500Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.655329Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=900, metrics={'loss': 0.18301747739315033, 'custom_metric': 2.5})
[2023-09-21T00:40:07.685031Z] 63e6f494 [rank=0] || INFO: [197] determined.core: report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})
[2023-09-21T00:40:07.705741Z] 63e6f494 [rank=0] || Traceback (most recent call last):
[2023-09-21T00:40:07.705745Z] 63e6f494 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[2023-09-21T00:40:07.705943Z] 63e6f494 [rank=0] || return _run_code(code, main_globals, None,
[2023-09-21T00:40:07.705946Z] 63e6f494 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
[2023-09-21T00:40:07.706072Z] 63e6f494 [rank=0] || exec(code, run_globals)
[2023-09-21T00:40:07.706075Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 208, in <module>
[2023-09-21T00:40:07.706248Z] 63e6f494 [rank=0] || sys.exit(main(args.train_entrypoint))
[2023-09-21T00:40:07.706250Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 43, in main
[2023-09-21T00:40:07.706354Z] 63e6f494 [rank=0] || return _run_pytorch_trial(trial_class, info)
[2023-09-21T00:40:07.706356Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 185, in _run_pytorch_trial
[2023-09-21T00:40:07.706505Z] 63e6f494 [rank=0] || trainer.fit(
[2023-09-21T00:40:07.706509Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_trainer.py", line 189, in fit
[2023-09-21T00:40:07.706713Z] 63e6f494 [rank=0] || trial_controller.run()
[2023-09-21T00:40:07.706717Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2023-09-21T00:40:07.707046Z] 63e6f494 [rank=0] || self._run()
[2023-09-21T00:40:07.707049Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2023-09-21T00:40:07.707354Z] 63e6f494 [rank=0] || self._train_for_op(
[2023-09-21T00:40:07.707357Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 748, in _train_for_op
[2023-09-21T00:40:07.707700Z] 63e6f494 [rank=0] || metrics = self._aggregate_training_metrics(training_metrics)
[2023-09-21T00:40:07.707702Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/pytorch/_pytorch_trial.py", line 346, in _aggregate_training_metrics
[2023-09-21T00:40:07.707888Z] 63e6f494 [rank=0] || self.core_context.train.report_training_metrics(
[2023-09-21T00:40:07.707891Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/core/_train.py", line 95, in report_training_metrics
[2023-09-21T00:40:07.708003Z] 63e6f494 [rank=0] || self._session.post(
[2023-09-21T00:40:07.708005Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 79, in post
[2023-09-21T00:40:07.708118Z] 63e6f494 [rank=0] || return self._do_request("POST", path, params, json, data, headers, timeout, False)
[2023-09-21T00:40:07.708120Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/_session.py", line 36, in _do_request
[2023-09-21T00:40:07.708211Z] 63e6f494 [rank=0] || return request.do_request(
[2023-09-21T00:40:07.708213Z] 63e6f494 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/common/api/request.py", line 168, in do_request
[2023-09-21T00:40:07.708346Z] 63e6f494 [rank=0] || raise errors.APIException(r)
[2023-09-21T00:40:07.708388Z] 63e6f494 [rank=0] || determined.common.api.errors.APIException: {"error":{"code":13,"reason":"Internal","error":"failed to exec transaction (add training metrics): inserting metrics into raw_steps: ERROR: duplicate key value violates unique constraint \"steps_trial_id_total_batches_run_id_unique\" (SQLSTATE 23505)"}}
[2023-09-21T00:40:07.708391Z] 63e6f494 [rank=0] ||
[2023-09-21T00:40:07.792246Z] 37d23717 [rank=1] || got 400 outputs
[2023-09-21T00:40:07.838419Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1000, metrics={'loss': 0.1498124748468399, 'custom_metric': 2.5})
[2023-09-21T00:40:07.907605Z] 37d23717 [rank=1] || got 96 outputs
[2023-09-21T00:40:07.911235Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_training_metrics(steps_completed=1024, metrics={'loss': 0.13215720653533936, 'custom_metric': 2.5})
[2023-09-21T00:40:08.046530Z] 37d23717 [rank=1] || got 1024 outputs
[2023-09-21T00:40:08.046551Z] 37d23717 [rank=1] || INFO: [94] root: validated: 1024 records in 0.07591s (13490.0 records/s), in 256 batches (3373.0 batches/s)
[2023-09-21T00:40:08.107789Z] 37d23717 [rank=1] || INFO: [94] determined.core: report_validation_metrics(steps_completed=1024, metrics={'loss': 0.12879968, 'custom_metric': 2.5})
[2023-09-21T00:40:08.213961Z] 37d23717 [rank=1] || INFO: [94] determined.core: Reported checkpoint to master cb890c7f-b9ef-497a-a724-46a0815204f9
[2023-09-21T00:40:09.214839Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:09.214857Z] 63e6f494 || Primary job terminated normally, but 1 process returned
[2023-09-21T00:40:09.214861Z] 63e6f494 || a non-zero exit code. Per user-direction, the job has been aborted.
[2023-09-21T00:40:09.214864Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227703Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:11.227712Z] 63e6f494 || mpirun detected that one or more processes exited with non-zero status, thus causing
[2023-09-21T00:40:11.227715Z] 63e6f494 || the job to be terminated. The first process to do so was:
[2023-09-21T00:40:11.227718Z] 63e6f494 ||
[2023-09-21T00:40:11.227721Z] 63e6f494 || Process name: [[14234,1],0]
[2023-09-21T00:40:11.227723Z] 63e6f494 || Exit code: 1
[2023-09-21T00:40:11.227730Z] 63e6f494 || --------------------------------------------------------------------------
[2023-09-21T00:40:12.264690Z] 37d23717 || INFO: resources exited successfully with a zero exit code
[2023-09-21T00:40:13.400432Z] || ERROR: Trial 268 (Experiment 268) was terminated: allocation stopped after resources failed with non-zero exit code: container failed with non-zero exit code: 1 (exit code 1)
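
In the log, both containers print `rank 0` and `size 1` even though the trial requested `slots_per_trial: 2`, and both call `report_training_metrics(steps_completed=100, ...)`. The second of those calls is the one that fails with the duplicate-key error. The log alone does not establish the root cause, but the pattern would be consistent with each process treating itself as the only worker. To check a saved log for that pattern, here is a minimal Python sketch. The function name `find_duplicate_reports` and the regular expression are illustrative assumptions based only on the line format shown in this log; they are not part of the Determined API.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the log above:
#   <container id> [rank=N] || INFO: [pid] determined.core: report_training_metrics(steps_completed=K, ...
REPORT_RE = re.compile(
    r"(?P<container>[0-9a-f]{8}) \[rank=\d+\] \|\| INFO: \[\d+\] "
    r"determined\.core: report_training_metrics\(steps_completed=(?P<steps>\d+)"
)


def find_duplicate_reports(log_text):
    """Map each steps_completed value to the containers that reported it,
    keeping only values reported by more than one container."""
    reporters = defaultdict(set)
    # finditer scans the whole text, so it works whether or not the log
    # has one entry per line.
    for match in REPORT_RE.finditer(log_text):
        reporters[int(match.group("steps"))].add(match.group("container"))
    return {
        steps: sorted(containers)
        for steps, containers in reporters.items()
        if len(containers) > 1
    }


# Three entries copied from the log above.
sample_log = (
    "[2023-09-21T00:40:06.558826Z] 37d23717 [rank=1] || INFO: [94] determined.core: "
    "report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})\n"
    "[2023-09-21T00:40:06.700000Z] 37d23717 [rank=1] || INFO: [94] determined.core: "
    "report_training_metrics(steps_completed=200, metrics={'loss': 0.743328332901001, 'custom_metric': 2.5})\n"
    "[2023-09-21T00:40:07.685031Z] 63e6f494 [rank=0] || INFO: [197] determined.core: "
    "report_training_metrics(steps_completed=100, metrics={'loss': 0.9080761671066284, 'custom_metric': 2.5})\n"
)

duplicates = find_duplicate_reports(sample_log)
print(duplicates)  # {100: ['37d23717', '63e6f494']}
```

A non-empty result means more than one container posted metrics for the same step, which is the condition the unique constraint rejects.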

🤔[question] duplicate key value violates unique constraint "steps_trial_id_total_batches_run_id_unique" #7939

Checklist