Closed by ramakrishnamamidi 2 years ago
Detailed Log of the experiment
<info> [2021-12-07 10:39:05] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pod resources allocated.
<info> [2021-12-07 10:39:05] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pod resources allocated.
<info> [2021-12-07 10:39:05] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:05] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 970.970925ms
<info> [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 903.701493ms
<info> [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-init-container
<info> [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-init-container
<info> [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-init-container
<info> [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-init-container
<info> [2021-12-07 10:39:07] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:39:07] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "fluent/fluent-bit:1.6" in 1.197434698s
<info> [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-fluent-container
<info> [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-fluent-container
<info> [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:08] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "fluent/fluent-bit:1.6" in 1.174634066s
<info> [2021-12-07 10:39:08] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-fluent-container
<info> [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-fluent-container
<info> [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 891.734985ms
<info> [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-container
<info> [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-container
<info> [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 887.005158ms
<info> [2021-12-07 10:39:10] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-container
<info> [2021-12-07 10:39:10] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-container
<> [2021-12-07 10:39:11] 26606941 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:39:11] 26606941 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:11] 26606941 || + '[' -z '' ']'
<> [2021-12-07 10:39:11] 26606941 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:11] 26606941 || + /bin/which python3
<> [2021-12-07 10:39:11] 26606941 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:11] 26606941 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:11] 26606941 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:39:11] 26606941 || + '[' /root = / ']'
<> [2021-12-07 10:39:11] 20c5c597 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:39:11] 20c5c597 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:11] 20c5c597 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:11] 20c5c597 || + '[' -z '' ']'
<> [2021-12-07 10:39:11] 20c5c597 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:11] 20c5c597 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:11] 20c5c597 || + /bin/which python3
<> [2021-12-07 10:39:11] 20c5c597 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:39:11] 20c5c597 || + '[' /root = / ']'
<warning> [2021-12-07 10:39:11] 26606941 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:39:12] 26606941 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:39:12] 20c5c597 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:39:12] 20c5c597 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:39:12] 26606941 || + test -f startup-hook.sh
<> [2021-12-07 10:39:12] 26606941 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:39:12] 20c5c597 || + test -f startup-hook.sh
<> [2021-12-07 10:39:12] 20c5c597 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:39:13] 26606941 || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:39:13] 20c5c597 || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:39:13] 26606941 || INFO: New trial runner in (container 26606941-c636-4569-809f-3dcb7cbd64c0) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:39:13] 20c5c597 || INFO: New trial runner in (container 20c5c597-0b2c-4dec-a6bb-a9ba5acb49f5) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:39:15] 26606941 || 2021-12-07 10:39:15.271627: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:15] 26606941 || 2021-12-07 10:39:15.271676: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:39:17] 26606941 || 2021-12-07 10:39:17.424460: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:17] 26606941 || 2021-12-07 10:39:17.424512: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20,264:INFO [175]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20,310:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20.377771: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20.377805: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20.450750: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20.450790: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23,017:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017584: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017814: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017837: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017875: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-sol): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.018851: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23,020:INFO [175]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.020706: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021048: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021026: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021083: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-mer): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.021637: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.022068: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:39:23] 26606941 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.023287: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:39:23] 26606941 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:39:23] 26606941 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:39:23] 26606941 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:39:23] 26606941 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:39:23] 20c5c597 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:39:23] 26606941 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:39:24] 20c5c597 [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:39:24] 20c5c597 [rank=1] || Wraped model in context
<> [2021-12-07 10:39:24] 20c5c597 [rank=1] || Model compiled
<> [2021-12-07 10:39:24] 20c5c597 [rank=1] || 2021-12-07 10:39:24,347:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:39:24] 26606941 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:39:24] 26606941 [rank=0] || Wraped model in context
<> [2021-12-07 10:39:24] 26606941 [rank=0] || Model compiled
<> [2021-12-07 10:39:24] 26606941 [rank=0] || 2021-12-07 10:39:24,746:WARNING [175]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:39:24] 26606941 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || 2021-12-07 10:39:26.430134: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || 2021-12-07 10:39:26.434228: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:39:26] 26606941 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:39:26] 26606941 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:39:26] 26606941 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:39:26] 26606941 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:39:26] 26606941 [rank=0] || controller.run()
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:39:26] 26606941 [rank=0] || self._launch_fit()
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:39:26] 26606941 [rank=0] || self.model.fit(
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:39:26] 26606941 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:39:26] 26606941 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:39:26] 26606941 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:39:26] 26606941 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:39:26] 26606941 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:39:26] 26606941 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:39:26] 26606941 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:39:26] 26606941 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:39:26] 26606941 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:39:26] 26606941 [rank=0] || Function call stack:
<> [2021-12-07 10:39:26] 26606941 [rank=0] ||
<> [2021-12-07 10:39:26] 26606941 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:39:26] 26606941 [rank=0] || train_function
<> [2021-12-07 10:39:26] 26606941 [rank=0] ||
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || 2021-12-07 10:39:26.543108: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || 2021-12-07 10:39:26.547507: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || controller.run()
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || self._launch_fit()
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || self.model.fit(
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || train_function
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] ||
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] || Function call stack:
<> [2021-12-07 10:39:26] 20c5c597 [rank=1] ||
<> [2021-12-07 10:39:27] 26606941 || Process 0 exit with status code 1.
<> [2021-12-07 10:39:27] 26606941 || Terminating remaining workers after failure of Process 0.
<> [2021-12-07 10:39:27] 26606941 || Traceback (most recent call last):
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:39:27] 26606941 || sys.exit(run_commandline())
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:39:27] 26606941 || _run(args)
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:39:27] 26606941 || return _run_static(args)
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:39:27] 26606941 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:39:27] 26606941 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:39:27] 26606941 || gloo_run()
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:39:27] 26606941 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:39:27] 26606941 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:39:27] 26606941 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:39:27] 26606941 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:39:27] 26606941 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:39:27] 26606941 || Exit code: 1
<> [2021-12-07 10:39:27] 26606941 || Process name: 0
<info> [2021-12-07 10:39:27] 26606941 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:39:44] 20c5c597 || INFO: container failed with non-zero exit code: (exit code 137)
<info> [2021-12-07 10:39:45] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pod resources allocated.
<info> [2021-12-07 10:39:45] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pod resources allocated.
<info> [2021-12-07 10:39:46] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:46] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 872.456065ms
<info> [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-init-container
<info> [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 888.796743ms
<info> [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-init-container
<info> [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-init-container
<info> [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-init-container
<info> [2021-12-07 10:39:48] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:39:48] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "fluent/fluent-bit:1.6" in 1.165672526s
<info> [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-fluent-container
<info> [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-fluent-container
<info> [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "fluent/fluent-bit:1.6" in 1.163258809s
<info> [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-fluent-container
<info> [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-fluent-container
<info> [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 885.112166ms
<info> [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-container
<info> [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-container
<info> [2021-12-07 10:39:50] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 1.081632232s
<info> [2021-12-07 10:39:51] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-container
<info> [2021-12-07 10:39:51] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-container
<> [2021-12-07 10:39:52] 30a87832 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:39:52] 30a87832 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:52] 30a87832 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:52] 30a87832 || + '[' -z '' ']'
<> [2021-12-07 10:39:52] 30a87832 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:52] 30a87832 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:52] 30a87832 || + /bin/which python3
<> [2021-12-07 10:39:52] 30a87832 || + '[' /root = / ']'
<> [2021-12-07 10:39:52] 30a87832 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:39:52] 152337f9 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:39:52] 152337f9 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:52] 152337f9 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:39:52] 152337f9 || + '[' -z '' ']'
<> [2021-12-07 10:39:52] 152337f9 || + /bin/which python3
<> [2021-12-07 10:39:52] 152337f9 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:52] 152337f9 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:39:52] 152337f9 || + '[' /root = / ']'
<> [2021-12-07 10:39:52] 152337f9 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<warning> [2021-12-07 10:39:52] 30a87832 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:39:52] 30a87832 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:39:53] 30a87832 || + test -f startup-hook.sh
<> [2021-12-07 10:39:53] 30a87832 || + python3 -m determined.exec.prep_container --rendezvous
<warning> [2021-12-07 10:39:53] 152337f9 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:39:53] 152337f9 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:39:53] 152337f9 || + test -f startup-hook.sh
<> [2021-12-07 10:39:53] 152337f9 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:39:54] 152337f9 || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:39:54] 30a87832 || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:39:54] 152337f9 || INFO: New trial runner in (container 152337f9-75e9-4d7b-84f1-f44b12f7309d) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:39:54] 30a87832 || INFO: New trial runner in (container 30a87832-9e8d-4020-8583-040e84a229a1) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:39:56] 30a87832 || 2021-12-07 10:39:56.201596: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:56] 30a87832 || 2021-12-07 10:39:56.201642: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:39:58] 30a87832 || 2021-12-07 10:39:58.366589: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:39:58] 30a87832 || 2021-12-07 10:39:58.366634: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01,231:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01,241:INFO [175]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01.350328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01.350366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01.361846: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01.361881: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03,889:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889393: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889658: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889637: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889689: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-sim): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.890611: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.892165: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03,906:INFO [175]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907125: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907483: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-mod): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907423: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907447: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.908658: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.910072: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:03] 152337f9 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:04] 152337f9 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:04] 152337f9 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:40:04] 152337f9 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:40:04] 30a87832 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:04] 30a87832 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:04] 30a87832 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:40:04] 30a87832 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:40:04] 152337f9 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:40:04] 30a87832 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:40:05] 30a87832 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:40:05] 30a87832 [rank=0] || Wraped model in context
<> [2021-12-07 10:40:05] 30a87832 [rank=0] || Model compiled
<> [2021-12-07 10:40:05] 30a87832 [rank=0] || 2021-12-07 10:40:05,331:WARNING [175]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:40:05] 30a87832 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:40:05] 152337f9 [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:40:05] 152337f9 [rank=1] || Wraped model in context
<> [2021-12-07 10:40:05] 152337f9 [rank=1] || Model compiled
<> [2021-12-07 10:40:05] 152337f9 [rank=1] || 2021-12-07 10:40:05,461:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || 2021-12-07 10:40:07.045644: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || 2021-12-07 10:40:07.049928: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || controller.run()
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || self._launch_fit()
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || self.model.fit(
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:40:07] 30a87832 [rank=0] ||
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || Function call stack:
<> [2021-12-07 10:40:07] 30a87832 [rank=0] || train_function
<> [2021-12-07 10:40:07] 30a87832 [rank=0] ||
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || 2021-12-07 10:40:07.163460: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || 2021-12-07 10:40:07.167464: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || controller.run()
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || self._launch_fit()
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || self.model.fit(
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:40:07] 152337f9 [rank=1] ||
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || Function call stack:
<> [2021-12-07 10:40:07] 152337f9 [rank=1] || train_function
<> [2021-12-07 10:40:07] 152337f9 [rank=1] ||
<> [2021-12-07 10:40:07] 30a87832 || Process 0 exit with status code 1.
<> [2021-12-07 10:40:07] 30a87832 || Terminating remaining workers after failure of Process 0.
<> [2021-12-07 10:40:07] 30a87832 || Traceback (most recent call last):
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:40:07] 30a87832 || sys.exit(run_commandline())
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:40:07] 30a87832 || _run(args)
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:40:07] 30a87832 || return _run_static(args)
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:40:07] 30a87832 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:40:07] 30a87832 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:40:07] 30a87832 || gloo_run()
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:40:07] 30a87832 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:40:07] 30a87832 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:40:07] 30a87832 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:40:07] 30a87832 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:40:07] 30a87832 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:40:07] 30a87832 || Process name: 0
<> [2021-12-07 10:40:07] 30a87832 || Exit code: 1
<info> [2021-12-07 10:40:08] 30a87832 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:40:25] 152337f9 || INFO: container failed with non-zero exit code: (exit code 137)
<info> [2021-12-07 10:40:26] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pod resources allocated.
<info> [2021-12-07 10:40:26] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pod resources allocated.
<info> [2021-12-07 10:40:27] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:40:27] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 895.4494ms
<info> [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 876.087326ms
<info> [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-init-container
<info> [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-init-container
<info> [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-init-container
<info> [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-init-container
<info> [2021-12-07 10:40:29] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:40:29] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "fluent/fluent-bit:1.6" in 1.149313965s
<info> [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-fluent-container
<info> [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-fluent-container
<info> [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "fluent/fluent-bit:1.6" in 1.167861338s
<info> [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-fluent-container
<info> [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-fluent-container
<info> [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 881.176011ms
<info> [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-container
<info> [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-container
<info> [2021-12-07 10:40:31] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.78712ms
<info> [2021-12-07 10:40:31] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-container
<info> [2021-12-07 10:40:32] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-container
<> [2021-12-07 10:40:33] c7698632 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:40:33] c7698632 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:40:33] c7698632 || + /bin/which python3
<> [2021-12-07 10:40:33] c7698632 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:40:33] c7698632 || + '[' -z '' ']'
<> [2021-12-07 10:40:33] c7698632 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:40:33] c7698632 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:40:33] c7698632 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:40:33] c7698632 || + '[' /root = / ']'
<> [2021-12-07 10:40:33] 514a56c5 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:40:33] 514a56c5 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:40:33] 514a56c5 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:40:33] 514a56c5 || + '[' -z '' ']'
<> [2021-12-07 10:40:33] 514a56c5 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:40:33] 514a56c5 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:40:33] 514a56c5 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:40:33] 514a56c5 || + '[' /root = / ']'
<> [2021-12-07 10:40:33] 514a56c5 || + /bin/which python3
<warning> [2021-12-07 10:40:33] c7698632 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:40:33] c7698632 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:40:34] 514a56c5 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:40:34] c7698632 || + test -f startup-hook.sh
<> [2021-12-07 10:40:34] c7698632 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:40:34] 514a56c5 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:40:34] 514a56c5 || + test -f startup-hook.sh
<> [2021-12-07 10:40:34] 514a56c5 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:40:34] 514a56c5 || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:40:34] c7698632 || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:40:35] c7698632 || INFO: New trial runner in (container c7698632-a650-47bf-956b-f2faf2563c4e) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:40:35] 514a56c5 || INFO: New trial runner in (container 514a56c5-ee6c-4fd7-8e00-2444dc5d6b30) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:40:36] 514a56c5 || 2021-12-07 10:40:36.940567: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:36] 514a56c5 || 2021-12-07 10:40:36.940617: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:39] 514a56c5 || 2021-12-07 10:40:39.175804: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:39] 514a56c5 || 2021-12-07 10:40:39.175851: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42,133:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42.244000: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42.244030: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42,313:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42.457787: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42.457823: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45,061:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062263: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062487: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062507: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062539: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-tou): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.063489: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:40:45] c7698632 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45,063:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.063814: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064046: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064067: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064098: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-tho): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.065006: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.066139: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.068060: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:40:45] c7698632 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:45] c7698632 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:45] c7698632 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:40:45] c7698632 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:40:45] c7698632 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:40:45] 514a56c5 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:40:46] c7698632 [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:40:46] c7698632 [rank=1] || Wraped model in context
<> [2021-12-07 10:40:46] c7698632 [rank=1] || Model compiled
<> [2021-12-07 10:40:46] c7698632 [rank=1] || 2021-12-07 10:40:46,543:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:40:46] 514a56c5 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:40:46] 514a56c5 [rank=0] || Wraped model in context
<> [2021-12-07 10:40:46] 514a56c5 [rank=0] || Model compiled
<> [2021-12-07 10:40:46] 514a56c5 [rank=0] || 2021-12-07 10:40:46,665:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:40:46] 514a56c5 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || 2021-12-07 10:40:48.421840: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || 2021-12-07 10:40:48.425828: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || 2021-12-07 10:40:48.439657: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || 2021-12-07 10:40:48.444046: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:40:48] c7698632 [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:40:48] c7698632 [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:40:48] c7698632 [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:40:48] c7698632 [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:40:48] c7698632 [rank=1] || controller.run()
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:40:48] c7698632 [rank=1] || self._launch_fit()
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:40:48] c7698632 [rank=1] || self.model.fit(
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:40:48] c7698632 [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:40:48] c7698632 [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:40:48] c7698632 [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:40:48] c7698632 [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:40:48] c7698632 [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:40:48] c7698632 [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:40:48] c7698632 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:40:48] c7698632 [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:40:48] c7698632 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:40:48] c7698632 [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:40:48] c7698632 [rank=1] ||
<> [2021-12-07 10:40:48] c7698632 [rank=1] || Function call stack:
<> [2021-12-07 10:40:48] c7698632 [rank=1] || train_function
<> [2021-12-07 10:40:48] c7698632 [rank=1] ||
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || controller.run()
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || self._launch_fit()
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || self.model.fit(
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] ||
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || Function call stack:
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] || train_function
<> [2021-12-07 10:40:48] 514a56c5 [rank=0] ||
<> [2021-12-07 10:40:49] 514a56c5 || Process 1 exit with status code 1.
<> [2021-12-07 10:40:49] 514a56c5 || Terminating remaining workers after failure of Process 1.
<> [2021-12-07 10:40:49] 514a56c5 || [0]<stderr>:Terminated
<> [2021-12-07 10:40:49] 514a56c5 || Process 0 exit with status code 143.
<> [2021-12-07 10:40:49] 514a56c5 || Traceback (most recent call last):
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:40:49] 514a56c5 || sys.exit(run_commandline())
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:40:49] 514a56c5 || _run(args)
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:40:49] 514a56c5 || return _run_static(args)
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:40:49] 514a56c5 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:40:49] 514a56c5 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:40:49] 514a56c5 || gloo_run()
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:40:49] 514a56c5 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:40:49] 514a56c5 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:40:49] 514a56c5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:40:49] 514a56c5 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:40:49] 514a56c5 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:40:49] 514a56c5 || Process name: 1
<> [2021-12-07 10:40:49] 514a56c5 || Exit code: 1
<info> [2021-12-07 10:40:51] 514a56c5 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:41:06] c7698632 || INFO: container failed with non-zero exit code: (exit code 137)
<info> [2021-12-07 10:41:08] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pod resources allocated.
<info> [2021-12-07 10:41:08] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pod resources allocated.
<info> [2021-12-07 10:41:09] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:09] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 874.693887ms
<info> [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-init-container
<info> [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.053052ms
<info> [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-init-container
<info> [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-init-container
<info> [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-init-container
<info> [2021-12-07 10:41:11] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:41:11] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "fluent/fluent-bit:1.6" in 1.159570999s
<info> [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-fluent-container
<info> [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-fluent-container
<info> [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:12] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "fluent/fluent-bit:1.6" in 1.17294199s
<info> [2021-12-07 10:41:12] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-fluent-container
<info> [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-fluent-container
<info> [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 877.471658ms
<info> [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-container
<info> [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-container
<info> [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 907.551689ms
<info> [2021-12-07 10:41:14] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-container
<info> [2021-12-07 10:41:14] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-container
<> [2021-12-07 10:41:15] d173154e || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:41:15] d173154e || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:41:15] d173154e || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:41:15] d173154e || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:41:15] d173154e || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:41:15] d173154e || + /bin/which python3
<> [2021-12-07 10:41:15] d173154e || + '[' -z '' ']'
<> [2021-12-07 10:41:15] d173154e || + '[' /root = / ']'
<> [2021-12-07 10:41:15] d173154e || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:41:15] 5fabfb53 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:41:15] 5fabfb53 || + '[' -z '' ']'
<> [2021-12-07 10:41:15] 5fabfb53 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:41:15] 5fabfb53 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:41:15] 5fabfb53 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:41:15] 5fabfb53 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:41:15] 5fabfb53 || + '[' /root = / ']'
<> [2021-12-07 10:41:15] 5fabfb53 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:41:15] 5fabfb53 || + /bin/which python3
<warning> [2021-12-07 10:41:15] d173154e || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:41:16] d173154e || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:41:16] d173154e || + test -f startup-hook.sh
<> [2021-12-07 10:41:16] d173154e || + python3 -m determined.exec.prep_container --rendezvous
<warning> [2021-12-07 10:41:16] 5fabfb53 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:41:16] 5fabfb53 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:41:17] 5fabfb53 || + test -f startup-hook.sh
<> [2021-12-07 10:41:17] 5fabfb53 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:41:17] d173154e || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:41:17] 5fabfb53 || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:41:17] d173154e || INFO: New trial runner in (container d173154e-3575-4d2e-8a49-d50c77092e5a) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:41:17] 5fabfb53 || INFO: New trial runner in (container 5fabfb53-a425-43b6-8bf5-3ba30a33917d) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:41:19] 5fabfb53 || 2021-12-07 10:41:19.622471: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:19] 5fabfb53 || 2021-12-07 10:41:19.622532: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:41:21] 5fabfb53 || 2021-12-07 10:41:21.869342: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:21] 5fabfb53 || 2021-12-07 10:41:21.869397: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24,869:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24.998598: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24.998635: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25,019:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25.138037: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25.138069: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27,732:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733351: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733638: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733663: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733701: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sha): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.734901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.736524: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27,740:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740664: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740919: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740947: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740980: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-fra): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.742130: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:41:27] d173154e [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.743424: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:41:27] d173154e [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:41:27] d173154e [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:41:27] d173154e [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:41:27] d173154e [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:41:28] d173154e [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:41:28] 5fabfb53 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:41:29] d173154e [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:41:29] d173154e [rank=1] || Wraped model in context
<> [2021-12-07 10:41:29] d173154e [rank=1] || Model compiled
<> [2021-12-07 10:41:29] d173154e [rank=1] || 2021-12-07 10:41:29,300:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Wraped model in context
<> [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Model compiled
<> [2021-12-07 10:41:29] 5fabfb53 [rank=0] || 2021-12-07 10:41:29,375:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:41:29] 5fabfb53 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:41:31] d173154e [rank=1] || 2021-12-07 10:41:31.057916: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:41:31] d173154e [rank=1] || 2021-12-07 10:41:31.062159: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:41:31] d173154e [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:41:31] d173154e [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:41:31] d173154e [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:41:31] d173154e [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:41:31] d173154e [rank=1] || controller.run()
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:41:31] d173154e [rank=1] || self._launch_fit()
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:41:31] d173154e [rank=1] || self.model.fit(
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:41:31] d173154e [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:41:31] d173154e [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:41:31] d173154e [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:41:31] d173154e [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:41:31] d173154e [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:41:31] d173154e [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:41:31] d173154e [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:41:31] d173154e [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:41:31] d173154e [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:41:31] d173154e [rank=1] ||
<> [2021-12-07 10:41:31] d173154e [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:41:31] d173154e [rank=1] || Function call stack:
<> [2021-12-07 10:41:31] d173154e [rank=1] || train_function
<> [2021-12-07 10:41:31] d173154e [rank=1] ||
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 2021-12-07 10:41:31.342307: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 2021-12-07 10:41:31.347290: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || controller.run()
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || self._launch_fit()
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || self.model.fit(
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || Function call stack:
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] || train_function
<> [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||
<> [2021-12-07 10:41:32] 5fabfb53 || Process 1 exit with status code 1.
<> [2021-12-07 10:41:32] 5fabfb53 || Terminating remaining workers after failure of Process 1.
<> [2021-12-07 10:41:32] 5fabfb53 || [0]<stderr>:Terminated
<> [2021-12-07 10:41:32] 5fabfb53 || Process 0 exit with status code 143.
<> [2021-12-07 10:41:32] 5fabfb53 || Traceback (most recent call last):
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:41:32] 5fabfb53 || sys.exit(run_commandline())
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:41:32] 5fabfb53 || _run(args)
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:41:32] 5fabfb53 || return _run_static(args)
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:41:32] 5fabfb53 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:41:32] 5fabfb53 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:41:32] 5fabfb53 || gloo_run()
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:41:32] 5fabfb53 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:41:32] 5fabfb53 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:41:32] 5fabfb53 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:41:32] 5fabfb53 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:41:32] 5fabfb53 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:41:32] 5fabfb53 || Process name: 1
<> [2021-12-07 10:41:32] 5fabfb53 || Exit code: 1
<info> [2021-12-07 10:41:36] 5fabfb53 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:41:53] d173154e || INFO: container failed with non-zero exit code: (exit code 137)
<info> [2021-12-07 10:41:54] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pod resources allocated.
<info> [2021-12-07 10:41:54] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pod resources allocated.
<info> [2021-12-07 10:41:55] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:55] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 901.4561ms
<info> [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.545699ms
<info> [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-init-container
<info> [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-init-container
<info> [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-init-container
<info> [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-init-container
<info> [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:41:57] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "fluent/fluent-bit:1.6" in 1.160234547s
<info> [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-fluent-container
<info> [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-fluent-container
<info> [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "fluent/fluent-bit:1.6" in 1.178977281s
<info> [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-fluent-container
<info> [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-fluent-container
<info> [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 873.101702ms
<info> [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-container
<info> [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-container
<info> [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 886.217585ms
<info> [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-container
<info> [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-container
<> [2021-12-07 10:42:00] 6ed813d5 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:42:00] 6ed813d5 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:00] 6ed813d5 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:00] 6ed813d5 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:00] 6ed813d5 || + '[' -z '' ']'
<> [2021-12-07 10:42:00] 6ed813d5 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:00] 6ed813d5 || + /bin/which python3
<> [2021-12-07 10:42:00] 6ed813d5 || + '[' /root = / ']'
<> [2021-12-07 10:42:00] 6ed813d5 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:42:01] 1d0cefc1 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:42:01] 1d0cefc1 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:01] 1d0cefc1 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:01] 1d0cefc1 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:01] 1d0cefc1 || + '[' -z '' ']'
<> [2021-12-07 10:42:01] 1d0cefc1 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:01] 1d0cefc1 || + /bin/which python3
<> [2021-12-07 10:42:01] 1d0cefc1 || + '[' /root = / ']'
<> [2021-12-07 10:42:01] 1d0cefc1 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<warning> [2021-12-07 10:42:01] 6ed813d5 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:42:01] 6ed813d5 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:42:01] 1d0cefc1 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:42:02] 6ed813d5 || + test -f startup-hook.sh
<> [2021-12-07 10:42:02] 6ed813d5 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:42:02] 1d0cefc1 || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:42:02] 1d0cefc1 || + test -f startup-hook.sh
<> [2021-12-07 10:42:02] 1d0cefc1 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:42:02] 1d0cefc1 || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:42:02] 6ed813d5 || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:42:02] 1d0cefc1 || INFO: New trial runner in (container 1d0cefc1-5ce1-4fec-9171-4d6addc9d458) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:42:02] 6ed813d5 || INFO: New trial runner in (container 6ed813d5-5d25-4ce9-a9fc-d3edb66de7a1) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:42:04] 6ed813d5 || 2021-12-07 10:42:04.677820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:04] 6ed813d5 || 2021-12-07 10:42:04.677866: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:06] 6ed813d5 || 2021-12-07 10:42:06.730998: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:06] 6ed813d5 || 2021-12-07 10:42:06.731044: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09,844:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:42:09] 1d0cefc1 [rank=1] || 2021-12-07 10:42:09,940:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09.952059: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09.952094: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:10] 1d0cefc1 [rank=1] || 2021-12-07 10:42:10.055760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:10] 1d0cefc1 [rank=1] || 2021-12-07 10:42:10.055787: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12,378:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379302: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379527: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379547: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379580: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-cap): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.380470: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.383423: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12,385:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385478: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385740: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385760: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385787: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-max): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.386672: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.388711: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:42:12] 6ed813d5 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Wraped model in context
<> [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Model compiled
<> [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Wraped model in context
<> [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Model compiled
<> [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || 2021-12-07 10:42:13,946:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:42:13] 6ed813d5 [rank=0] || 2021-12-07 10:42:13,964:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:42:13] 6ed813d5 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 2021-12-07 10:42:15.575952: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 2021-12-07 10:42:15.579902: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || controller.run()
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || self._launch_fit()
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || self.model.fit(
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || train_function
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || Function call stack:
<> [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 2021-12-07 10:42:15.741891: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 2021-12-07 10:42:15.745930: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || controller.run()
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || self._launch_fit()
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || self.model.fit(
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || Function call stack:
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] || train_function
<> [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||
<> [2021-12-07 10:42:16] 6ed813d5 || Process 1 exit with status code 1.
<> [2021-12-07 10:42:16] 6ed813d5 || Terminating remaining workers after failure of Process 1.
<> [2021-12-07 10:42:16] 6ed813d5 || [0]<stderr>:Terminated
<> [2021-12-07 10:42:16] 6ed813d5 || Process 0 exit with status code 143.
<> [2021-12-07 10:42:16] 6ed813d5 || Traceback (most recent call last):
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:42:16] 6ed813d5 || sys.exit(run_commandline())
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:42:16] 6ed813d5 || _run(args)
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:42:16] 6ed813d5 || return _run_static(args)
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:42:16] 6ed813d5 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:42:16] 6ed813d5 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:42:16] 6ed813d5 || gloo_run()
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:42:16] 6ed813d5 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:42:16] 6ed813d5 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:42:16] 6ed813d5 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:42:16] 6ed813d5 || Process name: 1
<> [2021-12-07 10:42:16] 6ed813d5 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:42:16] 6ed813d5 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:42:16] 6ed813d5 || Exit code: 1
<info> [2021-12-07 10:42:20] 6ed813d5 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:42:36] 1d0cefc1 || INFO: rpc error: code = Unknown desc = Error: No such container: 8c37c94994ab83ed1ae13fbba12b7ec578361f0db5da1a5ea49e91dd205bbc4b
<info> [2021-12-07 10:42:37] 1d0cefc1 || INFO: container failed with non-zero exit code: (exit code 137)
<info> [2021-12-07 10:42:38] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pod resources allocated.
<info> [2021-12-07 10:42:38] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pod resources allocated.
<info> [2021-12-07 10:42:39] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:42:39] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 876.439508ms
<info> [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 871.463514ms
<info> [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-init-container
<info> [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-init-container
<info> [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-init-container
<info> [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-init-container
<info> [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:42:41] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "fluent/fluent-bit:1.6"
<info> [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "fluent/fluent-bit:1.6" in 1.177963649s
<info> [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-fluent-container
<info> [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-fluent-container
<info> [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "fluent/fluent-bit:1.6" in 1.189195924s
<info> [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-fluent-container
<info> [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-fluent-container
<info> [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info> [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 884.846614ms
<info> [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-container
<info> [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-container
<info> [2021-12-07 10:42:43] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 900.212087ms
<info> [2021-12-07 10:42:43] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-container
<info> [2021-12-07 10:42:44] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-container
<> [2021-12-07 10:42:44] 087c0ee7 || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:42:44] 087c0ee7 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:44] 087c0ee7 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:44] 087c0ee7 || + '[' -z '' ']'
<> [2021-12-07 10:42:44] 087c0ee7 || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:44] 087c0ee7 || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:44] 087c0ee7 || + '[' /root = / ']'
<> [2021-12-07 10:42:44] 087c0ee7 || + /bin/which python3
<> [2021-12-07 10:42:44] 087c0ee7 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:42:45] ef45fdce || + STARTUP_HOOK=startup-hook.sh
<> [2021-12-07 10:42:45] ef45fdce || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:45] ef45fdce || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<> [2021-12-07 10:42:45] ef45fdce || + DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:45] ef45fdce || + '[' -z '' ']'
<> [2021-12-07 10:42:45] ef45fdce || + export DET_PYTHON_EXECUTABLE=python3
<> [2021-12-07 10:42:45] ef45fdce || + /bin/which python3
<> [2021-12-07 10:42:45] ef45fdce || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<> [2021-12-07 10:42:45] ef45fdce || + '[' /root = / ']'
<warning> [2021-12-07 10:42:45] 087c0ee7 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:42:45] 087c0ee7 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:42:45] ef45fdce || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<> [2021-12-07 10:42:46] 087c0ee7 || + test -f startup-hook.sh
<> [2021-12-07 10:42:46] 087c0ee7 || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:42:46] ef45fdce || + python3 -m determined.exec.prep_container --trial --resources
<> [2021-12-07 10:42:46] ef45fdce || + test -f startup-hook.sh
<> [2021-12-07 10:42:46] ef45fdce || + python3 -m determined.exec.prep_container --rendezvous
<> [2021-12-07 10:42:46] 087c0ee7 || + exec python3 -m determined.exec.launch_autohorovod
<> [2021-12-07 10:42:46] ef45fdce || + exec python3 -m determined.exec.launch_autohorovod
<info> [2021-12-07 10:42:46] 087c0ee7 || INFO: New trial runner in (container 087c0ee7-97fa-47da-a4d2-70db4437a561) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info> [2021-12-07 10:42:46] ef45fdce || INFO: New trial runner in (container ef45fdce-eb10-4fc0-991b-8df1e20d41d4) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 
100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<> [2021-12-07 10:42:49] 087c0ee7 || 2021-12-07 10:42:49.029033: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:49] 087c0ee7 || 2021-12-07 10:42:49.029086: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:51] 087c0ee7 || 2021-12-07 10:42:51.287981: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:51] 087c0ee7 || 2021-12-07 10:42:51.288043: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54,280:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54,343:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<> [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54.393298: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54.393329: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54.453085: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54.453121: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56,847:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847483: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847812: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847836: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847870: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ada): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.848859: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.850466: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56,868:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.868944: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869343: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869393: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869467: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-rea): /proc/driver/nvidia/version does not exist
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.871373: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<> [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.874729: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:42:56] ef45fdce [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<> [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<> [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<> [2021-12-07 10:42:57] ef45fdce [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:42:57] 087c0ee7 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<> [2021-12-07 10:42:58] ef45fdce [rank=1] || Sequential with layers obj made
<> [2021-12-07 10:42:58] ef45fdce [rank=1] || Wraped model in context
<> [2021-12-07 10:42:58] ef45fdce [rank=1] || Model compiled
<> [2021-12-07 10:42:58] ef45fdce [rank=1] || 2021-12-07 10:42:58,296:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Sequential with layers obj made
<> [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Wraped model in context
<> [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Model compiled
<> [2021-12-07 10:42:58] 087c0ee7 [rank=0] || 2021-12-07 10:42:58,413:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<> [2021-12-07 10:42:58] 087c0ee7 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || 2021-12-07 10:43:00.080249: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || 2021-12-07 10:43:00.084415: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || Traceback (most recent call last):
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || exec(code, run_globals)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || controller.run()
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || self._launch_fit()
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || self.model.fit(
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || result = self._call(*args, **kwds)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || return graph_function._call_flat(
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || outputs = execute.execute(
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || train_function
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:43:00] ef45fdce [rank=1] ||
<> [2021-12-07 10:43:00] ef45fdce [rank=1] || Function call stack:
<> [2021-12-07 10:43:00] ef45fdce [rank=1] ||
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 2021-12-07 10:43:00.214831: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 2021-12-07 10:43:00.219214: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || Traceback (most recent call last):
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || exec(code, run_globals)
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || return _run_code(code, main_globals, None,
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || sys.exit(main(args.chief_ip))
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || controller.run()
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || self._launch_fit()
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || self.model.fit(
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || tmp_logs = self.train_function(iterator)
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || result = self._call(*args, **kwds)
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || return self._stateless_fn(*args, **kwds)
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || return graph_function._call_flat(
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || return self._build_call_outputs(self._inference_function.call(
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || outputs = execute.execute(
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || Function call stack:
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||
<> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || train_function
<> [2021-12-07 10:43:01] 087c0ee7 || Process 1 exit with status code 1.
<> [2021-12-07 10:43:01] 087c0ee7 || Terminating remaining workers after failure of Process 1.
<> [2021-12-07 10:43:01] 087c0ee7 || [0]<stderr>:Terminated
<> [2021-12-07 10:43:01] 087c0ee7 || Process 0 exit with status code 143.
<> [2021-12-07 10:43:01] 087c0ee7 || Traceback (most recent call last):
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/bin/horovodrun", line 8, in <module>
<> [2021-12-07 10:43:01] 087c0ee7 || sys.exit(run_commandline())
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<> [2021-12-07 10:43:01] 087c0ee7 || _run(args)
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<> [2021-12-07 10:43:01] 087c0ee7 || return _run_static(args)
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<> [2021-12-07 10:43:01] 087c0ee7 || _launch_job(args, settings, nics, command)
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<> [2021-12-07 10:43:01] 087c0ee7 || run_controller(args.use_gloo, gloo_run_fn,
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<> [2021-12-07 10:43:01] 087c0ee7 || gloo_run()
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<> [2021-12-07 10:43:01] 087c0ee7 || gloo_run(settings, nics, env, driver_ip, command)
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<> [2021-12-07 10:43:01] 087c0ee7 || File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<> [2021-12-07 10:43:01] 087c0ee7 || launch_gloo(command, exec_command, settings, nics, env, server_ip)
<> [2021-12-07 10:43:01] 087c0ee7 || raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<> [2021-12-07 10:43:01] 087c0ee7 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<> [2021-12-07 10:43:01] 087c0ee7 || Exit code: 1
<> [2021-12-07 10:43:01] 087c0ee7 || Process name: 1
<info> [2021-12-07 10:43:05] 087c0ee7 || INFO: container failed with non-zero exit code: (exit code 1)
<info> [2021-12-07 10:43:21] ef45fdce || INFO: rpc error: code = Unknown desc = Error: No such container: 4def014024e5fa3d7cf76695ce39c4d6821b2efe6daf98a2bcdd2ee1fc8d5cc0
<info> [2021-12-07 10:43:22] ef45fdce || INFO: container failed with non-zero exit code: (exit code 137)
Hi @ramakrishnamamidi, as suggested in the Determined Community Slack thread, please first verify that the code works in a single-GPU experiment before you try distributed training, as outlined in the debugging guide: https://docs.determined.ai/latest/training-debug/index.html
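Both ranks in the log above fail with `DataLossError: corrupted record at 0` at `IteratorGetNext`, which suggests the input files could not be parsed from the very first record. The following is a minimal sketch, not part of the original experiment code, that checks a file's record framing without TensorFlow. It assumes the dataset is stored as uncompressed TFRecord files; the function names are illustrative only.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC_TABLE = []
for _i in range(256):
    _c = _i
    for _ in range(8):
        _c = (_c >> 1) ^ 0x82F63B78 if _c & 1 else _c >> 1
    _CRC_TABLE.append(_c)


def crc32c(data: bytes) -> int:
    """Plain CRC-32C of a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    """CRC as stored in TFRecord files: rotate right by 15 bits, then add a constant."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def count_tfrecords(path: str) -> int:
    """Walk every record in one file; raise ValueError at the first bad one.

    Each record is laid out as: uint64 length, uint32 masked CRC of the length,
    `length` payload bytes, uint32 masked CRC of the payload (little-endian).
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(12)
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated header at offset {offset}")
            length_bytes = header[:8]
            (length_crc,) = struct.unpack("<I", header[8:])
            if masked_crc32c(length_bytes) != length_crc:
                raise ValueError(f"corrupted record at offset {offset}")
            (length,) = struct.unpack("<Q", length_bytes)
            payload = f.read(length + 4)
            if len(payload) < length + 4:
                raise ValueError(f"truncated record at offset {offset}")
            (data_crc,) = struct.unpack("<I", payload[length:])
            if masked_crc32c(payload[:length]) != data_crc:
                raise ValueError(f"corrupted payload at offset {offset}")
            count += 1
```

Running `count_tfrecords` inside the task container, on the same paths the trial's data loader opens, would show whether the files are readable from that pod. A failure at offset 0 usually means the file is not a TFRecord at all (for example a compressed file, a partial download, or a saved error page). Files written with GZIP or ZLIB compression will also fail this check even though TensorFlow can read them when given the matching `compression_type`.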
Hi @vishnu2kmohan, I have deployed Determined to use only CPUs. My value.yaml file is as follows:
# The image registry to use. Defaults to the determinedai repository in DockerHub.
imageRegistry: determinedai
# Install Determined enterprise edition.
enterpriseEdition: false
# Should be configured if using the master image in the Determined enterprise edition
# or private registry.
imagePullSecretName:
# masterPort configures the port at which the Determined master listens for connections on.
masterPort: 8080
# When useNodePortForMaster is set to false (default), a LoadBalancer service is deployed to make
# the Determined master reachable from outside the cluster. When useNodePortForMaster is set to
# true, the master will instead be exposed behind a NodePort service. When using a NodePort service
# users will typically have to configure an Ingress to make the Determined master reachable from
# outside the cluster. NodePort service is recommended when configuring TLS termination in a
# load-balancer.
useNodePortForMaster: false
# tlsSecret enables TLS encryption for all communication made to the Determined master (TLS
# termination is performed in the Determined master). This includes communication between the
# Determined master and the task containers it launches, but does not include communication between
# the task containers (distributed training). The specified Secret of type tls must already exist in
# the same namespace in which Determined is being installed.
# tlsSecret:
# db sets the configurations for the database.
db:
  # To deploy your own Postgres DB, provide a hostAddress. If hostAddress is provided, Determined
  # will skip deploying a Postgres DB.
  # hostAddress:
  # Required parameters, whether you are using your own DB or a Determined DB.
  name: determined
  user: postgres
  password: postgres
  port: 5432
  # Only used for Determined DB deployment. Configures the size of the PersistentVolumeClaim for the
  # Determined deployed database, as well as the CPU and memory requirements. Should be adjusted for
  # scale.
  storageSize: 30Gi
  cpuRequest: 2
  memRequest: 8Gi
  # useNodePortForDB configures whether ClusterIP or NodePort service type is used for the
  # Determined deployed DB. By default ClusterIP is used.
  useNodePortForDB: false
  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # Determined deployed database. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:
# checkpointStorage controls where checkpoints are stored. Supported types include `shared_fs`,
# `gcs`, and `s3`.
checkpointStorage:
  # Applicable to all checkpointStorage types.
  saveExperimentBest: 0
  saveTrialBest: 1
  saveTrialLatest: 1
  # Comment out if not using `shared_fs`. Users are strongly discouraged from using `shared_fs` for
  # storage beyond initial testing as most Kubernetes cluster nodes do not have a shared file
  # system.
  type: shared_fs
  hostPath: /checkpoints
  # For storing in GCS.
  # type: gcs
  # bucket: <bucket_name>
  # For storing in S3.
  # type: s3
  # bucket: <bucket_name>
  # accessKey: <access_key>
  # secretKey: <secret_key>
  # endpointUrl: <endpoint_url>
  # For storing in Azure Blob Storage with a connection string.
  # Do NOT use if already using Azure Blob Storage with account URL
  # type: azure
  # container: <container_name>
  # connection_string: <connection_string>
  # For storing in Azure Blob Storage with an account URL.
  # Do NOT use if already using Azure Blob Storage with connection string.
  # The `credential` field is optional.
  # type: azure
  # container: <container_name>
  # account_url: <account_url>
  # credential: <credential>
# This is the number of GPUs there are per machine. Determined uses this information when scheduling
# multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
# `slotsPerTask / maxSlotsPerPod` separate pods, with each pod assigned up to `maxSlotsPerPod` GPUs.
# Distributed tasks with sizes that are not divisible by `maxSlotsPerPod` are never scheduled. If
# you have a cluster of different size nodes (e.g., 4 and 8 GPUs per node), set `maxSlotsPerPod` to
# the greatest common divisor of all the sizes (4, in that case).
maxSlotsPerPod: 1
## For CPU-only clusters, use `slotType: cpu`, and make sure to set `slotResourceRequests` below.
# slotType: cpu
# slotResourceRequests:
## Number of cpu units requested for compute slots. Note: since kubernetes may schedule some
## system tasks on the nodes which take up some resources, 8-core node may not always fit
## a `cpu: 8` task container.
# cpu: 7
slotType: cpu
slotResourceRequests:
  cpu: 4
# Memory and CPU requirements for the master instance. Should be adjusted for scale.
masterCpuRequest: 2
masterMemRequest: 8Gi
## Configure the task container defaults. Tasks include trials, commands, TensorBoards, notebooks,
## and shells. For all task containers, shm_size_bytes and network_mode are configurable. For
## trials, the network interface used by distributed (multi-machine) training and ports used by the
## NCCL and GLOO libraries during distributed training are configurable. These default to
## auto-discovery and random non-privileged ports, respectively.
taskContainerDefaults:
# networkMode: bridge
# dtrainNetworkInterface: <network interface name>
# ncclPortRange: <MIN:MAX>
# glooPortRange: <MIN:MAX>
# forcePullImage: <true or false>
# Configure a default pod spec for all GPU tasks (experiments, notebooks, commands) and CPU tasks
# (CPU notebooks, TensorBoards, zero-slot commands). If a pod spec is defined for an individual
# task, that pod spec will replace the default one that is defined here. See
# https://docs.determined.ai/latest/topic-guides/custom-pod-specs.html for more details.
# cpuPodSpec:
# gpuPodSpec:
# Configure default Docker images for all GPU tasks (experiments, notebooks, commands) and
# CPU tasks (CPU notebooks, TensorBoards, zero-slot commands). If a Docker image is defined
# for an individual task, that image will replace the default one that is defined here.
# If specifying a default image, both GPU and CPU default images must be defined.
# cpuImage:
# gpuImage:
## Configure whether we collect anonymous information about the usage of Determined.
telemetry:
  enabled: true
## A user-friendly name to identify this cluster by.
# clusterName: Dev
# defaultPassword sets the password for the admin and determined user accounts.
# defaultPassword:
## Configure how trial logs are stored.
# logging:
## The backend to use. Can be `default` to send logs to the master to store in the PostgreSQL
## database or `elastic` to store logs in an Elasticsearch cluster (without going through the
## master).
# type: default
## The remaining options should be provided only for the `elastic` backend.
## The host and port to use to connect to the Elasticsearch cluster.
# host: <host>
# port: <port>
## Authentication and TLS options for making the connection to Elasticsearch.
# security:
# username: <username>
# password: <password>
# tls:
# enabled: true
# skipVerify: false
## The name to use when verifying the certificate, if different from the name used to connect.
# certificateName: <name>
## This value must contain the contents of the certificate file, not a path. It may be set
## directly or using `helm install --set-file logging.security.tls.certificate=<path>`.
# certificate: <certificate contents>
## Configure the default Determined scheduler.
## Currently supports "coscheduler" for gang scheduling and "preemption" for priority-based
## scheduling with preemption.
# defaultScheduler: preemption
I want to use slots_per_trial to throttle the number of pods and measure how long it takes to train a DL model.
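For reference, here is a rough sketch of the pod arithmetic described in the `maxSlotsPerPod` comment of the values file above: a distributed task is split into `slotsPerTask / maxSlotsPerPod` pods, and sizes that do not divide evenly are never scheduled. The function names are made up for illustration and are not part of Determined. The single-pod branch for tasks that fit within one pod is an assumption, since the comment does not cover that case.

```python
from functools import reduce
from math import gcd


def pods_for_task(slots_per_task: int, max_slots_per_pod: int) -> int:
    """Estimate how many pods a task is split into (illustrative only)."""
    if slots_per_task <= max_slots_per_pod:
        # Assumption: a task that fits in a single pod gets exactly one pod.
        return 1
    if slots_per_task % max_slots_per_pod != 0:
        # Per the values.yaml comment, such tasks are never scheduled.
        raise ValueError(
            f"{slots_per_task} slots is not divisible by "
            f"maxSlotsPerPod={max_slots_per_pod}"
        )
    return slots_per_task // max_slots_per_pod


def suggested_max_slots_per_pod(node_sizes: list[int]) -> int:
    """Greatest common divisor of the per-node slot counts."""
    return reduce(gcd, node_sizes)


if __name__ == "__main__":
    # With maxSlotsPerPod: 1, each slot in slots_per_trial maps to one pod.
    for slots in (1, 2, 4):
        print(slots, "slots ->", pods_for_task(slots, 1), "pod(s)")
    # Mixed 4- and 8-slot nodes, as in the values.yaml comment.
    print("suggested maxSlotsPerPod:", suggested_max_slots_per_pod([4, 8]))
```

Under these assumptions, with `maxSlotsPerPod: 1` the pod count simply follows `slots_per_trial`, so raising or lowering that one value is enough to compare training times across pod counts.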
Hi @ramakrishnamamidi, is it safe to close this issue now that it has been resolved by re-uploading your dataset?
Please reopen if necessary.
Hi, I am trying to create a distributed training experiment using the flower-classification dataset.
Below is my model_def.py code:
Following is my distribute.yaml file:
Dockerfile for the image used above:
When I create an experiment, I get the following logs:
Can someone help me understand what is going wrong? Does this error mean the data is being read incorrectly, is not being read at all, or that the tfrec file is corrupted?
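One way to tell a corrupted file apart from a reading problem is to check the TFRecord framing directly, outside the training job. The following is a minimal, dependency-free sketch: each record is stored as a little-endian uint64 length, a masked CRC-32C of that length, the payload, and a masked CRC-32C of the payload. The function names are illustrative, and the bitwise CRC is slow on large files, so treat this as a spot check rather than a replacement for TensorFlow's own reader.

```python
import io
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli polynomial); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """Apply the CRC masking used by the TFRecord format."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def count_valid_records(stream) -> int:
    """Return the record count; raise ValueError at the first bad record."""
    count = 0
    while True:
        header = stream.read(12)  # uint64 length + uint32 CRC of the length
        if not header:
            return count
        if len(header) < 12:
            raise ValueError(f"record {count}: truncated header")
        length_bytes = header[:8]
        (length_crc,) = struct.unpack("<I", header[8:])
        if masked_crc(length_bytes) != length_crc:
            raise ValueError(f"record {count}: length checksum mismatch")
        (length,) = struct.unpack("<Q", length_bytes)
        body = stream.read(length + 4)  # payload + uint32 CRC of the payload
        if len(body) < length + 4:
            raise ValueError(f"record {count}: truncated payload")
        (payload_crc,) = struct.unpack("<I", body[length:])
        if masked_crc(body[:length]) != payload_crc:
            raise ValueError(f"record {count}: payload checksum mismatch")
        count += 1


def encode_record(payload: bytes) -> bytes:
    """Demo helper: frame one payload the way a TFRecord writer would."""
    length = struct.pack("<Q", len(payload))
    return (length + struct.pack("<I", masked_crc(length))
            + payload + struct.pack("<I", masked_crc(payload)))


if __name__ == "__main__":
    data = encode_record(b"example") + encode_record(b"another")
    print(count_valid_records(io.BytesIO(data)))
    try:
        count_valid_records(io.BytesIO(data[:-3]))  # simulate a cut-off file
    except ValueError as err:
        print("detected:", err)
```

To check a real file, open it in binary mode and pass the handle, for example `count_valid_records(open(path, "rb"))` with `path` pointing at one of the `.tfrec` files. If every file passes, the data on disk is probably intact, and the model's parsing code or the way the data reaches the pod is the more likely culprit.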