determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

Issue in distributed training #3275

Closed ramakrishnamamidi closed 2 years ago

ramakrishnamamidi commented 2 years ago

Hi, I am trying to create a distributed training experiment using the flower-classification dataset.

Below is my model_def.py code:

import tensorflow as tf
import numpy as np

import urllib.request
import os

from determined.keras import TFKerasTrial
from tensorflow import keras

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxx"
os.environ["AWS_REGION"] = "us-east-1"
os.environ["S3_ENDPOINT"] = "xx.x.xxx.xx:xxxxx"

DS_PATH = "s3://dataset/flower-classification"

IMAGE_SIZE = [512, 512] # At this size, a GPU will run out of memory. Use the TPU.
                        # For GPU training, please select 224 x 224 px image size.

EPOCHS = 12 #int(os.environ.get("epochs")) or 12
BATCH_SIZE = 16 #int(os.environ.get("batch_size")) or 16

STEPS_PER_EPOCH = 1
VALIDATION_STEPS = 1
TEST_STEPS = 1

PATH_SELECT = { # available image sizes
    192: DS_PATH + '/tfrecords-jpeg-192x192',
    224: DS_PATH + '/tfrecords-jpeg-224x224',
    331: DS_PATH + '/tfrecords-jpeg-331x331',
    512: DS_PATH + '/tfrecords-jpeg-512x512'
}

MINIO_PATH = PATH_SELECT[IMAGE_SIZE[0]]

TRAINING_FILENAMES= """00-512x512-798.tfrec  03-512x512-798.tfrec  06-512x512-798.tfrec  09-512x512-798.tfrec  12-512x512-798.tfrec  15-512x512-783.tfrec
01-512x512-798.tfrec  04-512x512-798.tfrec  07-512x512-798.tfrec  10-512x512-798.tfrec  13-512x512-798.tfrec
02-512x512-798.tfrec  05-512x512-798.tfrec  08-512x512-798.tfrec  11-512x512-798.tfrec  14-512x512-798.tfrec""".split()

TRAINING_FILENAMES = ["s3://dataset/flower-classification/tfrecords-jpeg-512x512/train/"+i for i in TRAINING_FILENAMES]

VALIDATION_FILENAMES="""00-512x512-232.tfrec  03-512x512-232.tfrec  06-512x512-232.tfrec  09-512x512-232.tfrec  12-512x512-232.tfrec  15-512x512-232.tfrec
01-512x512-232.tfrec  04-512x512-232.tfrec  07-512x512-232.tfrec  10-512x512-232.tfrec  13-512x512-232.tfrec
02-512x512-232.tfrec  05-512x512-232.tfrec  08-512x512-232.tfrec  11-512x512-232.tfrec  14-512x512-232.tfrec""".split()

VALIDATION_FILENAMES=[ "s3://dataset/flower-classification/tfrecords-jpeg-512x512/val/"+i for i in VALIDATION_FILENAMES]

TEST_FILENAMES = """00-512x512-462.tfrec  03-512x512-462.tfrec  06-512x512-462.tfrec  09-512x512-462.tfrec  12-512x512-462.tfrec  15-512x512-452.tfrec
01-512x512-462.tfrec  04-512x512-462.tfrec  07-512x512-462.tfrec  10-512x512-462.tfrec  13-512x512-462.tfrec
02-512x512-462.tfrec  05-512x512-462.tfrec  08-512x512-462.tfrec  11-512x512-462.tfrec  14-512x512-462.tfrec""".split()

TEST_FILENAMES = [  "s3://dataset/flower-classification/tfrecords-jpeg-512x512/test/"+i for i in TEST_FILENAMES]

CLASSES = ['pink primrose', 'hard-leaved pocket orchid', 'canterbury bells', 'sweet pea', 'wild geranium', 'tiger lily', 'moon orchid', 'bird of paradise', 'monkshood', 'globe thistle', 'snapdragon', "colt's foot", 'king protea', 'spear thistle', 'yellow iris', 'globe-flower', 'purple coneflower', 'peruvian lily', 'balloon flower', 'giant white arum lily', 'fire lily', 'pincushion flower', 'fritillary', 'red ginger', 'grape hyacinth', 'corn poppy', 'prince of wales feathers', 'stemless gentian', 'artichoke', 'sweet william', 'carnation', 'garden phlox', 'love in the mist', 'cosmos', 'alpine sea holly', 'ruby-lipped cattleya', 'cape flower', 'great masterwort', 'siam tulip', 'lenten rose', 'barberton daisy', 'daffodil', 'sword lily', 'poinsettia', 'bolero deep blue', 'wallflower', 'marigold', 'buttercup', 'daisy', 'common dandelion', 'petunia', 'wild pansy', 'primula', 'sunflower', 'lilac hibiscus', 'bishop of llandaff', 'gaura', 'geranium', 'orange dahlia', 'pink-yellow dahlia', 'cautleya spicata', 'japanese anemone', 'black-eyed susan', 'silverbush', 'californian poppy', 'osteospermum', 'spring crocus', 'iris', 'windflower', 'tree poppy', 'gazania', 'azalea', 'water lily', 'rose', 'thorn apple', 'morning glory', 'passion flower', 'lotus', 'toad lily', 'anthurium', 'frangipani', 'clematis', 'hibiscus', 'columbine', 'desert-rose', 'tree mallow', 'magnolia', 'cyclamen ', 'watercress', 'canna lily', 'hippeastrum ', 'bee balm', 'pink quill', 'foxglove', 'bougainvillea', 'camellia', 'mallow', 'mexican petunia', 'bromelia', 'blanket flower', 'trumpet creeper', 'blackberry lily', 'common tulip', 'wild rose']

AUTO = tf.data.experimental.AUTOTUNE
np.set_printoptions(threshold=15, linewidth=80)

class FlowerClassificationTrial(TFKerasTrial):
    def __init__(self, context):
        self.context = context

        # # Create a unique download directory for each rank so they don't overwrite each
        # # other when doing distributed training.
        # self.download_directory = f"/tmp/data-rank{self.context.distributed.get_rank()}"
        # self.data_downloaded = False

    def build_model(self):
        img_adjust_layer = tf.keras.layers.Lambda(
            lambda data: tf.keras.applications.vgg16.preprocess_input(tf.cast(data, tf.float32)),
            input_shape=[*IMAGE_SIZE, 3])
        pretrained_model = tf.keras.applications.VGG16(weights='imagenet', include_top=False)

        pretrained_model.trainable = False  # False = transfer learning, True = fine-tuning
        model = tf.keras.Sequential([
            img_adjust_layer,
            pretrained_model,
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(len(CLASSES), activation='softmax')
        ])

        print("Sequential with layers obj made")
        # Wrap the model.
        model = self.context.wrap_model(model)

        print("Wraped model in context")

        # Create and wrap optimizer.
        optimizer = tf.keras.optimizers.Adam()
        optimizer = self.context.wrap_optimizer(optimizer)

        model.compile(
            optimizer=optimizer,
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name="accuracy")]
        )

        print("Model compiled")
        return model

    def batch_to_numpy_images_and_labels(self, data):
        images, labels = data
        numpy_images = images.numpy()
        numpy_labels = labels.numpy()
        if numpy_labels.dtype == object: # binary string in this case, these are image ID strings
            numpy_labels = [None for _ in enumerate(numpy_images)]
        # If no labels, only image IDs, return None for labels (this is the case for test data)
        return numpy_images, numpy_labels

    def title_from_label_and_target(self, label, correct_label):
        if correct_label is None:
            return CLASSES[label], True
        correct = (label == correct_label)
        return "{} [{}{}{}]".format(CLASSES[label], 'OK' if correct else 'NO', u"\u2192" if not correct else '',
                                        CLASSES[correct_label] if not correct else ''), correct

    def decode_image(self, image_data):
        image = tf.image.decode_jpeg(image_data, channels=3)  # image format uint8 [0,255]
        image = tf.reshape(image, [*IMAGE_SIZE, 3]) # explicit size needed for TPU
        return image

    def read_labeled_tfrecord(self, example):
        LABELED_TFREC_FORMAT = {
            "image": tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
            "class": tf.io.FixedLenFeature([], tf.int64),  # shape [] means single element
        }
        example = tf.io.parse_single_example(example, LABELED_TFREC_FORMAT)
        image = self.decode_image(example['image'])
        label = tf.cast(example['class'], tf.int32)
        return image, label # returns a dataset of (image, label) pairs

    def read_unlabeled_tfrecord(self, example):
        UNLABELED_TFREC_FORMAT = {
            "image": tf.io.FixedLenFeature([], tf.string), # tf.string means bytestring
            "id": tf.io.FixedLenFeature([], tf.string),  # shape [] means single element
            # class is missing; this competition's challenge is to predict flower classes for the test dataset
        }
        example = tf.io.parse_single_example(example, UNLABELED_TFREC_FORMAT)
        image = self.decode_image(example['image'])
        idnum = example['id']
        return image, idnum # returns a dataset of image(s)

    def load_dataset(self, filenames, labeled=True, ordered=False):
        # Read from TFRecords. For optimal performance, reading from multiple files at once and
        # disregarding data order. Order does not matter since we will be shuffling the data anyway.

        ignore_order = tf.data.Options()
        if not ordered:
            ignore_order.experimental_deterministic = False # disable order, increase speed

        dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTO) # automatically interleaves reads from multiple files
        dataset = dataset.with_options(ignore_order) # uses data as soon as it streams in, rather than in its original order
        dataset = dataset.map(self.read_labeled_tfrecord if labeled else self.read_unlabeled_tfrecord, num_parallel_calls=AUTO)
        # returns a dataset of (image, label) pairs if labeled=True or (image, id) pairs if labeled=False
        return dataset

    def data_augment(self, image, label):
        # data augmentation. Thanks to the dataset.prefetch(AUTO) statement in the next function (below),
        # this happens essentially for free on TPU. Data pipeline code is executed on the "CPU" part
        # of the TPU while the TPU itself is computing gradients.
        image = tf.image.random_flip_left_right(image)
        #image = tf.image.random_saturation(image, 0, 2)
        return image, label

    def get_training_dataset_new(self):
        dataset = self.load_dataset(TRAINING_FILENAMES, labeled=True)
        print(dataset, type(dataset))
        dataset = dataset.map(self.data_augment, num_parallel_calls=AUTO)
        print(dataset, type(dataset))
        dataset = self.context.wrap_dataset(dataset)
        print(dataset, type(dataset))
        dataset = dataset.cache().shuffle(2048).batch(self.context.get_per_slot_batch_size()).repeat()
        dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
        print(dataset, type(dataset))
        return dataset

    def get_validation_dataset_new(self):
        dataset = self.load_dataset(VALIDATION_FILENAMES, labeled=True, ordered=False)
        dataset = self.context.wrap_dataset(dataset)
        dataset = dataset.batch(self.context.get_per_slot_batch_size())
        # dataset = dataset.cache()
        # dataset = dataset.prefetch(AUTO)  # prefetch next batch while training (autotune prefetch buffer size)
        return dataset

    def build_training_data_loader(self):
        train_dataset = self.get_training_dataset_new()
        return train_dataset

    def build_validation_data_loader(self):
        validation_dataset = self.get_validation_dataset_new()
        return validation_dataset

Following is my distribute.yaml file:

name: flower-classification
hyperparameters:
  global_batch_size: 256
  dense1: 128
environment:
  image: ramakrishna1592/flower-classification-determinedai:v1
  pod_spec:
    resources:
      requests:
        memory: "1Gi"
        cpu: "1"
      limits:
        memory: "2Gi"
        cpu: "2"
resources:
  slots_per_trial: 2
records_per_epoch: 60000
searcher:
  name: single
  metric: val_accuracy
  smaller_is_better: false
  max_length:
    epochs: 5
entrypoint: model_def2:FlowerClassificationTrial

Dockerfile for the image used above:

FROM determinedai/environments:py-3.8-pytorch-1.9-lightning-1.3-tf-2.4-cpu-0.17.4

RUN pip install boto3 tensorflow_io==0.17.1 
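
As a sanity check (not part of the experiment), a minimal sketch like the following could be run inside this image to confirm the MinIO/S3 endpoint is reachable and the tfrec shards are visible; the tensorflow_io import is only there in case the TF build in the image relies on it for the s3:// scheme:

import os

# Same (redacted) credentials and endpoint as in model_def.py.
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxx"
os.environ["AWS_REGION"] = "us-east-1"
os.environ["S3_ENDPOINT"] = "xx.x.xxx.xx:xxxxx"

import tensorflow as tf
import tensorflow_io  # noqa: F401  (may be needed to register the s3:// filesystem)

# List the training shards under the same bucket/prefix used in model_def.py.
files = tf.io.gfile.glob("s3://dataset/flower-classification/tfrecords-jpeg-512x512/train/*.tfrec")
print(len(files), "training shards visible")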

When I create an experiment, I get the following logs:

<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return graph_function._call_flat(
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     outputs = execute.execute(
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||    [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || Function call stack:
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 
<info> [2021-12-07 10:43:00] 087c0ee7 [rank=0] || train_function
<info> [2021-12-07 10:43:01] 087c0ee7 || Process 1 exit with status code 1.
<info> [2021-12-07 10:43:01] 087c0ee7 || Terminating remaining workers after failure of Process 1.
<info> [2021-12-07 10:43:01] 087c0ee7 || [0]<stderr>:Terminated
<info> [2021-12-07 10:43:01] 087c0ee7 || Process 0 exit with status code 143.
<info> [2021-12-07 10:43:01] 087c0ee7 || Traceback (most recent call last):
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<info> [2021-12-07 10:43:01] 087c0ee7 ||     sys.exit(run_commandline())
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<info> [2021-12-07 10:43:01] 087c0ee7 ||     _run(args)
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<info> [2021-12-07 10:43:01] 087c0ee7 ||     return _run_static(args)
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<info> [2021-12-07 10:43:01] 087c0ee7 ||     _launch_job(args, settings, nics, command)
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<info> [2021-12-07 10:43:01] 087c0ee7 ||     run_controller(args.use_gloo, gloo_run_fn,
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<info> [2021-12-07 10:43:01] 087c0ee7 ||     gloo_run()
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<info> [2021-12-07 10:43:01] 087c0ee7 ||     gloo_run(settings, nics, env, driver_ip, command)
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<info> [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<info> [2021-12-07 10:43:01] 087c0ee7 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<info> [2021-12-07 10:43:01] 087c0ee7 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<info> [2021-12-07 10:43:01] 087c0ee7 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<info> [2021-12-07 10:43:01] 087c0ee7 || Exit code: 1
<info> [2021-12-07 10:43:01] 087c0ee7 || Process name: 1
<info> [2021-12-07 10:43:05] 087c0ee7 || INFO: container failed with non-zero exit code:  (exit code 1)
<info> [2021-12-07 10:43:21] ef45fdce || INFO: rpc error: code = Unknown desc = Error: No such container: 4def014024e5fa3d7cf76695ce39c4d6821b2efe6daf98a2bcdd2ee1fc8d5cc0
<info> [2021-12-07 10:43:22] ef45fdce || INFO: container failed with non-zero exit code:  (exit code 137)

Can someone help me understand what is going wrong? Does this error mean the data is being read incorrectly, that it is not being read at all, or that the tfrec files themselves are corrupted?
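
To help narrow that down, here is a minimal standalone sketch (hypothetical, reusing the same credentials, endpoint, and paths as in model_def.py) that tries to read the first record of one training shard outside of Determined. If this also raises DataLossError, the problem is with the file or the S3 transfer rather than with the trial code:

import os

# Same (redacted) credentials and MinIO endpoint as in model_def.py.
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxx"
os.environ["AWS_REGION"] = "us-east-1"
os.environ["S3_ENDPOINT"] = "xx.x.xxx.xx:xxxxx"

import tensorflow as tf

# One of the training shards listed in model_def.py.
path = "s3://dataset/flower-classification/tfrecords-jpeg-512x512/train/00-512x512-798.tfrec"

dataset = tf.data.TFRecordDataset([path])
try:
    # Pull only the first raw record to see whether it deserializes at all.
    for raw_record in dataset.take(1):
        print("Read", len(raw_record.numpy()), "bytes from the first record")
except tf.errors.DataLossError as err:
    print("DataLossError while reading the record:", err)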

ramakrishnamamidi commented 2 years ago

Detailed log of the experiment:

<info>    [2021-12-07 10:39:05] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pod resources allocated.
<info>    [2021-12-07 10:39:05] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pod resources allocated.
<info>    [2021-12-07 10:39:05] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:05] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 970.970925ms
<info>    [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 903.701493ms
<info>    [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-init-container
<info>    [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-init-container
<info>    [2021-12-07 10:39:06] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-init-container
<info>    [2021-12-07 10:39:06] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-init-container
<info>    [2021-12-07 10:39:07] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:39:07] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "fluent/fluent-bit:1.6" in 1.197434698s
<info>    [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-fluent-container
<info>    [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-fluent-container
<info>    [2021-12-07 10:39:08] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:08] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "fluent/fluent-bit:1.6" in 1.174634066s
<info>    [2021-12-07 10:39:08] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-fluent-container
<info>    [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-fluent-container
<info>    [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 891.734985ms
<info>    [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Created container determined-container
<info>    [2021-12-07 10:39:09] 26606941 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-merry-marmoset: Started container determined-container
<info>    [2021-12-07 10:39:09] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 887.005158ms
<info>    [2021-12-07 10:39:10] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Created container determined-container
<info>    [2021-12-07 10:39:10] 20c5c597 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-solid-chicken: Started container determined-container
<>        [2021-12-07 10:39:11] 26606941 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:39:11] 26606941 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:11] 26606941 || + '[' -z '' ']'
<>        [2021-12-07 10:39:11] 26606941 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:11] 26606941 || + /bin/which python3
<>        [2021-12-07 10:39:11] 26606941 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:11] 26606941 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:11] 26606941 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:39:11] 26606941 || + '[' /root = / ']'
<>        [2021-12-07 10:39:11] 20c5c597 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:39:11] 20c5c597 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:11] 20c5c597 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:11] 20c5c597 || + '[' -z '' ']'
<>        [2021-12-07 10:39:11] 20c5c597 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:11] 20c5c597 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:11] 20c5c597 || + /bin/which python3
<>        [2021-12-07 10:39:11] 20c5c597 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:39:11] 20c5c597 || + '[' /root = / ']'
<warning> [2021-12-07 10:39:11] 26606941 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:39:12] 26606941 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:39:12] 20c5c597 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:39:12] 20c5c597 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:39:12] 26606941 || + test -f startup-hook.sh
<>        [2021-12-07 10:39:12] 26606941 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:39:12] 20c5c597 || + test -f startup-hook.sh
<>        [2021-12-07 10:39:12] 20c5c597 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:39:13] 26606941 || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:39:13] 20c5c597 || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:39:13] 26606941 || INFO: New trial runner in (container 26606941-c636-4569-809f-3dcb7cbd64c0) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:39:13] 20c5c597 || INFO: New trial runner in (container 20c5c597-0b2c-4dec-a6bb-a9ba5acb49f5) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:39:15] 26606941 || 2021-12-07 10:39:15.271627: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:15] 26606941 || 2021-12-07 10:39:15.271676: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:39:17] 26606941 || 2021-12-07 10:39:17.424460: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:17] 26606941 || 2021-12-07 10:39:17.424512: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20,264:INFO [175]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20,310:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20.377771: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:20] 26606941 [rank=0] || 2021-12-07 10:39:20.377805: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20.450750: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:20] 20c5c597 [rank=1] || 2021-12-07 10:39:20.450790: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23,017:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017584: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017814: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017837: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.017875: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-sol): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.018851: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23,020:INFO [175]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.020706: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021048: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021026: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.021083: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.0-mer): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || 2021-12-07 10:39:23.021637: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.022068: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || 2021-12-07 10:39:23.023287: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:39:23] 20c5c597 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:39:23] 26606941 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:39:24] 20c5c597 [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:39:24] 20c5c597 [rank=1] || Wraped model in context
<>        [2021-12-07 10:39:24] 20c5c597 [rank=1] || Model compiled
<>        [2021-12-07 10:39:24] 20c5c597 [rank=1] || 2021-12-07 10:39:24,347:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:39:24] 26606941 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:39:24] 26606941 [rank=0] || Wraped model in context
<>        [2021-12-07 10:39:24] 26606941 [rank=0] || Model compiled
<>        [2021-12-07 10:39:24] 26606941 [rank=0] || 2021-12-07 10:39:24,746:WARNING [175]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:39:24] 26606941 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || 2021-12-07 10:39:26.430134: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || 2021-12-07 10:39:26.434228: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     controller.run()
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || Function call stack:
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || 
<>        [2021-12-07 10:39:26] 26606941 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || train_function
<>        [2021-12-07 10:39:26] 26606941 [rank=0] || 
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || 2021-12-07 10:39:26.543108: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || 2021-12-07 10:39:26.547507: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     controller.run()
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || train_function
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || 
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || Function call stack:
<>        [2021-12-07 10:39:26] 20c5c597 [rank=1] || 
<>        [2021-12-07 10:39:27] 26606941 || Process 0 exit with status code 1.
<>        [2021-12-07 10:39:27] 26606941 || Terminating remaining workers after failure of Process 0.
<>        [2021-12-07 10:39:27] 26606941 || Traceback (most recent call last):
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:39:27] 26606941 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:39:27] 26606941 ||     _run(args)
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:39:27] 26606941 ||     return _run_static(args)
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:39:27] 26606941 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:39:27] 26606941 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:39:27] 26606941 ||     gloo_run()
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:39:27] 26606941 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:39:27] 26606941 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:39:27] 26606941 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:39:27] 26606941 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:39:27] 26606941 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:39:27] 26606941 || Exit code: 1
<>        [2021-12-07 10:39:27] 26606941 || Process name: 0
<info>    [2021-12-07 10:39:27] 26606941 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:39:44] 20c5c597 || INFO: container failed with non-zero exit code:  (exit code 137)
<info>    [2021-12-07 10:39:45] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pod resources allocated.
<info>    [2021-12-07 10:39:45] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pod resources allocated.
<info>    [2021-12-07 10:39:46] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:46] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 872.456065ms
<info>    [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-init-container
<info>    [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 888.796743ms
<info>    [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-init-container
<info>    [2021-12-07 10:39:47] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-init-container
<info>    [2021-12-07 10:39:47] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-init-container
<info>    [2021-12-07 10:39:48] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:39:48] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "fluent/fluent-bit:1.6" in 1.165672526s
<info>    [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-fluent-container
<info>    [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-fluent-container
<info>    [2021-12-07 10:39:49] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "fluent/fluent-bit:1.6" in 1.163258809s
<info>    [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-fluent-container
<info>    [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-fluent-container
<info>    [2021-12-07 10:39:49] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 885.112166ms
<info>    [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Created container determined-container
<info>    [2021-12-07 10:39:50] 30a87832 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-model-snapper: Started container determined-container
<info>    [2021-12-07 10:39:50] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 1.081632232s
<info>    [2021-12-07 10:39:51] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Created container determined-container
<info>    [2021-12-07 10:39:51] 152337f9 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-simple-flounder: Started container determined-container
<>        [2021-12-07 10:39:52] 30a87832 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:39:52] 30a87832 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:52] 30a87832 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:52] 30a87832 || + '[' -z '' ']'
<>        [2021-12-07 10:39:52] 30a87832 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:52] 30a87832 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:52] 30a87832 || + /bin/which python3
<>        [2021-12-07 10:39:52] 30a87832 || + '[' /root = / ']'
<>        [2021-12-07 10:39:52] 30a87832 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:39:52] 152337f9 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:39:52] 152337f9 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:52] 152337f9 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:39:52] 152337f9 || + '[' -z '' ']'
<>        [2021-12-07 10:39:52] 152337f9 || + /bin/which python3
<>        [2021-12-07 10:39:52] 152337f9 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:52] 152337f9 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:39:52] 152337f9 || + '[' /root = / ']'
<>        [2021-12-07 10:39:52] 152337f9 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<warning> [2021-12-07 10:39:52] 30a87832 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:39:52] 30a87832 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:39:53] 30a87832 || + test -f startup-hook.sh
<>        [2021-12-07 10:39:53] 30a87832 || + python3 -m determined.exec.prep_container --rendezvous
<warning> [2021-12-07 10:39:53] 152337f9 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:39:53] 152337f9 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:39:53] 152337f9 || + test -f startup-hook.sh
<>        [2021-12-07 10:39:53] 152337f9 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:39:54] 152337f9 || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:39:54] 30a87832 || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:39:54] 152337f9 || INFO: New trial runner in (container 152337f9-75e9-4d7b-84f1-f44b12f7309d) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:39:54] 30a87832 || INFO: New trial runner in (container 30a87832-9e8d-4020-8583-040e84a229a1) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:39:56] 30a87832 || 2021-12-07 10:39:56.201596: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:56] 30a87832 || 2021-12-07 10:39:56.201642: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:39:58] 30a87832 || 2021-12-07 10:39:58.366589: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:39:58] 30a87832 || 2021-12-07 10:39:58.366634: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01,231:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01,241:INFO [175]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01.350328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:01] 152337f9 [rank=1] || 2021-12-07 10:40:01.350366: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01.361846: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:01] 30a87832 [rank=0] || 2021-12-07 10:40:01.361881: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03,889:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889393: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889658: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889637: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.889689: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-sim): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.890611: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || 2021-12-07 10:40:03.892165: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03,906:INFO [175]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907125: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907483: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.1-mod): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907423: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.907447: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.908658: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:40:03] 30a87832 [rank=0] || 2021-12-07 10:40:03.910072: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:03] 152337f9 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:04] 152337f9 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:04] 152337f9 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:40:04] 152337f9 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:40:04] 30a87832 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:04] 30a87832 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:04] 30a87832 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:40:04] 30a87832 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:40:04] 152337f9 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:40:04] 30a87832 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:40:05] 30a87832 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:40:05] 30a87832 [rank=0] || Wraped model in context
<>        [2021-12-07 10:40:05] 30a87832 [rank=0] || Model compiled
<>        [2021-12-07 10:40:05] 30a87832 [rank=0] || 2021-12-07 10:40:05,331:WARNING [175]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:40:05] 30a87832 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<>        [2021-12-07 10:40:05] 152337f9 [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:40:05] 152337f9 [rank=1] || Wraped model in context
<>        [2021-12-07 10:40:05] 152337f9 [rank=1] || Model compiled
<>        [2021-12-07 10:40:05] 152337f9 [rank=1] || 2021-12-07 10:40:05,461:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || 2021-12-07 10:40:07.045644: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || 2021-12-07 10:40:07.049928: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     controller.run()
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || 
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || 
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || Function call stack:
<>        [2021-12-07 10:40:07] 30a87832 [rank=0] || train_function
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || 2021-12-07 10:40:07.163460: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || 2021-12-07 10:40:07.167464: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     controller.run()
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || 
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || Function call stack:
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || train_function
<>        [2021-12-07 10:40:07] 152337f9 [rank=1] || 
<>        [2021-12-07 10:40:07] 30a87832 || Process 0 exit with status code 1.
<>        [2021-12-07 10:40:07] 30a87832 || Terminating remaining workers after failure of Process 0.
<>        [2021-12-07 10:40:07] 30a87832 || Traceback (most recent call last):
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:40:07] 30a87832 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:40:07] 30a87832 ||     _run(args)
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:40:07] 30a87832 ||     return _run_static(args)
<>        [2021-12-07 10:40:07] 30a87832 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:40:07] 30a87832 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:40:07] 30a87832 ||     gloo_run()
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:40:07] 30a87832 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:40:07] 30a87832 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:40:07] 30a87832 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:40:07] 30a87832 || Process name: 0
<>        [2021-12-07 10:40:07] 30a87832 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:40:07] 30a87832 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:40:07] 30a87832 || Exit code: 1
<info>    [2021-12-07 10:40:08] 30a87832 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:40:25] 152337f9 || INFO: container failed with non-zero exit code:  (exit code 137)
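Both ranks die on the same tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0 raised from IteratorGetNext, i.e. TensorFlow cannot parse the very first record it reads back from the S3/MinIO endpoint, so training never starts. One way to narrow this down might be to read a single shard with tf.data.TFRecordDataset outside of Determined. The sketch below is only a diagnostic, not part of the trial: it assumes the same AWS_* / S3_ENDPOINT environment variables are exported in the shell, the shard path is a placeholder, and the S3_USE_HTTPS / S3_VERIFY_SSL settings are an assumption for a plain-HTTP MinIO endpoint.

import os
import tensorflow as tf

# Diagnostic sketch only: assumes the same AWS_* / S3_ENDPOINT environment
# variables used by the trial are already exported in this shell.
# Assumption for a plain-HTTP MinIO endpoint; drop these two lines if the
# endpoint is served over HTTPS.
os.environ.setdefault("S3_USE_HTTPS", "0")
os.environ.setdefault("S3_VERIFY_SSL", "0")

# Placeholder path: substitute any one shard from the training file list.
shard = "s3://dataset/flower-classification/tfrecords-jpeg-512x512/train/<one-shard>.tfrec"

# Iterating the raw records reproduces the same DataLossError if the bytes
# coming back are not a valid TFRecord stream (for example an XML error
# response from the endpoint, or a truncated/corrupted upload).
count = 0
for _ in tf.data.TFRecordDataset(shard):
    count += 1
print(f"read {count} raw records from {shard}")

If this standalone read fails the same way, the problem is likely in how the shards were written or uploaded to MinIO rather than in the Determined/Horovod setup; if it succeeds, the credentials or endpoint variables may not be reaching the worker containers.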
<info>    [2021-12-07 10:40:26] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pod resources allocated.
<info>    [2021-12-07 10:40:26] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pod resources allocated.
<info>    [2021-12-07 10:40:27] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:40:27] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 895.4494ms
<info>    [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 876.087326ms
<info>    [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-init-container
<info>    [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-init-container
<info>    [2021-12-07 10:40:28] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-init-container
<info>    [2021-12-07 10:40:28] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-init-container
<info>    [2021-12-07 10:40:29] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:40:29] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "fluent/fluent-bit:1.6" in 1.149313965s
<info>    [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-fluent-container
<info>    [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-fluent-container
<info>    [2021-12-07 10:40:30] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "fluent/fluent-bit:1.6" in 1.167861338s
<info>    [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-fluent-container
<info>    [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-fluent-container
<info>    [2021-12-07 10:40:30] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 881.176011ms
<info>    [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Created container determined-container
<info>    [2021-12-07 10:40:31] c7698632 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-touching-aardvark: Started container determined-container
<info>    [2021-12-07 10:40:31] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.78712ms
<info>    [2021-12-07 10:40:31] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Created container determined-container
<info>    [2021-12-07 10:40:32] 514a56c5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-thorough-polecat: Started container determined-container
<>        [2021-12-07 10:40:33] c7698632 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:40:33] c7698632 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:40:33] c7698632 || + /bin/which python3
<>        [2021-12-07 10:40:33] c7698632 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:40:33] c7698632 || + '[' -z '' ']'
<>        [2021-12-07 10:40:33] c7698632 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:40:33] c7698632 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:40:33] c7698632 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:40:33] c7698632 || + '[' /root = / ']'
<>        [2021-12-07 10:40:33] 514a56c5 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:40:33] 514a56c5 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:40:33] 514a56c5 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:40:33] 514a56c5 || + '[' -z '' ']'
<>        [2021-12-07 10:40:33] 514a56c5 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:40:33] 514a56c5 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:40:33] 514a56c5 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:40:33] 514a56c5 || + '[' /root = / ']'
<>        [2021-12-07 10:40:33] 514a56c5 || + /bin/which python3
<warning> [2021-12-07 10:40:33] c7698632 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:40:33] c7698632 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:40:34] 514a56c5 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:40:34] c7698632 || + test -f startup-hook.sh
<>        [2021-12-07 10:40:34] c7698632 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:40:34] 514a56c5 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:40:34] 514a56c5 || + test -f startup-hook.sh
<>        [2021-12-07 10:40:34] 514a56c5 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:40:34] 514a56c5 || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:40:34] c7698632 || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:40:35] c7698632 || INFO: New trial runner in (container c7698632-a650-47bf-956b-f2faf2563c4e) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:40:35] 514a56c5 || INFO: New trial runner in (container 514a56c5-ee6c-4fd7-8e00-2444dc5d6b30) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:40:36] 514a56c5 || 2021-12-07 10:40:36.940567: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:36] 514a56c5 || 2021-12-07 10:40:36.940617: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:39] 514a56c5 || 2021-12-07 10:40:39.175804: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:39] 514a56c5 || 2021-12-07 10:40:39.175851: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42,133:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42.244000: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:42] c7698632 [rank=1] || 2021-12-07 10:40:42.244030: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42,313:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42.457787: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:42] 514a56c5 [rank=0] || 2021-12-07 10:40:42.457823: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45,061:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062263: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062487: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062507: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.062539: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-tou): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.063489: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45,063:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.063814: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064067: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064046: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.064098: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.2-tho): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.065006: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || 2021-12-07 10:40:45.066139: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || 2021-12-07 10:40:45.068060: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:40:45] c7698632 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:40:45] 514a56c5 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:40:46] c7698632 [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:40:46] c7698632 [rank=1] || Wraped model in context
<>        [2021-12-07 10:40:46] c7698632 [rank=1] || Model compiled
<>        [2021-12-07 10:40:46] c7698632 [rank=1] || 2021-12-07 10:40:46,543:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:40:46] 514a56c5 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:40:46] 514a56c5 [rank=0] || Wraped model in context
<>        [2021-12-07 10:40:46] 514a56c5 [rank=0] || Model compiled
<>        [2021-12-07 10:40:46] 514a56c5 [rank=0] || 2021-12-07 10:40:46,665:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:40:46] 514a56c5 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || 2021-12-07 10:40:48.421840: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || 2021-12-07 10:40:48.425828: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || 2021-12-07 10:40:48.439657: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || 2021-12-07 10:40:48.444046: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     controller.run()
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || Function call stack:
<>        [2021-12-07 10:40:48] c7698632 [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || 
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || 
<>        [2021-12-07 10:40:48] c7698632 [rank=1] || train_function
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     controller.run()
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || 
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || Function call stack:
<>        [2021-12-07 10:40:48] 514a56c5 [rank=0] || train_function
<>        [2021-12-07 10:40:49] 514a56c5 || Process 1 exit with status code 1.
<>        [2021-12-07 10:40:49] 514a56c5 || Terminating remaining workers after failure of Process 1.
<>        [2021-12-07 10:40:49] 514a56c5 || [0]<stderr>:Terminated
<>        [2021-12-07 10:40:49] 514a56c5 || Process 0 exit with status code 143.
<>        [2021-12-07 10:40:49] 514a56c5 || Traceback (most recent call last):
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:40:49] 514a56c5 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:40:49] 514a56c5 ||     _run(args)
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:40:49] 514a56c5 ||     return _run_static(args)
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:40:49] 514a56c5 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:40:49] 514a56c5 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:40:49] 514a56c5 ||     gloo_run()
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:40:49] 514a56c5 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:40:49] 514a56c5 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:40:49] 514a56c5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:40:49] 514a56c5 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:40:49] 514a56c5 || Process name: 1
<>        [2021-12-07 10:40:49] 514a56c5 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:40:49] 514a56c5 || Exit code: 1
<info>    [2021-12-07 10:40:51] 514a56c5 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:41:06] c7698632 || INFO: container failed with non-zero exit code:  (exit code 137)
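
Both ranks hit the very first IteratorGetNext with DataLossError: corrupted record at 0, and the trial keeps restarting into the same failure (logs continue below). That error usually means the bytes coming back from the s3:// paths are not valid TFRecord data at all, for example because the endpoint answered with an error body instead of the object. A quick way to narrow this down outside Determined is to read one shard directly with the same environment variables that model_def.py sets; the snippet below is only a diagnostic sketch, and the S3_USE_HTTPS / S3_VERIFY_SSL lines are an assumption for a MinIO endpoint served over plain HTTP.

# Diagnostic sketch (not part of the trial): try to read the first record of one
# training shard directly, using the same credentials/endpoint as model_def.py.
# S3_USE_HTTPS / S3_VERIFY_SSL are assumptions for a MinIO endpoint without TLS.
import os

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxxxxx"
os.environ["AWS_REGION"] = "us-east-1"
os.environ["S3_ENDPOINT"] = "xx.x.xxx.xx:xxxxx"
os.environ["S3_USE_HTTPS"] = "0"   # assumption: MinIO over plain HTTP
os.environ["S3_VERIFY_SSL"] = "0"

import tensorflow as tf

path = "s3://dataset/flower-classification/tfrecords-jpeg-512x512/train/00-512x512-798.tfrec"

# 1) Does the object exist, and do the first bytes look like binary record data
#    rather than an XML/HTML error page?
print("exists:", tf.io.gfile.exists(path))
with tf.io.gfile.GFile(path, "rb") as f:
    print("first bytes:", f.read(16))

# 2) Can TFRecordDataset parse the first record?
for raw in tf.data.TFRecordDataset([path]).take(1):
    print("first record length:", len(raw.numpy()))

If step 1 already prints something that is clearly not binary record data, the problem is the S3/MinIO access (endpoint, HTTPS setting, bucket path) rather than the Determined trial itself.
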
<info>    [2021-12-07 10:41:08] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pod resources allocated.
<info>    [2021-12-07 10:41:08] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pod resources allocated.
<info>    [2021-12-07 10:41:09] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:09] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 874.693887ms
<info>    [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-init-container
<info>    [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.053052ms
<info>    [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-init-container
<info>    [2021-12-07 10:41:10] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-init-container
<info>    [2021-12-07 10:41:10] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-init-container
<info>    [2021-12-07 10:41:11] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:41:11] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "fluent/fluent-bit:1.6" in 1.159570999s
<info>    [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-fluent-container
<info>    [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-fluent-container
<info>    [2021-12-07 10:41:12] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:12] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "fluent/fluent-bit:1.6" in 1.17294199s
<info>    [2021-12-07 10:41:12] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-fluent-container
<info>    [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-fluent-container
<info>    [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 877.471658ms
<info>    [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Created container determined-container
<info>    [2021-12-07 10:41:13] d173154e || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-frank-snail: Started container determined-container
<info>    [2021-12-07 10:41:13] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 907.551689ms
<info>    [2021-12-07 10:41:14] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Created container determined-container
<info>    [2021-12-07 10:41:14] 5fabfb53 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sharp-foxhound: Started container determined-container
<>        [2021-12-07 10:41:15] d173154e || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:41:15] d173154e || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:41:15] d173154e || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:41:15] d173154e || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:41:15] d173154e || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:41:15] d173154e || + /bin/which python3
<>        [2021-12-07 10:41:15] d173154e || + '[' -z '' ']'
<>        [2021-12-07 10:41:15] d173154e || + '[' /root = / ']'
<>        [2021-12-07 10:41:15] d173154e || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:41:15] 5fabfb53 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:41:15] 5fabfb53 || + '[' -z '' ']'
<>        [2021-12-07 10:41:15] 5fabfb53 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:41:15] 5fabfb53 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:41:15] 5fabfb53 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:41:15] 5fabfb53 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:41:15] 5fabfb53 || + '[' /root = / ']'
<>        [2021-12-07 10:41:15] 5fabfb53 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:41:15] 5fabfb53 || + /bin/which python3
<warning> [2021-12-07 10:41:15] d173154e || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:41:16] d173154e || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:41:16] d173154e || + test -f startup-hook.sh
<>        [2021-12-07 10:41:16] d173154e || + python3 -m determined.exec.prep_container --rendezvous
<warning> [2021-12-07 10:41:16] 5fabfb53 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:41:16] 5fabfb53 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:41:17] 5fabfb53 || + test -f startup-hook.sh
<>        [2021-12-07 10:41:17] 5fabfb53 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:41:17] d173154e || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:41:17] 5fabfb53 || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:41:17] d173154e || INFO: New trial runner in (container d173154e-3575-4d2e-8a49-d50c77092e5a) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:41:17] 5fabfb53 || INFO: New trial runner in (container 5fabfb53-a425-43b6-8bf5-3ba30a33917d) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:41:19] 5fabfb53 || 2021-12-07 10:41:19.622471: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:19] 5fabfb53 || 2021-12-07 10:41:19.622532: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:41:21] 5fabfb53 || 2021-12-07 10:41:21.869342: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:21] 5fabfb53 || 2021-12-07 10:41:21.869397: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24,869:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24.998598: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:24] 5fabfb53 [rank=0] || 2021-12-07 10:41:24.998635: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25,019:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25.138037: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:25] d173154e [rank=1] || 2021-12-07 10:41:25.138069: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27,732:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733351: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733638: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733663: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.733701: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-sha): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.734901: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || 2021-12-07 10:41:27.736524: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27,740:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740664: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740919: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740947: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.740980: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.3-fra): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.742130: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:41:27] d173154e [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:41:27] d173154e [rank=1] || 2021-12-07 10:41:27.743424: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:41:27] d173154e [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:41:27] 5fabfb53 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:41:27] d173154e [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:41:27] d173154e [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:41:27] d173154e [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:41:28] d173154e [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:41:28] 5fabfb53 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:41:29] d173154e [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:41:29] d173154e [rank=1] || Wraped model in context
<>        [2021-12-07 10:41:29] d173154e [rank=1] || Model compiled
<>        [2021-12-07 10:41:29] d173154e [rank=1] || 2021-12-07 10:41:29,300:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Wraped model in context
<>        [2021-12-07 10:41:29] 5fabfb53 [rank=0] || Model compiled
<>        [2021-12-07 10:41:29] 5fabfb53 [rank=0] || 2021-12-07 10:41:29,375:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:41:29] 5fabfb53 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
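
Separate from the crash, the "shuffle=True ... will be ignored" warning just above also points at the input pipeline: with a tf.data.Dataset, shuffling has to be applied to the dataset itself. A minimal sketch of what that could look like in the trial's build_training_data_loader (load_dataset and SHUFFLE_BUFFER are placeholder names, not taken from the actual code):

# Hypothetical sketch: shuffle the tf.data pipeline itself instead of relying on
# the ignored shuffle=True flag. load_dataset() and SHUFFLE_BUFFER are placeholders.
def build_training_data_loader(self):
    dataset = load_dataset(TRAINING_FILENAMES)       # tf.data.Dataset of (image, label)
    dataset = dataset.shuffle(SHUFFLE_BUFFER)        # shuffle records before batching
    dataset = self.context.wrap_dataset(dataset)     # let Determined shard it across workers
    dataset = dataset.batch(self.context.get_per_slot_batch_size())
    return dataset.prefetch(tf.data.experimental.AUTOTUNE)
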
<>        [2021-12-07 10:41:31] d173154e [rank=1] || 2021-12-07 10:41:31.057916: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:41:31] d173154e [rank=1] || 2021-12-07 10:41:31.062159: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:41:31] d173154e [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     controller.run()
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:41:31] d173154e [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||
<>        [2021-12-07 10:41:31] d173154e [rank=1] || Function call stack:
<>        [2021-12-07 10:41:31] d173154e [rank=1] || train_function
<>        [2021-12-07 10:41:31] d173154e [rank=1] ||
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 2021-12-07 10:41:31.342307: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 2021-12-07 10:41:31.347290: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     controller.run()
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || Function call stack:
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || train_function
<>        [2021-12-07 10:41:31] 5fabfb53 [rank=0] || 
<>        [2021-12-07 10:41:32] 5fabfb53 || Process 1 exit with status code 1.
<>        [2021-12-07 10:41:32] 5fabfb53 || Terminating remaining workers after failure of Process 1.
<>        [2021-12-07 10:41:32] 5fabfb53 || [0]<stderr>:Terminated
<>        [2021-12-07 10:41:32] 5fabfb53 || Process 0 exit with status code 143.
<>        [2021-12-07 10:41:32] 5fabfb53 || Traceback (most recent call last):
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:41:32] 5fabfb53 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:41:32] 5fabfb53 ||     _run(args)
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:41:32] 5fabfb53 ||     return _run_static(args)
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:41:32] 5fabfb53 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:41:32] 5fabfb53 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:41:32] 5fabfb53 ||     gloo_run()
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:41:32] 5fabfb53 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:41:32] 5fabfb53 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:41:32] 5fabfb53 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:41:32] 5fabfb53 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:41:32] 5fabfb53 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:41:32] 5fabfb53 || Process name: 1
<>        [2021-12-07 10:41:32] 5fabfb53 || Exit code: 1
<info>    [2021-12-07 10:41:36] 5fabfb53 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:41:53] d173154e || INFO: container failed with non-zero exit code:  (exit code 137)
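
One more detail from the trial-runner config echoed above: "environment_variables": {"cpu": [], "gpu": []} is empty, so the MinIO credentials and endpoint only exist because model_def.py exports them at import time. That does run on every worker, but while debugging it may be cleaner to pass them through the experiment config instead of hard-coding them in the model code. A sketch of the relevant section of the experiment YAML; the S3_USE_HTTPS / S3_VERIFY_SSL entries are an assumption for an HTTP-only MinIO endpoint:

# Sketch only: provide the MinIO settings via the experiment config rather than
# hard-coding them in model_def.py. Values are the same placeholders as above.
environment:
  image: ramakrishna1592/flower-classification-determinedai:v1
  environment_variables:
    - AWS_ACCESS_KEY_ID=xxxxxxxxx
    - AWS_SECRET_ACCESS_KEY=xxxxxxxxxxx
    - AWS_REGION=us-east-1
    - S3_ENDPOINT=xx.x.xxx.xx:xxxxx
    - S3_USE_HTTPS=0
    - S3_VERIFY_SSL=0
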
<info>    [2021-12-07 10:41:54] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pod resources allocated.
<info>    [2021-12-07 10:41:54] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pod resources allocated.
<info>    [2021-12-07 10:41:55] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:55] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 901.4561ms
<info>    [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 890.545699ms
<info>    [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-init-container
<info>    [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-init-container
<info>    [2021-12-07 10:41:56] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-init-container
<info>    [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-init-container
<info>    [2021-12-07 10:41:56] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:41:57] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "fluent/fluent-bit:1.6" in 1.160234547s
<info>    [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-fluent-container
<info>    [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-fluent-container
<info>    [2021-12-07 10:41:58] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "fluent/fluent-bit:1.6" in 1.178977281s
<info>    [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-fluent-container
<info>    [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-fluent-container
<info>    [2021-12-07 10:41:58] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 873.101702ms
<info>    [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Created container determined-container
<info>    [2021-12-07 10:41:59] 6ed813d5 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-capital-lizard: Started container determined-container
<info>    [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 886.217585ms
<info>    [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Created container determined-container
<info>    [2021-12-07 10:41:59] 1d0cefc1 || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-maximum-escargot: Started container determined-container
<>        [2021-12-07 10:42:00] 6ed813d5 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:42:00] 6ed813d5 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:00] 6ed813d5 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:00] 6ed813d5 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:00] 6ed813d5 || + '[' -z '' ']'
<>        [2021-12-07 10:42:00] 6ed813d5 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:00] 6ed813d5 || + /bin/which python3
<>        [2021-12-07 10:42:00] 6ed813d5 || + '[' /root = / ']'
<>        [2021-12-07 10:42:00] 6ed813d5 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:42:01] 1d0cefc1 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:42:01] 1d0cefc1 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:01] 1d0cefc1 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:01] 1d0cefc1 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:01] 1d0cefc1 || + '[' -z '' ']'
<>        [2021-12-07 10:42:01] 1d0cefc1 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:01] 1d0cefc1 || + /bin/which python3
<>        [2021-12-07 10:42:01] 1d0cefc1 || + '[' /root = / ']'
<>        [2021-12-07 10:42:01] 1d0cefc1 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<warning> [2021-12-07 10:42:01] 6ed813d5 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:42:01] 6ed813d5 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:42:01] 1d0cefc1 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:42:02] 6ed813d5 || + test -f startup-hook.sh
<>        [2021-12-07 10:42:02] 6ed813d5 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:42:02] 1d0cefc1 || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:42:02] 1d0cefc1 || + test -f startup-hook.sh
<>        [2021-12-07 10:42:02] 1d0cefc1 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:42:02] 1d0cefc1 || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:42:02] 6ed813d5 || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:42:02] 1d0cefc1 || INFO: New trial runner in (container 1d0cefc1-5ce1-4fec-9171-4d6addc9d458) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:42:02] 6ed813d5 || INFO: New trial runner in (container 6ed813d5-5d25-4ce9-a9fc-d3edb66de7a1) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:42:04] 6ed813d5 || 2021-12-07 10:42:04.677820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:04] 6ed813d5 || 2021-12-07 10:42:04.677866: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:06] 6ed813d5 || 2021-12-07 10:42:06.730998: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:06] 6ed813d5 || 2021-12-07 10:42:06.731044: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09,844:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:42:09] 1d0cefc1 [rank=1] || 2021-12-07 10:42:09,940:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09.952059: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:09] 6ed813d5 [rank=0] || 2021-12-07 10:42:09.952094: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:10] 1d0cefc1 [rank=1] || 2021-12-07 10:42:10.055760: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:10] 1d0cefc1 [rank=1] || 2021-12-07 10:42:10.055787: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12,378:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379302: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379527: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379547: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.379580: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-cap): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.380470: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || 2021-12-07 10:42:12.383423: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12,385:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385478: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385740: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385760: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.385787: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.4-max): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.386672: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || 2021-12-07 10:42:12.388711: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:42:12] 1d0cefc1 [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:42:12] 6ed813d5 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Wraped model in context
<>        [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || Model compiled
<>        [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Wraped model in context
<>        [2021-12-07 10:42:13] 6ed813d5 [rank=0] || Model compiled
<>        [2021-12-07 10:42:13] 1d0cefc1 [rank=1] || 2021-12-07 10:42:13,946:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:42:13] 6ed813d5 [rank=0] || 2021-12-07 10:42:13,964:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:42:13] 6ed813d5 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 2021-12-07 10:42:15.575952: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 2021-12-07 10:42:15.579902: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     controller.run()
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || train_function
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || Function call stack:
<>        [2021-12-07 10:42:15] 1d0cefc1 [rank=1] || 
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 2021-12-07 10:42:15.741891: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 2021-12-07 10:42:15.745930: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     controller.run()
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || Function call stack:
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || train_function
<>        [2021-12-07 10:42:15] 6ed813d5 [rank=0] || 
<>        [2021-12-07 10:42:16] 6ed813d5 || Process 1 exit with status code 1.
<>        [2021-12-07 10:42:16] 6ed813d5 || Terminating remaining workers after failure of Process 1.
<>        [2021-12-07 10:42:16] 6ed813d5 || [0]<stderr>:Terminated
<>        [2021-12-07 10:42:16] 6ed813d5 || Process 0 exit with status code 143.
<>        [2021-12-07 10:42:16] 6ed813d5 || Traceback (most recent call last):
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:42:16] 6ed813d5 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:42:16] 6ed813d5 ||     _run(args)
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:42:16] 6ed813d5 ||     return _run_static(args)
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:42:16] 6ed813d5 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:42:16] 6ed813d5 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:42:16] 6ed813d5 ||     gloo_run()
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:42:16] 6ed813d5 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:42:16] 6ed813d5 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:42:16] 6ed813d5 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:42:16] 6ed813d5 || Process name: 1
<>        [2021-12-07 10:42:16] 6ed813d5 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:42:16] 6ed813d5 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:42:16] 6ed813d5 || Exit code: 1
<info>    [2021-12-07 10:42:20] 6ed813d5 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:42:36] 1d0cefc1 || INFO: rpc error: code = Unknown desc = Error: No such container: 8c37c94994ab83ed1ae13fbba12b7ec578361f0db5da1a5ea49e91dd205bbc4b
<info>    [2021-12-07 10:42:37] 1d0cefc1 || INFO: container failed with non-zero exit code:  (exit code 137)
<info>    [2021-12-07 10:42:38] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pod resources allocated.
<info>    [2021-12-07 10:42:38] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pod resources allocated.
<info>    [2021-12-07 10:42:39] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:42:39] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 876.439508ms
<info>    [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 871.463514ms
<info>    [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-init-container
<info>    [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-init-container
<info>    [2021-12-07 10:42:40] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-init-container
<info>    [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-init-container
<info>    [2021-12-07 10:42:40] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:42:41] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "fluent/fluent-bit:1.6"
<info>    [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "fluent/fluent-bit:1.6" in 1.177963649s
<info>    [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-fluent-container
<info>    [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-fluent-container
<info>    [2021-12-07 10:42:42] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "fluent/fluent-bit:1.6" in 1.189195924s
<info>    [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-fluent-container
<info>    [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-fluent-container
<info>    [2021-12-07 10:42:42] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Pulling image "ramakrishna1592/flower-classification-determinedai:v1"
<info>    [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 884.846614ms
<info>    [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Created container determined-container
<info>    [2021-12-07 10:42:43] 087c0ee7 || INFO: Pod exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ready-duckling: Started container determined-container
<info>    [2021-12-07 10:42:43] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Successfully pulled image "ramakrishna1592/flower-classification-determinedai:v1" in 900.212087ms
<info>    [2021-12-07 10:42:43] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Created container determined-container
<info>    [2021-12-07 10:42:44] ef45fdce || INFO: Pod exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-adapted-perch: Started container determined-container
<>        [2021-12-07 10:42:44] 087c0ee7 || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:42:44] 087c0ee7 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:44] 087c0ee7 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:44] 087c0ee7 || + '[' -z '' ']'
<>        [2021-12-07 10:42:44] 087c0ee7 || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:44] 087c0ee7 || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:44] 087c0ee7 || + '[' /root = / ']'
<>        [2021-12-07 10:42:44] 087c0ee7 || + /bin/which python3
<>        [2021-12-07 10:42:44] 087c0ee7 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:42:45] ef45fdce || + STARTUP_HOOK=startup-hook.sh
<>        [2021-12-07 10:42:45] ef45fdce || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:45] ef45fdce || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
<>        [2021-12-07 10:42:45] ef45fdce || + DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:45] ef45fdce || + '[' -z '' ']'
<>        [2021-12-07 10:42:45] ef45fdce || + export DET_PYTHON_EXECUTABLE=python3
<>        [2021-12-07 10:42:45] ef45fdce || + /bin/which python3
<>        [2021-12-07 10:42:45] ef45fdce || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.17.3-py3-none-any.whl
<>        [2021-12-07 10:42:45] ef45fdce || + '[' /root = / ']'
<warning> [2021-12-07 10:42:45] 087c0ee7 || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:42:45] 087c0ee7 || + python3 -m determined.exec.prep_container --trial --resources
<warning> [2021-12-07 10:42:45] ef45fdce || WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<>        [2021-12-07 10:42:46] 087c0ee7 || + test -f startup-hook.sh
<>        [2021-12-07 10:42:46] 087c0ee7 || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:42:46] ef45fdce || + python3 -m determined.exec.prep_container --trial --resources
<>        [2021-12-07 10:42:46] ef45fdce || + test -f startup-hook.sh
<>        [2021-12-07 10:42:46] ef45fdce || + python3 -m determined.exec.prep_container --rendezvous
<>        [2021-12-07 10:42:46] 087c0ee7 || + exec python3 -m determined.exec.launch_autohorovod
<>        [2021-12-07 10:42:46] ef45fdce || + exec python3 -m determined.exec.launch_autohorovod
<info>    [2021-12-07 10:42:46] 087c0ee7 || INFO: New trial runner in (container 087c0ee7-97fa-47da-a4d2-70db4437a561) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<info>    [2021-12-07 10:42:46] ef45fdce || INFO: New trial runner in (container ef45fdce-eb10-4fc0-991b-8df1e20d41d4) on agent k8agent: {"bind_mounts": [], "checkpoint_policy": "best", "checkpoint_storage": {"host_path": "/checkpoints", "propagation": "rprivate", "save_experiment_best": 0, "save_trial_best": 1, "save_trial_latest": 1, "storage_path": null, "type": "shared_fs"}, "data_layer": {"container_storage_path": null, "host_storage_path": null, "type": "shared_fs"}, "data": {}, "debug": false, "description": null, "entrypoint": "model_def2:FlowerClassificationTrial", "environment": {"image": {"cpu": "ramakrishna1592/flower-classification-determinedai:v1", "gpu": "ramakrishna1592/flower-classification-determinedai:v1"}, "environment_variables": {"cpu": [], "gpu": []}, "ports": {"trial": 1734}, "registry_auth": null, "force_pull_image": false, "pod_spec": {"metadata": {"creationTimestamp": null}, "spec": {"containers": null}, "status": {}}, "add_capabilities": [], "drop_capabilities": []}, "hyperparameters": {"dense1": {"type": "const", "val": 128}, "global_batch_size": {"type": "const", "val": 256}}, "labels": [], "max_restarts": 5, "min_checkpoint_period": {"batches": 0}, "min_validation_period": {"batches": 0}, "name": "flower-classification", "optimizations": {"aggregation_frequency": 1, "average_aggregated_gradients": true, "average_training_metrics": false, "gradient_compression": false, "grad_updates_size_file": null, "mixed_precision": "O0", "tensor_fusion_threshold": 64, "tensor_fusion_cycle_time": 5, "auto_tune_tensor_fusion": false}, "perform_initial_validation": false, "profiling": {"enabled": false, "begin_on_batch": 0, "end_after_batch": null, "sync_timings": true}, "records_per_epoch": 60000, "reproducibility": {"experiment_seed": 1638873543}, "resources": {"max_slots": null, "slots_per_trial": 2, "weight": 1, "native_parallel": false, "shm_size": null, "agent_label": "", "resource_pool": "", "priority": null, "devices": []}, "scheduling_unit": 100, "searcher": {"max_length": {"epochs": 5}, "metric": "val_accuracy", "name": "single", "smaller_is_better": false, "source_checkpoint_uuid": null, "source_trial_id": null}}
<>        [2021-12-07 10:42:49] 087c0ee7 || 2021-12-07 10:42:49.029033: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:49] 087c0ee7 || 2021-12-07 10:42:49.029086: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:51] 087c0ee7 || 2021-12-07 10:42:51.287981: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:51] 087c0ee7 || 2021-12-07 10:42:51.288043: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54,280:INFO [56]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54,343:INFO [207]: Loading Trial implementation with entrypoint model_def2:FlowerClassificationTrial.
<>        [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54.393298: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:54] ef45fdce [rank=1] || 2021-12-07 10:42:54.393329: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54.453085: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:54] 087c0ee7 [rank=0] || 2021-12-07 10:42:54.453121: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56,847:INFO [56]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847483: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847812: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847836: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.847870: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-1-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-ada): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.848859: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || 2021-12-07 10:42:56.850466: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56,868:INFO [207]: Creating TFKerasTrialController with FlowerClassificationTrial.
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.868944: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869343: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869393: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.869467: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (exp-20-trial-20-0-20.a75316de-82f1-4b56-bd9b-8175ce8cd6e0.5-rea): /proc/driver/nvidia/version does not exist
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.871373: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<>        [2021-12-07 10:42:56] 087c0ee7 [rank=0] || 2021-12-07 10:42:56.874729: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:42:56] ef45fdce [rank=1] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ParallelMapDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ParallelMapDataset'>
<>        [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <ShardDataset shapes: ((512, 512, 3), ()), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.ShardDataset'>
<>        [2021-12-07 10:42:57] 087c0ee7 [rank=0] || <PrefetchDataset shapes: ((None, 512, 512, 3), (None,)), types: (tf.uint8, tf.int32)> <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<>        [2021-12-07 10:42:57] ef45fdce [rank=1] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:42:57] 087c0ee7 [rank=0] || Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
<>        [2021-12-07 10:42:58] ef45fdce [rank=1] || Sequential with layers obj made
<>        [2021-12-07 10:42:58] ef45fdce [rank=1] || Wraped model in context
<>        [2021-12-07 10:42:58] ef45fdce [rank=1] || Model compiled
<>        [2021-12-07 10:42:58] ef45fdce [rank=1] || 2021-12-07 10:42:58,296:WARNING [56]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Sequential with layers obj made
<>        [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Wraped model in context
<>        [2021-12-07 10:42:58] 087c0ee7 [rank=0] || Model compiled
<>        [2021-12-07 10:42:58] 087c0ee7 [rank=0] || 2021-12-07 10:42:58,413:WARNING [207]: You set shuffle=True for a tf.data.Dataset, which will be ignored. Please call .shuffle() on your dataset instead.
<>        [2021-12-07 10:42:58] 087c0ee7 [rank=0] || total batches trained: 0, workload 0% complete (0/100)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || 2021-12-07 10:43:00.080249: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || 2021-12-07 10:43:00.084415: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || Traceback (most recent call last):
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     exec(code, run_globals)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     controller.run()
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     self._launch_fit()
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     self.model.fit(
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     return graph_function._call_flat(
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     outputs = execute.execute(
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || train_function
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || 
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || Function call stack:
<>        [2021-12-07 10:43:00] ef45fdce [rank=1] || 
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 2021-12-07 10:43:00.214831: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 2021-12-07 10:43:00.219214: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2100000000 Hz
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || Traceback (most recent call last):
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     exec(code, run_globals)
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return _run_code(code, main_globals, None,
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 136, in <module>
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     sys.exit(main(args.chief_ip))
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 645, in run
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/exec/harness.py", line 127, in main
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     controller.run()
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     self._launch_fit()
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py", line 680, in _launch_fit
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     self.model.fit(
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1095, in fit
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     tmp_logs = self.train_function(iterator)
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     result = self._call(*args, **kwds)
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return self._stateless_fn(*args, **kwds)
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return graph_function._call_flat(
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     return self._build_call_outputs(self._inference_function.call(
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     outputs = execute.execute(
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||   File "/opt/conda/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || tensorflow.python.framework.errors_impl.DataLossError:  corrupted record at 0
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] ||     [[node IteratorGetNext (defined at run/determined/pythonuserbase/lib/python3.8/site-packages/determined/keras/_tf_keras_trial.py:680) ]] [Op:__inference_train_function_1474]
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || Function call stack:
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || 
<>        [2021-12-07 10:43:00] 087c0ee7 [rank=0] || train_function
<>        [2021-12-07 10:43:01] 087c0ee7 || Process 1 exit with status code 1.
<>        [2021-12-07 10:43:01] 087c0ee7 || Terminating remaining workers after failure of Process 1.
<>        [2021-12-07 10:43:01] 087c0ee7 || [0]<stderr>:Terminated
<>        [2021-12-07 10:43:01] 087c0ee7 || Process 0 exit with status code 143.
<>        [2021-12-07 10:43:01] 087c0ee7 || Traceback (most recent call last):
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/bin/horovodrun", line 8, in <module>
<>        [2021-12-07 10:43:01] 087c0ee7 ||     sys.exit(run_commandline())
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 770, in run_commandline
<>        [2021-12-07 10:43:01] 087c0ee7 ||     _run(args)
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 760, in _run
<>        [2021-12-07 10:43:01] 087c0ee7 ||     return _run_static(args)
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 617, in _run_static
<>        [2021-12-07 10:43:01] 087c0ee7 ||     _launch_job(args, settings, nics, command)
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 730, in _launch_job
<>        [2021-12-07 10:43:01] 087c0ee7 ||     run_controller(args.use_gloo, gloo_run_fn,
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 706, in run_controller
<>        [2021-12-07 10:43:01] 087c0ee7 ||     gloo_run()
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/launch.py", line 722, in gloo_run_fn
<>        [2021-12-07 10:43:01] 087c0ee7 ||     gloo_run(settings, nics, env, driver_ip, command)
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 298, in gloo_run
<>        [2021-12-07 10:43:01] 087c0ee7 ||   File "/opt/conda/lib/python3.8/site-packages/horovod/runner/gloo_run.py", line 282, in launch_gloo
<>        [2021-12-07 10:43:01] 087c0ee7 ||     launch_gloo(command, exec_command, settings, nics, env, server_ip)
<>        [2021-12-07 10:43:01] 087c0ee7 ||     raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
<>        [2021-12-07 10:43:01] 087c0ee7 || RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
<>        [2021-12-07 10:43:01] 087c0ee7 || Exit code: 1
<>        [2021-12-07 10:43:01] 087c0ee7 || Process name: 1
<info>    [2021-12-07 10:43:05] 087c0ee7 || INFO: container failed with non-zero exit code:  (exit code 1)
<info>    [2021-12-07 10:43:21] ef45fdce || INFO: rpc error: code = Unknown desc = Error: No such container: 4def014024e5fa3d7cf76695ce39c4d6821b2efe6daf98a2bcdd2ee1fc8d5cc0
<info>    [2021-12-07 10:43:22] ef45fdce || INFO: container failed with non-zero exit code:  (exit code 137)
vishnu2kmohan commented 2 years ago

Hi @ramakrishnamamidi, as suggested on the Determined Community Slack thread, please first verify that the code works as a single-GPU experiment before you try distributed training, as outlined in the debugging guide: https://docs.determined.ai/latest/training-debug/index.html
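
Since both ranks fail with tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0, it is also worth scanning each TFRecord file directly before launching a multi-worker run, so you can tell a bad upload apart from a training-code problem. Below is a minimal sketch (not Determined-specific; it assumes the process can reach the same S3 bucket and credentials the trial uses, and the file list is a placeholder) that iterates every record and reports which files raise DataLossError:

import tensorflow as tf

# Placeholder list; fill in with the actual s3:// TFRecord paths the trial reads.
TFRECORD_PATHS = [
    "s3://<bucket>/<prefix>/train/00.tfrec",
]

def scan(paths):
    bad = []
    for path in paths:
        count = 0
        try:
            # Iterate every serialized example; a truncated or corrupt file
            # raises tf.errors.DataLossError partway through (or at record 0).
            for _ in tf.data.TFRecordDataset(path):
                count += 1
            print(f"OK   {path}: {count} records")
        except tf.errors.DataLossError as err:
            print(f"BAD  {path}: {err}")
            bad.append(path)
    return bad

if __name__ == "__main__":
    corrupted = scan(TFRECORD_PATHS)
    if corrupted:
        print("Re-upload these files before re-running the experiment:", corrupted)

Any file the script flags would need to be re-uploaded (or re-generated) before the distributed experiment can get past its first batch.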

ramakrishnamamidi commented 2 years ago

Hi @vishnu2kmohan, I have deployed Determined to use CPUs only. My values.yaml file is as follows:

# The image registry to use. Defaults to the determinedai repository in DockerHub.
imageRegistry: determinedai

# Install Determined enterprise edition.
enterpriseEdition: false

# Should be configured if using the master image in the Determined enterprise edition
# or private registry.
imagePullSecretName:

# masterPort configures the port at which the Determined master listens for connections on.
masterPort: 8080

# When useNodePortForMaster is set to false (default), a LoadBalancer service is deployed to make
# the Determined master reachable from outside the cluster. When useNodePortForMaster is set to
# true, the master will instead be exposed behind a NodePort service. When using a NodePort service
# users will typically have to configure an Ingress to make the Determined master reachable from
# outside the cluster. NodePort service is recommended when configuring TLS termination in a
# load-balancer.
useNodePortForMaster: false

# tlsSecret enables TLS encryption for all communication made to the Determined master (TLS
# termination is performed in the Determined master). This includes communication between the
# Determined master and the task containers it launches, but does not include communication between
# the task containers (distributed training). The specified Secret of type tls must already exist in
# the same namespace in which Determined is being installed.
# tlsSecret:

# db sets the configurations for the database.
db:
  # To deploy your own Postgres DB, provide a hostAddress. If hostAddress is provided, Determined
  # will skip deploying a Postgres DB.
  # hostAddress:

  # Required parameters, whether you are using your own DB or a Determined DB.
  name: determined
  user: postgres
  password: postgres
  port: 5432

  # Only used for Determined DB deployment. Configures the size of the PersistentVolumeClaim for the
  # Determined deployed database, as well as the CPU and memory requirements. Should be adjusted for
  # scale.
  storageSize: 30Gi
  cpuRequest: 2
  memRequest: 8Gi

  # useNodePortForDB configures whether ClusterIP or NodePort service type is used for the
  # Determined deployed DB. By default ClusterIP is used.
  useNodePortForDB: false

  # storageClassName configures the StorageClass used by the PersistentVolumeClaim for the
  # Determined deployed database. This can be left blank if a default storage class is specified in
  # the cluster. If dynamic provisioning of PersistentVolumes is disabled, users must manually
  # create a PersistentVolume that will match the PersistentVolumeClaim.
  # storageClassName:

# checkpointStorage controls where checkpoints are stored. Supported types include `shared_fs`,
# `gcs`, and `s3`.
checkpointStorage:
  # Applicable to all checkpointStorage types.
  saveExperimentBest: 0
  saveTrialBest: 1
  saveTrialLatest: 1

  # Comment out if not using `shared_fs`. Users are strongly discouraged from using `shared_fs` for
  # storage beyond initial testing as most Kubernetes cluster nodes do not have a shared file
  # system.
  type: shared_fs
  hostPath: /checkpoints

  # For storing in GCS.
  # type: gcs
  # bucket: <bucket_name>

  # For storing in S3.
  # type: s3
  # bucket: <bucket_name>
  # accessKey: <access_key>
  # secretKey: <secret_key>
  # endpointUrl: <endpoint_url>

  # For storing in Azure Blob Storage with a connection string.
  # Do NOT use if already using Azure Blob Storage with account URL
  # type: azure
  # container: <container_name>
  # connection_string: <connection_string>

  # For storing in Azure Blob Storage with an account URL.
  # Do NOT use if already using Azure Blob Storage with connection string.
  # The `credential` field is optional.
  # type: azure
  # container: <container_name>
  # account_url: <account_url>
  # credential: <credential>

# This is the number of GPUs there are per machine. Determined uses this information when scheduling
# multi-GPU tasks. Each multi-GPU (distributed training) task will be scheduled as a set of
# `slotsPerTask / maxSlotsPerPod` separate pods, with each pod assigned up to `maxSlotsPerPod` GPUs.
# Distributed tasks with sizes that are not divisible by `maxSlotsPerPod` are never scheduled. If
# you have a cluster of different size nodes (e.g., 4 and 8 GPUs per node), set `maxSlotsPerPod` to
# the greatest common divisor of all the sizes (4, in that case).
maxSlotsPerPod: 1

## For CPU-only clusters, use `slotType: cpu`, and make sure to set `slotResourceRequest` below.
# slotType: cpu
# slotResourceRequests:
  ## Number of cpu units requested for compute slots. Note: since kubernetes may schedule some
  ## system tasks on the nodes which take up some resources, 8-core node may not always fit
  ## a `cpu: 8` task container.
  # cpu: 7
slotType: cpu
slotResourceRequests:
  cpu: 4

# Memory and CPU requirements for the master instance. Should be adjusted for scale.
masterCpuRequest: 2
masterMemRequest: 8Gi

## Configure the task container defaults. Tasks include trials, commands, TensorBoards, notebooks,
## and shells. For all task containers, shm_size_bytes and network_mode are configurable. For
## trials, the network interface used by distributed (multi-machine) training and ports used by the
## NCCL and GLOO libraries during distributed training are configurable. These default to
## auto-discovery and random non-privileged ports, respectively.
taskContainerDefaults:
  # networkMode: bridge
  # dtrainNetworkInterface: <network interface name>
  # ncclPortRange: <MIN:MAX>
  # glooPortRange: <MIN:MAX>
  # forcePullImage: <true or false>

  # Configure a default pod spec for all GPU tasks (experiments, notebooks, commands) and CPU tasks
  # (CPU notebooks, TensorBoards, zero-slot commands). If a pod spec is defined for an individual
  # task, that pod spec will replace the default one that is defined here. See
  # https://docs.determined.ai/latest/topic-guides/custom-pod-specs.html for more details.
  # cpuPodSpec:
  # gpuPodSpec:

  # Configure default Docker images for all GPU tasks (experiments, notebooks, commands) and
  # CPU tasks (CPU notebooks, TensorBoards, zero-slot commands). If a Docker image is defined
  # for an individual task, that image will replace the default one that is defined here.
  # If specifying a default image, both GPU and CPU default images must be defined.
  # cpuImage:
  # gpuImage:

## Configure whether we collect anonymous information about the usage of Determined.
telemetry:
  enabled: true

## A user-friendly name to identify this cluster by.
# clusterName: Dev

# defaultPassword sets the password for the admin and determined user accounts.
# defaultPassword:

## Configure how trial logs are stored.
# logging:
  ## The backend to use. Can be `default` to send logs to the master to store in the PostgreSQL
  ## database or `elastic` to store logs in an Elasticsearch cluster (without going through the
  ## master).
  # type: default

  ## The remaining options should be provided only for the `elastic` backend.

  ## The host and port to use to connect to the Elasticsearch cluster.
  # host: <host>
  # port: <port>

  ## Authentication and TLS options for making the connection to Elasticsearch.
  # security:
    # username: <username>
    # password: <password>
    # tls:
      # enabled: true
      # skipVerify: false

      ## The name to use when verifying the certificate, if different from the name used to connect.
      # certificateName: <name>

      ## This value must contain the contents of the certificate file, not a path. It may be set
      ## directly or using `helm install --set-file logging.security.tls.certificate=<path>`.
      # certificate: <certificate contents>

## Configure the default Determined scheduler
## Currently supports "coscheduler" for gang scheduling and "preemption" for priority based
## scheduling with preemption
# defaultScheduler: preemption

I want to use slots_per_trial to control the number of pods and measure how that affects training time for a DL model.
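
If I understand the maxSlotsPerPod comment in the chart above correctly, a trial asking for slots_per_trial: N is scheduled as N / maxSlotsPerPod pods, and sizes not divisible by maxSlotsPerPod are never scheduled. With maxSlotsPerPod: 1 and slotType: cpu, the slots_per_trial: 2 experiment above should therefore run as two pods, each requesting the configured slotResourceRequests (cpu: 4). A tiny illustrative helper (just arithmetic, not a Determined API) for sanity-checking a planned setting:

def pods_for_trial(slots_per_trial: int, max_slots_per_pod: int) -> int:
    # Mirrors the scheduling rule described in the chart comment:
    # slots_per_trial / maxSlotsPerPod pods, only when evenly divisible.
    if slots_per_trial % max_slots_per_pod != 0:
        raise ValueError("slots_per_trial must be divisible by maxSlotsPerPod; "
                         "otherwise the task is never scheduled")
    return slots_per_trial // max_slots_per_pod

print(pods_for_trial(slots_per_trial=2, max_slots_per_pod=1))  # -> 2 pods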

vishnu2kmohan commented 2 years ago

Hi @ramakrishnamamidi, is it safe to close this issue now that it has been resolved by re-uploading your dataset?

vishnu2kmohan commented 2 years ago

Please reopen if necessary.