Caching Bottlenecks - Githubissues

PaulWoitaschek commented 5 years ago

Thanks for your great blog post on transfer learning using the Estimator API! It really helped me to understand how to use it.

The example from TensorFlow Hub, retrain.py, does one thing differently: it first computes the bottlenecks for each image and caches them as text files, specific to the module_spec they were run with (see cache_bottlenecks).

It's really great because creating these bottlenecks just has to be done once. Subsequent trainings are then way quicker and you can quickly experiment as only the actual training has to be done.
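As a rough illustration of that idea, here is a minimal sketch in plain Python, with no TensorFlow involved. The names `cached_bottleneck` and `fake_extract` and the module name `some_module` are hypothetical stand-ins, not taken from retrain.py; `fake_extract` stands in for the expensive forward pass through the module.

```python
import os
import tempfile


def cached_bottleneck(image_path, cache_dir, module_name, extract_fn):
    """Return the bottleneck for image_path, computing it only on a cache miss."""
    # One sub-directory per module, because bottlenecks from different modules differ.
    cache_file = os.path.join(cache_dir, module_name,
                              os.path.basename(image_path) + '.txt')
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return [float(x) for x in f.read().split(',')]

    values = extract_fn(image_path)  # the expensive step, done once per image
    os.makedirs(os.path.dirname(cache_file), exist_ok=True)
    with open(cache_file, 'w') as f:
        f.write(','.join(str(x) for x in values))
    return values


calls = []


def fake_extract(image_path):
    # Hypothetical stand-in for running the image through the module.
    calls.append(image_path)
    return [0.25, 0.5, 0.75]


cache_dir = tempfile.mkdtemp()
first = cached_bottleneck('cats/1.jpg', cache_dir, 'some_module', fake_extract)
second = cached_bottleneck('cats/1.jpg', cache_dir, 'some_module', fake_extract)
print(len(calls))
```

The second call reads the text file instead of calling `fake_extract` again, which is where the speed-up on later training runs would come from.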

Could you show an example of how to do that using the high-level functions you are using?

damienpontifex commented 5 years ago

@PaulWoitaschek I wanted to do this and went for the route of getting it working first.

Thanks for bringing this up, as it might kick me back into looking at it again.

PaulWoitaschek commented 5 years ago

Great! Can you tell me what your approach would be? I'm trying to implement it but unfortunately I only understand half of what I'm doing.

damienpontifex commented 5 years ago

@PaulWoitaschek need some more time to actually evaluate my current work. But here's what I'm thinking/doing:

1. Use sess.run to get the float values of the final layer of the ResNet model and persist these alongside their labels.
```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# data_dir (a pathlib.Path), make_input_fn, display and progress are assumed
# to be defined elsewhere (e.g. earlier in the notebook).

def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Split the training images into 10 shards, one TFRecord file per shard
all_train_files = list(map(str, (data_dir / 'train').glob('*/*.jpg')))
shards = np.array_split(np.array(all_train_files), 10)

with tf.Graph().as_default():
    module = hub.Module('https://tfhub.dev/google/imagenet/resnet_v2_50/feature_vector/1', trainable=False, name='resnet_v2_50')

    input_size = hub.get_expected_image_size(module)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        for index, shard in enumerate(shards):
            print(f'Processing shard: {index}')
            batch_size = 16
            sample_index = 0

            imgs, labels = make_input_fn(shard, num_epochs=1, batch_size=batch_size, image_size=input_size)().make_one_shot_iterator().get_next()
            bottlenecks = module(imgs['image'])

            with tf.python_io.TFRecordWriter(str(data_dir / f'train/bottleneck-{index}.tfrecord')) as writer:
                p = display(progress(0, len(shard)), display_id=True)

                while True:
                    try:
                        img_batch, label_batch = sess.run([bottlenecks, labels])

                        # Write one Example per image: bottleneck vector plus label
                        for bottleneck, label in zip(img_batch, label_batch):
                            example = tf.train.Example(features=tf.train.Features(feature={
                                'bottleneck': _float_feature(bottleneck),
                                'label': _bytes_feature(label)
                            }))
                            writer.write(example.SerializeToString())

                        sample_index += batch_size
                        p.update(progress(sample_index, len(shard)))

                    except tf.errors.OutOfRangeError:
                        # Thrown when tf.data has completed
                        break
```

2. Change the retraining to use these TFRecord files as the input. Each sample is now a vector of 2048 float values.
```python
def bottleneck_data_fn(file_pattern, shuffle=False, batch_size=64, num_epochs=None):
  def _input_fn():
    dataset = tf.contrib.data.make_batched_features_dataset(
        file_pattern,
        batch_size,
        {
            'bottleneck': tf.FixedLenFeature([2048], tf.float32),
            'label': tf.FixedLenFeature([], tf.string)
        },
        shuffle=shuffle,
        num_epochs=num_epochs
    )

    transformed_features = dataset.make_one_shot_iterator().get_next()

    transformed_labels = transformed_features.pop('label')

    return transformed_features, transformed_labels

  return _input_fn

def bottleneck_model_fn(features, labels, mode, params):
    is_training = mode == tf.estimator.ModeKeys.TRAIN

    NUM_CLASSES = len(params['label_vocab'])

    bottleneck_tensor = features['bottleneck']

    with tf.name_scope('final_retrain_ops'):
        logits = tf.layers.dense(bottleneck_tensor, units=1, trainable=is_training)

    def train_op_fn(loss):
        optimizer = tf.train.AdamOptimizer(learning_rate=params['learning_rate'])
        return optimizer.minimize(loss, global_step=tf.train.get_global_step())

    if NUM_CLASSES == 2:
        head = tf.contrib.estimator.binary_classification_head(label_vocabulary=params['label_vocab'])
    else:
        head = tf.contrib.estimator.multi_class_head(n_classes=NUM_CLASSES, label_vocabulary=params['label_vocab'])

    return head.create_estimator_spec(
        features, mode, logits, labels, train_op_fn=train_op_fn
    )

def train(_):

    run_config = tf.estimator.RunConfig()

    _dir = get_data(data_directory, run_config.is_chief)

    params = {
        'learning_rate': 1e-3,
        'label_vocab': ['dogs', 'cats']
    }

    classifier = tf.estimator.Estimator(
        model_fn=bottleneck_model_fn,
        model_dir=str(model_directory),
        config=run_config,
        params=params
    )

    train_input_fn = bottleneck_data_fn(str(data_dir / 'train/bottleneck-*.tfrecord'), shuffle=True, num_epochs=1)
    classifier.train(train_input_fn, max_steps=2000)

train(None)
```

Seems to be working now. This doesn't allow data augmentation or fine tuning the module easily.

The next step is to merge the module graph with the minimal retraining graph here to form the entire graph again for fine tuning and exporting. I will continue to look into it this week and clean up the code so I understand it better :)

I realise this was a bit of a 'dump', but I wanted to get some progress down and see whether you had any ideas or whether this helps you get started.

PaulWoitaschek commented 5 years ago

Thanks!

I tried a slightly different approach. I tried to follow the TensorFlow tutorial and saved the bottlenecks like this:

```python
import glob
import os

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# BOTTLENECK_DIR, MODULE_URL and IMAGE_DIR are constants defined elsewhere.

def add_jpeg_decoding(module_spec):
    input_height, input_width = hub.get_expected_image_size(module_spec)
    input_depth = hub.get_num_image_channels(module_spec)
    jpeg_data = tf.placeholder(tf.string, name='DecodeJPGInput')
    decoded_image = tf.image.decode_jpeg(jpeg_data, channels=input_depth)
    # Convert from full range of uint8 to range [0,1] of float32.
    decoded_image_as_float = tf.image.convert_image_dtype(decoded_image,
                                                          tf.float32)
    decoded_image_4d = tf.expand_dims(decoded_image_as_float, 0)
    resize_shape = tf.stack([input_height, input_width])
    resize_shape_as_int = tf.cast(resize_shape, dtype=tf.int32)
    resized_image = tf.image.resize_bilinear(decoded_image_4d, resize_shape_as_int)
    return jpeg_data, resized_image

def bottleneck_file_path(image_path):
    # One cache directory per module URL, one sub-directory per label
    bottleneck_dir = os.path.join(BOTTLENECK_DIR, MODULE_URL.replace('://', '_').replace('/', '_').replace(':', '_'))
    label = os.path.basename(os.path.dirname(image_path))
    label_dir = os.path.join(bottleneck_dir, label)
    return os.path.join(label_dir, os.path.basename(image_path) + '.txt')

def cache_bottlenecks(params):
    all_train_files = glob.glob(IMAGE_DIR + '/*/*')
    image_count = len(all_train_files)

    with tf.Graph().as_default():
        module = hub.Module(params['module_spec'])
        # get_expected_image_size returns (height, width)
        height, width = hub.get_expected_image_size(module)
        resized_input_tensor = tf.placeholder(tf.float32, [None, height, width, 3])
        bottleneck_tensor = module(resized_input_tensor)

        jpeg_data_tensor, decoded_image_tensor = add_jpeg_decoding(module)

        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())

            for index, image in enumerate(all_train_files):
                if index % 250 == 0:
                    print(str(index) + '/' + str(image_count) + ' bottlenecks created')
                output_file = bottleneck_file_path(image)

                # Skip images whose bottleneck is already cached
                if os.path.exists(output_file):
                    continue

                image_data = tf.gfile.FastGFile(image, 'rb').read()
                resized_input_values = sess.run(decoded_image_tensor,
                                                {jpeg_data_tensor: image_data})
                bottleneck_values = sess.run(bottleneck_tensor,
                                             {resized_input_tensor: resized_input_values})
                bottleneck_values = np.squeeze(bottleneck_values)

                # Stands in for the ensure_dir_exists helper, which is not shown here
                os.makedirs(os.path.dirname(output_file), exist_ok=True)
                bottleneck_string = ','.join(str(x) for x in bottleneck_values)
                with open(output_file, 'w') as bottleneck_file:
                    bottleneck_file.write(bottleneck_string)
```

Now I try to read these values from within the dataset:

```python
def make_input_fn(files, image_size, img_channels, shuffle=False, batch_size=64, num_epochs=None, buffer_size=4096):
    bottleneck_paths = list(map(bottleneck_file_path, files))

    def _path_to_img(path):
        # Get the parent folder of this file to get its class name
        label = tf.string_split([path], delimiter='/').values[-2]

        file = tf.read_file(path)
        bottleneck_values = tf.string_split([file], delimiter=',').values
        bottleneck_values = tf.string_to_number(bottleneck_values, tf.float32)
        bottleneck_values = tf.reshape(bottleneck_values, [image_size[0], image_size[1], img_channels])

        return {'image': bottleneck_values}, label

    def _input_fn():
        dataset = tf.data.Dataset.from_tensor_slices(bottleneck_paths)

        if shuffle:
            dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size, num_epochs))
        else:
            dataset = dataset.repeat(num_epochs)

        dataset = dataset.map(_path_to_img, num_parallel_calls=os.cpu_count())
        dataset = dataset.batch(batch_size).prefetch(buffer_size)

        return dataset

    return _input_fn
```

However that fails after some time:

```
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 1280 values, but the requested shape has 150528
     [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/device:CPU:0"](StringToNumber, Reshape/shape)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,224,224,3], [?]], output_types=[DT_FLOAT, DT_STRING], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]
     [[Node: IteratorGetNext/_67 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_272_IteratorGetNext", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
```

Do you understand why that happens?

damienpontifex commented 5 years ago

Hmm, I haven't run your code yet, but I usually see this type of error when either:

- The dtype isn't correct, so I think it reads all the bytes into a type of the wrong bit width.
- A set_shape or reshape operation doesn't take the batch dimension into consideration (although the numbers given aren't multiples of each other, so maybe not).

Sorry, I can't provide more insight than this.

PaulWoitaschek commented 5 years ago

Thanks. Could you try the actual code? I've spent far too much time on this, stumbling in the dark :/
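One possible reading of the traceback, offered as a guess rather than a confirmed diagnosis: 150528 is 224 * 224 * 3, the size of a decoded input image (the `[?,224,224,3]` output shape in the error), while 1280 looks like the length of one cached bottleneck vector. If that is right, the reshape in `_path_to_img` still asks for the image shape even though the text files now hold bottlenecks. The sketch below reproduces the mismatch in plain Python; `parse_bottleneck` and the sample values are hypothetical and not part of the code in this thread.

```python
from math import prod


def parse_bottleneck(text, expected_shape):
    """Parse a comma-separated bottleneck string and check its length."""
    values = [float(x) for x in text.split(',')]
    expected = prod(expected_shape)
    if len(values) != expected:
        raise ValueError(
            f'got {len(values)} values, but shape {expected_shape} needs {expected}')
    return values


image_shape = (224, 224, 3)   # shape requested by the failing reshape
bottleneck_shape = (1280,)    # assumed length of one cached bottleneck

# Hypothetical contents of one cached .txt file
cached_text = ','.join(['0.5'] * 1280)

# Parsing against the bottleneck shape succeeds
vector = parse_bottleneck(cached_text, bottleneck_shape)

# Parsing against the image shape fails in the same way as the traceback
try:
    parse_bottleneck(cached_text, image_shape)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```

Under that assumption, reshaping to the bottleneck length instead of `[image_size[0], image_size[1], img_channels]` would match what was cached.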

damienpontifex / BlogCodeSamples

Caching Bottlenecks #5