Optimizing input pipeline

AdityaKane2001 commented 3 years ago

@sayakpaul @MorganR

The issue with the profiler was resolved about a day ago. Since then, I've been trying to optimize the input pipeline. Following are my observations:

Our current input pipeline is very slow. It takes anywhere between 700,000 us to 1,000,000 us to load a batch of 128 images. Probable reasons in the following points. Recommended here is 50 us.
RandAugment takes the most time. If that is removed, the average time drops to 250,000 us.
Unfortunately, we cannot vectorize the entire pipeline. All of our preprocessing functions (randaugment,random_sized_crop,scale_and_center_crop) take single images. The tf functions used themselves take only one image at a time.
I was doing all of this on Colab. Perhaps this may be faster on GCP.

Things I tried:

Using interleave for TFRecords I/O instead of map.
Using uint8 for images instead of float32.
Removing cast ops. There are no cast ops in the preprocessing code now.

I tried removing num_parallel_calls in I/O functions but it increased the time taken.

These changes reduced the time by 200,000 us. These changes were based on guides by TensorFlow (here, here). I used the methods which were relevant.

Lastly, is are these times normal for image datasets? I figured since the training will be on TPUs, the input pipeline should be aggressively optimized. Please correct me if I'm wrong. Is there something that I'm completely missing? Frankly, I have a feeling that I'm blindsided by something very obvious.

Attaching herewith the zip file for all logs of my last run before posting this issue, in case you want to take a look at the actual numbers. last_run_logs.zip

sayakpaul commented 3 years ago

The issue with the profiler was resolved about a day ago.

What was the problem?

Following are my suggestions:

We are parsing images of 512x512 resolution which definitely is introducing added bottleneck.
You don't need to operate on float32 tensors. Because we will be including our normalization/scaling layer inside our model itself. Note that ops like tf.image.resize() cast your input tensor to float32, so it's good to cast them back to int.

Try order-independent loading:

options = tf.data.Options()
options.experimental_deterministic = False

dataset = tf.data.TFRecordDataset(...)
dataset = dataset.with_options(options)

For more refer here.

What happens when you load the TFRecords from a GCS Bucket? In my experience, I have often seen this to be faster than loading locally.

I agree that mapping functions to a batch of data is always helpful and it can drastically improve performance. Another experiment you could perform is caching the expensive operations. But it's NOT recommended to cache the operations that are supposed to be stochastic in nature, RandAugment, random_sized_crop, for example.

Frankly, I have a feeling that I'm blindsided by something very obvious.

You are doing a great job and believe me this is not very trivial to get an input pipeline that is highly optimized.

Let me know if anything is unclear.

AdityaKane2001 commented 3 years ago

What was the problem?

This was the issue.

We are parsing images of 512x512 resolution which definitely is introducing added bottleneck.

Actually, we have images that are not of any particular size. We decided on this a couple of weeks ago in our meet.

About uint8 and non deterministic outputs, I'm implementing those.

What happens when you load the TFRecords from a GCS Bucket?

I haven't done that since I've not setup the GCS config. But I am also of the opinion that GCS with our VM will be better.

I agree that mapping functions to a batch of data is always helpful and it can drastically improve performance.

Is there any way we can do this? As I mentioned earlier, our functions don't support batches.

Another experiment you could perform is caching the expensive operations

Could you please give an example? As far as I understand, all of the preprocessing functions are stochastic in nature.

sayakpaul commented 3 years ago

Actually, we have images that are not of any particular size. We decided on this a couple of weeks ago in our meet.

How come you are serializing multiple images in one shard then?

Could you please give an example? As far as I understand, all of the preprocessing functions are stochastic in nature.

See this section on caching.

AdityaKane2001 commented 3 years ago

How come you are serializing multiple images in one shard then?

We read the file from the input stream, do basic checks and then re-encode it as JPEG. The encoding returns a byte string which is then stored in BytesList. Code here.

See this section on caching.

I'll take a look.

sayakpaul commented 3 years ago

Nice. This is indeed learning for me. And it also suggests the bottleneck inside the input pipeline. When you have extremely uniform shapes (consider having individual entries in the shards having a shape of 224x224x3) you get more parallelization benefits. You might wanna try this out to confirm.

Take 500 images, serialize them with a fixed shape like 224x224x3, and then investigate.

AdityaKane2001 commented 3 years ago

Take 500 images, serialize them with a fixed shape like 224x224x3, and then investigate.

I resized the images to 512x512 and stored them. The time required for one batch is now between 60 us to 100us, which is way better. I think reducing the range of aspect ratios now would be a good idea, as the image will be most likely be distorted initially.

I'll try the rest of improvements and see if there is any additional improvement. I'll commit the changes along with #3 . I'll request a review when I make those changes.

sayakpaul commented 3 years ago

Sure. While doing initial resizing prefer smart_resize. Playing with different aspect ratio scales is also a good option.

I'll try the rest of improvements and see if there is any additional improvement. I'll commit the changes along with #3 . I'll request a review when I make those changes.

Sounds good to me.

Overall, I think we both got to learn how to take informed decisions to debug an input pipeline that is truly practically relevant :)

AdityaKane2001 commented 3 years ago

@sayakpaul

I resized the images to 512x512 and stored them. The time required for one batch is now between 60 us to 100us

Sorry, but the previous results were incorrect as they were on a CPU. For some reason, if we run the model on a CPU, the Wall time is extremely low, irrespective of the batch size. Same model executed on GPU, the wall time increases proportional to the batch size. The self time remains almost same (<50us) in all cases.

Essentially, all the things I've tried until now reduced the time marginally. From here I have 2 things:

Is Wall time the right measure? This will increase linearly as the batch size increases.
Is the excess time due to host to device communication? As I said earlier the CPU times are very low.

A side observation, that he time for a single batch is inconsistent.

If the batch size is 1024 (which we are going to use for the final implementation) we experience very large times, and the wall time is high.
If the batch size is 128 we experience low times for first couple of batches and then again very high times.
If the batch size is 1, we experience low times for all "batches".

Sorry again about the misinformation. I'll take care about this henceforth.

sayakpaul commented 3 years ago

Is the excess time due to host to device communication? As I said earlier the CPU times are very low.

Yes. This is why we prefetch a couple of batches well before the current epoch finishes. Increasing the batch size should generally improve the throughput.

A TPU-v3-8 has eight workers. If we use a batch size of 1024, it will be evenly distributed across these workers meaning each worker will operate on 1024//8 samples (each sample being 512x512x3 in shape). So, you might want to account for this and benchmark timings for 1024//8 samples. Also, please start using a GCS Bucket now 'cause it usually results in a speedup.

For your query on wall-time, I would like to ask @MorganR to chime in.

AdityaKane2001 commented 3 years ago

Yes. This is why we prefetch a couple of batches well before the current epoch finishes.

Got it.

So, you might want to account for this and benchmark timings for 1024//8 samples.

Shouldn't we consider 1024 samples, as the TPU will divide them into 8 workers, but the CPU will have to preprocess all of them beforehand?

Also, please start using a GCS Bucket now 'cause it usually results in a speedup.

Okay, I'll make a small bucket for Imagenette and its TFRecords. Just to confirm, I opt for us-* regions, right? Since Colab (or any of our future VMs) will be in us-* network itself?

sayakpaul commented 3 years ago

but the CPU will have to preprocess all of them beforehand?

Actually no. Each worker operates on the batch it receives.

Just to confirm, I opt for us- regions, right? Since Colab (or any of our future VMs) will be in us- network itself?

Ideally, the bucket should be created in the same region as the TPU. I'm not sure where Colab TPUs are located. But when we'd do this with Cloud TPUs we'd have better clarity on this. But this latency is very minimal so we can ignore it for now I think.

AdityaKane2001 commented 3 years ago

Actually no. Each worker operates on the batch it receives.

Understood.

But this latency is very minimal so we can ignore it for now I think.

Okay.

Thanks a lot for today, will work on this tomorrow and let you know the results.

AdityaKane2001 commented 3 years ago

@sayakpaul

Today I experimented quite a bit with all possible things we can do to improve the speed, but I couldn't get anything more than a minor speedup. The pipeline now requires ~800 to 900 ms for one batch of 128 (on GPU), as opposed to yesterday's 1000ms.

Following is the summary of today's experiments:

I used a GCS bucket to store the generated TFRecords. That provided a small improvement, albeit minimal. I tried to profile both GPU and TPU. I couldn't get actual latency numbers (the TPU profiler omitted the host statistics), but the TPU was used <1%, so that confirms the observations on GPU. On GPU, the numbers are same as before.
I used tf.data.Options() and its subclasses. I kept the AUTOTUNE as it is, and overrode some values but they didn't improve the performance. Actually, they deteriorated it by a substantial amount (1.2 - 1.5x time required).
As expected only reading and parsing TFRecords is very fast. Weirdly, the time required for applying only RandAugment is three times the time required to apply random_crop + RandAugment. I double-checked this, but yes, this was indeed the case.
The pipeline uses caching, prefetching and interleave. Caching is done just after parsing TFRecords. Prefetching at the end of the pipeline. Interleave uses deterministic = False.

The only thing that's not implemented is vectorization. I think it may give some improvement, but we'd need to vectorize all functions, including randaugment. Apart from that, I've tried everything in the book.

/cc @MorganR

sayakpaul commented 3 years ago

What happens when we vectorize the augmentation functions except for RandAugment? We can create a Lambda layer for RandAugment and probably then it will allow vectorization. But this option often increases the latency.

AdityaKane2001 commented 3 years ago

What happens when we vectorize the augmentation functions except for RandAugment?

I'll take a look into that. Is it okay if I give an update on this by today evening? I'll try to consolidate as much as I can and get back to you.

I also wanted to give an update about model and training scripts. The model scripts are about 70% done, and I'll do the training script soon. Since the training setup is fairly straightforward, I don't think we'll need to write a custom training loop. I'll open a PR as soon as I have completed the model scripts, and a PR later for the training script. Sounds good?

sayakpaul commented 3 years ago

Is it okay if I give an update on this by today evening? I'll try to consolidate as much as I can and get back to you.

Sure. No problem.

For the subsequent goals, I would suggest we first get through the model implementation first. Once that's complete we proceed toward training it. For training, we may want to discuss what callbacks to use, LR schedules, etc. The training should ideally be done with factory model.compile() and model.fit(). Here's an example.

AdityaKane2001 commented 3 years ago

@sayakpaul

Sorry for the delay in response.

Today's summary:

I vectorized random_sized_crop. The vectorization did not give a speedup as we thought. I strongly think that my implementation can be improved a lot, but I couldn't find a better way for that as of now. More on that below.
The bottleneck is without a doubt network latency. Loading TFRecords form Buckets, even with prefetch and interleave, takes long time. I hope this will not be an issue when we're using Cloud TPUs, but for now it's definitely an issue.
As mentioned earlier, RandAugment is not vectorizable. Barring that, the rest of the (vectorized) pipeline is better than yesterday, but far from optimal.

Updates on model scripts:

Today I completed the model initialization script. Now the Regnets can be randomly initialized.
I added a small test script for individual blocks. The script runs only on GPU/TPU.

Following is my implementation of random_sized_crop (vectorized)

 def _get_boxes(self, aspect_ratio, area):
        """
        Returns crop boxes to be used in crop_and_resize
        """
        heights = tf.random.uniform((self.batch_size,), 
            maxval = tf.math.sqrt(area) * aspect_ratio )
        widths = heights / tf.math.square(aspect_ratio)

        if tf.random.uniform(()) < 0.5 :
            temp = heights
            heights = widths
            widths = temp

        else:
            temp = heights #for AutoGraph

        max_width = tf.math.reduce_max(widths)
        max_height = tf.math.reduce_max(heights)

        x1s = tf.random.uniform((self.batch_size,), minval = 0, maxval = max_width/2 - 0.00001)
        y1s = tf.random.uniform((self.batch_size,), minval = 0, maxval = max_height/2 - 0.00001)

        x2s = widths + x1s
        y2s = heights + y1s

        x2s = tf.clip_by_value(x2s, clip_value_min=0, clip_value_max=1.0)
        y2s = tf.clip_by_value(y2s, clip_value_min=0, clip_value_max=1.0)

        boxes = tf.stack([y1s, x1s, y2s, x2s])

        boxes = tf.transpose(boxes)

        return boxes

    @tf.function
    def random_sized_crop(self, 
        example: dict,
        min_area: float = 0.08) -> dict:
        """
        Takes a random crop of image having a random aspect ratio. Resizes it 
        to self.image_size. Aspect ratio is NOT maintained. 

        Args:
            example: A dataset example dict.
            min_area: Minimum area of image to be used

        Returns:
            Example of same format as _TFRECS_FORMAT
        """

        image = example['image']
        h = example['height']
        w = example['width']

        aspect_ratio = tf.random.uniform((), minval = 3./4., maxval = 4./3.)
        area = tf.random.uniform((), minval = min_area, maxval = 1)

        boxes = self._get_boxes(aspect_ratio, area)

        image = tf.image.crop_and_resize(
            image,
            boxes,
            tf.range(self.batch_size),
            (self.image_size[0], self.image_size[0]),
        )

        return {...}

There are 2 main issues with this:

This does not guarantee a large speedup due to multiple random ops.
This does not always adhere to area and aspect_ratio, albeit by a small measure for a small number of cases.

That's all I have for today. Please share your thoughts on the input pipeline and the implementation.

Thanks.

MorganR commented 3 years ago

Without seeing your TensorBoard profile, it's hard to debug specifically where time is being taken. One thing to double check is if the input pipeline is running in eager mode? As Sayak mentioned, it's best to use model.compile() and model.fit(), since these will use tf.function to make sure things are being optimized a bit.

How big are your input TFRecords (i.e. how many MB is each file)?

It looks like your pipeline is roughly this:

dataset = (tf.data.Dataset.list_files('/path')
    .interleave(tf.data.TFRecordDataset, 
        num_parallel_calls = tf.data.AUTOTUNE,
        deterministic=False)
    .map(
        lambda example: tf.io.parse_example(example, _TFRECS_FORMAT),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self.decode_example(example), 
        num_parallel_calls = tf.data.AUTOTUNE)
    .cache()
    .map(
        lambda example: self.random_sized_crop(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self._randaugment(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self._one_hot_encode_example(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .batch(self.batch_size)
    .prefetch(tf.data.AUTOTUNE)
)

Can you provide some profiling screenshots from TensorBoard? That would help me to suggest useful improvements. In the meantime, I'd recommend a few things:

dataset = (tf.data.Dataset.list_files('/path')
    # Suggestion: try only verifying a small set of files while you test so you can actually verify multiple epochs of this pipeline.
    # See the note on caching below.
    .take(10)
    .interleave(tf.data.TFRecordDataset, 
        num_parallel_calls = tf.data.AUTOTUNE,
        deterministic=False)
    .map(
        lambda example: tf.io.parse_example(example, _TFRECS_FORMAT),
        num_parallel_calls = tf.data.AUTOTUNE)
    # I'm guessing here, but the lambdas might be affecting your function tracing. This doesn't depend on self,
    # so try moving the function outside of the class and dropping the lambda. Try doing the same for the other
    # map functions.
    .map(
        decode_example, 
        num_parallel_calls = tf.data.AUTOTUNE)
    # Note that unless you're iterating over the whole dataset, you're not yet seeing the benefits of this cache.
    # You'll only notice an improvement on the second epoch (of the whole dataset). Using `take` above lets us validate
    # this more quickly.
    .cache()
    # Does your batch size fit perfectly into the number of elements? If not, try setting drop_remainder=True in the batch call.
    # This might improve shape propagation and related optimizations. 
    # Do a batch here so the map calls are operating on multiple images at once. Based on your debugging, that sounds likely to 
    # be related. Tweak the size as necessary.
    .batch(self.batch_size)
    .map(
        random_sized_crop,
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self._randaugment(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self._one_hot_encode_example(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    # If the `batch` call above uses a smaller batch, you can add another batch here to get to your desired batch size.
    # .batch(2)
    .prefetch(tf.data.AUTOTUNE)
)

MorganR commented 3 years ago

@AdityaKane2001 did you already change the images to all have the same height and width before saving them as TFRecords? I don't think you'll be able to efficiently batch your preprocessing operations until you do this.

sayakpaul commented 3 years ago

I added a small test script for individual blocks. The script runs only on GPU/TPU.

Anywhere else it's supposed to run? Or did you miss out on something in your statement?

The bottleneck is without a doubt network latency. Loading TFRecords form Buckets, even with prefetch and interleave, takes long time. I hope this will not be an issue when we're using Cloud TPUs, but for now it's definitely an issue.

Take a look at this Colab Notebook in GPU/CPU mode. It loads TFRecords from a public GCS Bucket and it does that pretty fast. So, not very sure what's going on with the pipeline here.

Regarding vectorizing, random_sized_crop(), I'd suggest swapping it with keras.layers.experimental.preprocessing.RandomCrop and seeing if we are getting any performance benefits. I am aware this layer does not incorporate the scale aspect but still, it gets us closer to what we want.

AdityaKane2001 commented 3 years ago

@MorganR

Here and here's the CPU trace and here's the GPU trace. Please tell me if there are any important parts missing.

How big are your input TFRecords (i.e. how many MB is each file)?

About 80 to 100 MB each.

did you already change the images to all have the same height and width before saving them as TFRecords?

Yes, I did. Code here.

If I understand correctly, following is the current pipeline (my observations are wrt this one):


ds = (tf.data.Dataset.list_files('gs://adityakane-imagenette-tfrecs')
    .interleave(tf.data.TFRecordDataset, 
        num_parallel_calls = tf.data.AUTOTUNE,
        deterministic=False)
    .map(
        lambda example: tf.io.parse_example(example, _TFRECS_FORMAT),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self.decode_example(example), 
        num_parallel_calls = tf.data.AUTOTUNE)
    .batch(self.batch_size)
    .map(
        lambda example: self.random_sized_crop(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    # No randaugment here for now as it is not vectorized
    .map(
        lambda example: self._one_hot_encode_example(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

I'll remove the lambdas wherever possible and replace them with global functions. This may not be relevant, but does the dict with all the other attributes cause a slowdown?

@sayakpaul

Anywhere else it's supposed to run? Or did you miss out on something in your statement?

Nope, nowhere else. Just mentioned for clarity, as grouped convs are not supported on CPUs.

It loads TFRecords from a public GCS Bucket and it does that pretty fast.

Not sure about this, but I don't think it is possible to say with certainty that the pipeline is fast without profiling. Also, it applies minimal augmentation (random flip and random saturation). Please correct me if I'm wrong.

I'd suggest swapping it with keras.layers.experimental.preprocessing.RandomCrop

Sure, I'll give it a try.

sayakpaul commented 3 years ago

Nope, nowhere else. Just mentioned for clarity, as grouped convs are not supported on CPUs.

Not sure about the validity of this. Here's an example of a ResNext block and it uses group convs. Runs perfectly fine on a CPU.

from tensorflow.keras.layers import *
import tensorflow as tf

# Reference:
# https://livebook.manning.com/book/deep-learning-design-patterns/chapter-6/v-5/174
def resnext_block(x, filters_in=32, cardinality=4):
    shortcut = x

    # (1) 1x1 bottleneck convolution
    x = Conv2D(filters_in, (1, 1), strides=(1, 1), padding='same')(shortcut)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    # (2) Split & Transform
    filters_card = filters_in // cardinality
    groups = []
    for i in range(cardinality):
        group = Lambda(lambda z: z[:, :, :, i * filters_card:i *
                                filters_card + filters_card])(x)
        groups.append(Conv2D(filters_card, (3, 3), strides=(1, 1),
                            padding='same')(group))
    # (3) Merge
    x = Concatenate()(groups)
    x = BatchNormalization()(x)
    x = ReLU()(x)
    # Dimensionality restoration
    x = Conv2D(filters_in, (1, 1), strides=(1, 1), padding='same')(x)
    x = BatchNormalization()(x)
    # (4) Scale
    x = Add()([shortcut, x])
    x = ReLU()(x)
    return x

inputs = Input((224, 224, 32))
x = resnext_block(inputs)
outputs = Dense(10)(x)
model = tf.keras.Model(inputs, outputs)

Not sure about this, but I don't think it is possible to say with certainty that the pipeline is fast without profiling. Also, it applies minimal augmentation (random flip and random saturation). Please correct me if I'm wrong.

The comment on the speed of the pipeline was based on my observation of running that code like 1000 times. So, yes, I don't have a calibrated probability estimate to make a speed claim but in this case, I'll definitely trust my observation.

That said, you're totally right that the pipeline does not make use of heavy augmentation like ours. Also, the size of an individual entry inside any given shard is way lesser than ours. My point was loading from a public GCS bucket has never been a slow experience for me.

Since RandAugment is likely the root of all evil here, I have the following suggestions:

As per this line we are applying 2 layers of RandAugment with a strength of 5. Let's interpolate between 2 - 5 (for strength) and see if we get improvements for lower strengths. Chances are that we will.
@MorganR do you think we should flag this to the Model Garden folks to get any performance optimization suggestions specifically for RandAugment?

AdityaKane2001 commented 3 years ago

example of a ResNext block and it uses group convs.

Sorry, I meant using the groups = ... argument of Conv2D as it is not supported on a non-XLA CPU. In that case, should I change the current code to use such explicitly defined groups?

sayakpaul commented 3 years ago

Since it's unlikely anyone's gonna train a CPU, I think we can just comment in the code to denote the limitation.

AdityaKane2001 commented 3 years ago

Could we implement a weaker RandAugment, something like the code below? That along with some ops from TF Addons can be used. We can cover most augmentations that way, and everything will be vectorized. We will then have the flexibility to remove augmentations that are known to show minimal improvements on small datasets (CIFAR 10), eg Solarize, Posterize.

# A crude implementation

def get_augment_list():
    return np.array(list(map(lambda x:x<0.5,np.random.random(size=6))),dtype='bool')

@tf.function
def resize_image(image):
    return tf.cast(tf.image.resize(image,[IMG_SIZE,IMG_SIZE]),tf.float16)

@tf.function
def augment_img_randomly(img):
    '''
    Augmentaions to be used:

    Random hue (0.2)
    Random brightness (0.3)
    Random saturation (0.7,1.3)
    Random contrast  (0.8,1.2)
    ''' 
    augment_list = get_augment_list()
    image = resize_image(img)
     #(32,512,512,3)

    if augment_list[0]:
        image = tf.image.random_saturation(image,0.7,1.3)
    if augment_list[1]:
        image = tf.image.random_contrast(image,0.8,1.2)
    if augment_list[2]:
        image = tf.image.random_brightness(image,0.3)
    if augment_list[3]:
        image = tf.image.random_hue(image,0.2)
    if augment_list[4]:
        image = tf.image.random_flip_left_right(image)
    if augment_list[5]:
        image = tf.image.random_flip_up_down(image)

    image = tf.math.divide(image,255)

    return image

sayakpaul commented 3 years ago

get_augment_list() this one won't fit inside a TPU (non-TF ops) and it has downsides. tf.function will trace it for only once and as a result, you'll get the same value it had picked up during the initial tracing. So, this is not truly random (I understand nothing is truly random when it comes to code, it's all pseudo).

You can refer to this blog post and take a look at the augmentations used there. It also closely resembles the strong data augmentation pipelines typically used to train self-supervised vision models.

AdityaKane2001 commented 3 years ago

tf.function will trace it for only once

Yes, I'm aware of that, just wanted to illustrate the idea quickly. Is it okay if we try this?

You can refer to this blog post and take a look at the augmentations used there.

Yes, same idea, but I want to add more augmentations using tfa to more or less mimic RandAugment.

sayakpaul commented 3 years ago

Go ahead. But all the random ops should purely be based on native TF. You may find this set of utilities to be a relevant reference as well.

MorganR commented 3 years ago

Thanks for the traces, Aditya. Based on the CPU traces, the GCS reads aren't an issue, as you can see things are being appropriately prefetched, and the reads are not taking the most time. How er, the map calls are clearly slowing it down.

Can you give a magnified view of the input ops on the GPU trace? From that image, it looks like most time is spent training, and it's not clear that the input latency is a problem. Could you clarify?

Do you mean that each individual example is a dict inside your dataset? I am not certain, but I would guess that this cannot be optimized as well as if they were raw tensors. I'd definitely recommend adding an op right after decoding, before the cache, that converts these to raw tf Tensors. Then update your map calls to work with raw tensors. I wonder if this will improve RandAugment performance too. I'll see if I can learn anything else today re: improving that performance.

On Tue, 29 Jun 2021, 07:04 Sayak Paul, @.***> wrote:

Go ahead. But all the random ops should purely be based on native TF.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/AdityaKane2001/regnety/issues/4#issuecomment-870237741, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZIOLDEF4JM6XOUEW7IUJ3TVFH4VANCNFSM47LCF7GA .

AdityaKane2001 commented 3 years ago

@MorganR

Here are the traces for two batches on GPU: trace1, trace2

trace1 is of the first batch of training - extremely fast. The gradient tape and backprop are comparable to the whole train step. trace2 is of 2nd epoch, 8th batch - not very fast. The same ops are considerably slower compared to the whole training step.

I have a small question regarding reading these trace views. My understanding is, the larger block (say IteratorGetNext::DoCompute in CPU trace) denotes the time interval between when the batch was called and the batch was returned. All processes in parallel threads to this one, contained in that interval, correspond to preparing this batch. Once all preprocessing is done, the batch is returned and the IteratorGetNext::DoCompute block ends. Is this correct?

Based on this, I concluded that the LoadFromGCSBuffer was taking half the time in the block

and thus network latency was the problem. And yes, input pipeline is definitely half of the problem (CropAndResize block and numerous small lines denoting DecodeJPEG).

Same is the case with GPU. Most of the sequential block does not have any ops in parallel. Thus, I thought that most of the time is spent in waiting for input.

Attaching herewith the logs for this run. I'll make a notebook and put these logs in there, so you can see them on TensorBoard. reference_logs.zip

AdityaKane2001 commented 3 years ago

@MorganR

Here is the notebook. You'll need to execute it, as the logs zip is fetched remotely. Attached are the logs used in this notebook. logs.zip

AdityaKane2001 commented 3 years ago

Just an update, ImageNet train, val and test sets are downloaded. Surprisingly, the entire download completed in just over three hours. Now working on WeakRandAugment. Will complete that by tomorrow evening and share the results.

Thanks.

AdityaKane2001 commented 3 years ago

@sayakpaul @MorganR

I've implemented WeakRandAugment. It has color_jitter, cutout, invert, rotate and solarize augmentations. These were the only ones that executed in vectorized manner. Most of others in tfa didn't work because they expected the shape to be fully known during their execution.
Also, I've removed random_sized_crop from the pipeline completely in favor of tf.keras.layers.experimental.preprocessing.RandomCrop. That must've also added to the performance boost.
The performance boost without cache was minimal, whereas with cache was good. (~66% improvement)

You can see the trace view in this notebook. You'll need to execute the notebook since the logs are fetched remotely.

There are two things:

The Step Time Graph looks a bit weird. Here's the screenshot. Any probable reason for this?
How can we use cache during the actual training, since the dataset is too big?

Attached herewith are the logs used in the notebook.

Thanks.

logs.zip

sayakpaul commented 3 years ago

For TPUs, explicit reshaping is anyway needed. So, after parsing the TFRecords, we need to give our examples an explicit shape. After this, you should be able to incorporate the layers you couldn't.

You can see the trace view in this notebook. You'll need to execute the notebook since the logs are fetched remotely.

You can serialize the TensorBoard logs somewhere locally and save them if that's convenient. They can be visualized later inside a TensorBoard instance separately.

What kind of step time are we expecting with the latest changes? It would give us a better idea.

How can we use cache during the actual training, since the dataset is too big?

I don't think there is any hard and fast rule about this one. The general recommendation is to cache the expensive (deterministic) functions after mapping them to a batch of data. Regarding the volume, since each worker is going to receive a local batch of the data it should get evenly distributed.

How are doing now in terms of numbers, though? I am kind of a bit naive in this case :D I prefer to have something like - "Now 1000 images takes ~28.6 seconds to fetch as opposed to ...".

AdityaKane2001 commented 3 years ago

So, after parsing the TFRecords, we need to give our examples an explicit shape.

Should I do this using tf.Tensor.set_shape?

You can serialize the TensorBoard logs somewhere locally and save them if that's convenient.

Most of the important ones are in this issue itself. I just include the notebook if you want to take a look at the trace.

Regarding the volume, since each worker is going to receive a local batch of the data it should get evenly distributed.

I know you have said this earlier, but could you please provide a blog or some resource where I can read this up? I am a bit foggy on how TPUs exactly work.

How are doing now in terms of numbers, though?

A batch of 128 requires:

300 to 400 ms with caching
500 to 900 ms without caching

It's not as low as prescribed by TF guides, but I have tried to reduce retracing as much as possible.

What should be our target "speed" for this? I don't think we'll be able to reduce it much further by the means available, but that's just my opinion.

/cc @MorganR

sayakpaul commented 3 years ago

Should I do this using tf.Tensor.set_shape?

tf.reshape works just fine.

I know you have said this earlier, but could you please provide a blog or some resource where I can read this up? I am a bit foggy on how TPUs exactly work.

See if this helps. You can look up how synchronous distributed training works in general. There are a couple of lectures on this topic on Coursera. Those may be helpful too.

What should be our target "speed" for this?

Honestly, it's a bit too early to comment on this without actually getting it to train in the actual scenario. A pipeline that reduces the total idle time of the hardware accelerator should be deemed a good one IMO.

AdityaKane2001 commented 3 years ago

@sayakpaul

I have added the sharpness augmentation. I have excluded equalize, as it increased the training time a lot and used up entire available memory.

Today I observed something very off. The CPU is idling a lot. Could you please take a look? Here's the screenshot from the profiler.

Any probable reasons for this?

sayakpaul commented 3 years ago

What's your setup currently? Can you post your input pipeline without the comments and other style-oriented modifications? Will be easier for me to look into. The screenshot isn't telling me much about why the idle time is so high.

Also, equalize is a simpler operation than many of the others that we are using. So, it's a bit strange to me why, in particular, equalize is introducing a bottleneck.

AdityaKane2001 commented 3 years ago

Here's the pipeline.

ds = (tf.data.Dataset.list_files('gs://adityakane-imagenette-tfrecs')
    .interleave(tf.data.TFRecordDataset, 
        num_parallel_calls = tf.data.AUTOTUNE,
        deterministic=False)
    .map(
        lambda example: tf.io.parse_example(example, _TFRECS_FORMAT),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self.decode_example(example), 
        num_parallel_calls = tf.data.AUTOTUNE)
    .batch(self.batch_size)
    .map(
        lambda example: self._randaugment(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .map(
        lambda example: self._one_hot_encode_example(example),
        num_parallel_calls = tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)

What's your setup currently?

Could you please elaborate which details you want?

Also, equalize is a simpler operation than many of the others that we are using. So, it's a bit strange to me why, in particular, equalize is introducing a bottleneck.

Just a guess, but could this be the issue? Since they've used Python native range(), the graph will include actual 3 blocks instead of a single tf.while.

sayakpaul commented 3 years ago

Could you please elaborate which details you want?

Machine configurations and if you are using a local file system to retrieve the TFRecords. Sorry, I should have made this clearer.

On your pipeline:

Can we combine parse_example and decode_example into one utility? Also, I would suggest against using lambdas as much as possible. See if you can use closures or some other means to directly map the utilities.
The one-hot encoding step could be performed before _randaugment(), isn't it?

Just a guess, but could this be the issue? Since they've used Python native range(), the graph will include actual 3 blocks instead of a single tf.while.

Nice catch. How about verifying this ourselves?

AdityaKane2001 commented 3 years ago

Machine configurations and if you are using a local file system to retrieve the TFRecords. Sorry, I should have made this clearer.

Actually, I only use Colab. So here's the config.

I'll make the changes and get back to you.

How about verifying this ourselves?

How can we do that?

sayakpaul commented 3 years ago

How can we do that?

Copy the initial implementation from the tfa and then replace then use tf.while()?

Let's use an AI Platform Notebook instance from here on.

AdityaKane2001 commented 3 years ago

On your pipeline:

Completed these things. Started using AI platform notebook.

The CPU remains idle for about 60% even while using an AI platform notebook. I removed most of the lambdas, but the following are sort of inevitable, as we need callables for the other function to map them.

https://github.com/AdityaKane2001/regnety/blob/4fa94d834d46b075edf17c8b45a3123c810ed146/regnety/dataset/augment.py#L211-L219

If I remove randaugment from the pipeline, the GPU idle time is reduced considerably, to approx 15%. But CPU idle time continues to be around 55-60%.

sayakpaul commented 3 years ago

Completed these things. Started using AI platform notebook.

Thank you. What machine are you using? Also, make sure these are from the GCP Credits you received for the free tier.

Could you try to make the augmentation even simpler? Something close to what I had done here? Since this is constructing an augmentation chain at random this might be introducing a bottleneck.

AdityaKane2001 commented 3 years ago

What machine are you using?

I am using n1-4 with 15 GB RAM and T4..

Also, make sure these are from the GCP Credits you received for the free tier.

I had used one coupon out of the five provided. That amount has not been revoked yet. But yes, for now I am using the free trial credits.

Since this is constructing an augmentation chain at random this might be introducing a bottleneck.

So for this I have used the following approach: Since we have 6 augmentations currently, and we need 2 of them, there will be 15 (sub)graphs (6C2), which is not much IMO. I've sorted the random augment list beforehand, so that order of augmentations is not to be considered.

If I understand correctly, we cannot use something like this as we need to be sure that we're applying exactly num_augs augmentations for each batch. Please correct me if I'm wrong.

sayakpaul commented 3 years ago

I am using n1-4 with 15 GB RAM and T4..

Would be a good idea to up to an n1-standard-8. Once we have provisioned the TPUs, there won't be any need to use GPUs.

If I understand correctly, we cannot use something like this as we need to be sure that we're applying exactly num_augs augmentations for each batch. Please correct me if I'm wrong.

So for this I have used the following approach: Since we have 6 augmentations currently, and we need 2 of them, there will be 15 (sub)graphs (6C2), which is not much IMO. I've sorted the random augment list beforehand, so that order of augmentations is not to be considered.

Okay. Then I am not sure why it'd introduce such a bottleneck given we are applying the chain on a batch of images.

Not sure why we CANNOT apply it. The random factor makes sure the stochasticity bit is kept intact. Maybe I am missing out on something, so feel free to expand more.

AdityaKane2001 commented 3 years ago

Not sure why we CANNOT apply it. The random factor makes sure the stochasticity beat is kept intact. Maybe I am missing out on something, so feel free to expand more.

For example we have

color jitter, probability 0.5
sharpen , probability 0.5
solarize, probability 0.5

then out of 100 runs, we can safely say that some 50 of them have color jitter, some 50 sharpen and some 50 solarize. We cannot be sure that exactly, say, 2 augmentations are applied for a given run. To maintain this determinism, I've done so. To optimize for performance, I'm sacrificing some meta-level randomness, ie the order of the augmentations.

The flip side of this is that still we're losing some performance. In your code, the graph is entirely defined and does not depend upon the augmentations chosen. But as I said earlier, we will be giving away control (number of augmentations) in that case.

If giving away this control is okay, then we can shift to the approach which you have illustrated. Please share your thoughts regarding this.

sayakpaul commented 3 years ago

then out of 100 runs, we can safely say that some 50 of them have color jitter, some 50 sharpen and some 50 solarize. We cannot be sure that exactly, say, 2 augmentations are applied for a given run. To maintain this determinism, I've done so. To optimize for performance, I'm sacrificing some meta-level randomness, ie the order of the augmentations.

This is actually fine. SimCLR discusses this idea in greater detail. Remember that we can also play with the probabilities to have more granularities.

The flip side of this is that still we're losing some performance. In your code, the graph is entirely defined and does not depend upon the augmentations chosen.

It's unclear to me. Could you explain?

AdityaKane2001 commented 3 years ago

It's unclear to me. Could you explain?

In your code, the decision that the augmentations are applied or not depends upon a threshold. Which means there are exactly two branches with every additional augmentation. In my case, the augmentations to be applied are defined by the numbers given by tf.random.uniform , and thus the graph must be traced during runtime. Hence we minimize the number of graphs to be formed during runtime. This is my understanding of the matter, but feel free to correct me if this seems off.

This is actually fine. SimCLR discusses this idea in greater detail. Remember that we can also play with the probabilities to have more granularities.

I'll take a look.

sayakpaul commented 3 years ago

In my case, the augmentations to be applied are defined by the numbers given by tf.random.uniform , and thus the graph must be traced during runtime. Hence we minimize the number of graphs to be formed during runtime.

Well, in this case too, the number of augmentations to be applied is non-determinstic in nature (which is encouraged). So, I am still not clear on the grounds on which you are making the performance comparisons.

On one hand, you mentioned:

In your code, the graph is entirely defined

On another, you mentioned:

In my case, the augmentations to be applied are defined by the numbers given by tf.random.uniform , and thus the graph must be traced during runtime.

This is also why I am confused.

AdityaKane2001 / regnety

Optimizing input pipeline #4