ModelDepot / tfjs-yolo-tiny

In-Browser Object Detection using Tiny YOLO on Tensorflow.js
https://modeldepot.io/mikeshi/tiny-yolo-in-javascript
MIT License

Bottleneck? #6

Open hiddentn opened 6 years ago

hiddentn commented 6 years ago

This line of code takes a long time to execute: https://github.com/ModelDepot/tfjs-yolo-tiny/blob/master/src/postprocess.js#L24

const mask_arr = await prediction_mask.data();


Is there another way to do it (transforming the tensor to an array)? :) EDIT: it's Bottleneck-0 in the image, btw.

MikeShi42 commented 6 years ago

What's happening there is that the data has to be transferred from the GPU to the CPU. I couldn't figure out a way to do something like a boolean mask on the GPU only. Perhaps there is a way to transfer less data than I currently do, or to parallelize more tasks in the meantime? I'm not sure.

It's also possible to ping the TF team and ask about boolean_mask support. Awesome profiling though! Would it be possible to document a bit of what you've found so far, so there's a reference for where the biggest improvements could be made?
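For context, the pattern being discussed is roughly the following: a sketch for illustration only, not the repo's exact code (the keep_indx name is borrowed from a later snippet in this thread).

  // Download the mask to the CPU, then build the list of surviving box indices in JS.
  const mask_arr = await prediction_mask.data();   // GPU -> CPU transfer (the slow part)
  const keep_indx = [];
  for (let i = 0; i < mask_arr.length; i++) {
    if (mask_arr[i]) keep_indx.push(i);            // keep only boxes that passed the threshold
  }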

hiddentn commented 6 years ago

Hi, thank you for your reply. I got those times on a GPU: GTX 850M 4GB DDR3.

On CPU (i7-4710HQ/Intel HD4600) I get about 850ms/image and, weirdly, fewer predictions? :D (screenshot attached)

I will refactor the code, do some cleanup, and upload it (maybe tomorrow).

About improvements: I was wondering if it's possible to perform more operations on tensors before downloading the final data to the CPU (like the IoU, the box filtering, and the class thresholds).

I am currently reading the original darknet lib and the darkflow port; maybe I can get some clues.

I will post any findings.

jacobgil commented 6 years ago

Hi, I'm interested in this too! I'm trying to run an extremely lightweight model (on CPU); the forward pass takes 10ms. I actually hit the same bottleneck code.

So I replaced the code with a tensorflow equivalent:

  const prediction_mask = tf.greaterEqual(box_class_scores, tf.scalar(threshold)).as1D();
  const N = prediction_mask.size;
  const all_indices = tf.tensor1d(Array.from(Array(N).keys()));
  const neg_indices = tf.tensor1d(new Array(N).fill(0));
  const indices = tf.where(prediction_mask, all_indices, neg_indices).toInt();

So .data() isn't called (index 0 represents the objects that don't survive the threshold, but those will be filtered out later anyway, and it's only a single index).

This indeed shrank the run time of that part of the code to the order of 10ms.

But then in the yolo function, there is

  const classes_indx_arr = await classes.gather(tf.tensor1d(keep_indx).toInt()).data();

and I didn't find a way to reduce that yet.

jacobgil commented 6 years ago

The really strange thing for me is that I'm running this on a laptop without a GPU and still have the same issue.

MikeShi42 commented 6 years ago

@jacobgil wow, that's really clever! 😍 I scratched my head quite a bit over how to avoid the gross data transfer and couldn't figure out how exactly to use where for this case, haha. I'll admit I still need to chew on it a bit to understand the implications of index 0: is the assumption that class 0 will most likely have a low prob and get filtered out by the classProbThreshold?

I want to say that last data transfer is hard to get rid of (we need to exit tensor-land eventually). It seems like we could move the classProbThreshold into tensor-land by borrowing the same greaterEqual/where/gather masking logic. I haven't touched the code for a bit, so I can't think through the potential impact (can't speculate on latency/data transfer speed).
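Something like this, perhaps (a rough, untested sketch that reuses the names from the snippets above: box_class_scores, classes, all_indices, neg_indices):

  // Apply the class-prob threshold with the same greaterEqual/where trick, so this
  // second filter also stays on the backend instead of needing its own .data() call.
  const class_mask = tf.greaterEqual(box_class_scores, tf.scalar(classProbThreshold)).as1D();
  const class_keep = tf.where(class_mask, all_indices, neg_indices).toInt();
  const kept_classes = classes.gather(class_keep);
  const kept_scores = box_class_scores.gather(class_keep);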

Would you mind landing a PR with your proposed new code? And some pre/post-optimization benchmarks?

p.s. I think that masking logic alone would make a great NPM module haha, unless boolean_masks are already a thing or upcoming on the tfjs roadmap (the project moves too fast for me to keep up)

jacobgil commented 6 years ago

Yes the assumption is that if class 0 survives the initial filtering (but has a very low probability), it can't survive the next filtering by classProbThreshold anyway.

Unfortunately this doesn't speed things up for the entire program, since the .data() transfer happens in another place in the code anyway; pre/post-optimization benchmarks show that the total run time isn't changed, only the run time of that specific part of the code.

I'll try to figure out a way to move classProbThreshold into tensor-land.

MikeShi42 commented 6 years ago

@jacobgil another note: you can also use tf.linspace and tf.zeros for all_indices and neg_indices instead of creating a tensor via a JS array.
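For example (a minimal sketch, assuming the same N = prediction_mask.size as above):

  const all_indices = tf.linspace(0, N - 1, N);   // 0, 1, ..., N-1, built on the backend
  const neg_indices = tf.zeros([N]);              // all-zero fallback indices
  const indices = tf.where(prediction_mask, all_indices, neg_indices).toInt();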

It seems weird, though, that eliminating that data transfer doesn't improve perf. I've read somewhere that timing tf.js accurately can be a bit tricky, but I can't currently find it. I do notice that there are some nice timing tools in the latest version of tf.js that would be great to play with.

I've just put out a post on the mailing list to see if they can shed any insight into benchmark best practices. :) https://groups.google.com/a/tensorflow.org/forum/#!topic/tfjs/_IDVt3wQFXA

MikeShi42 commented 6 years ago

So, doing a bit more digging, I surrounded model.predict with performance.now() calls, but I'm pretty confident that 10ms per prediction is a fluke (either that or my MBP has the same processing power as a Titan X, hehe). This is most likely related to the timing issue I mentioned above.
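For reference, the measurement was roughly this (variable names are assumptions). Since tf.js queues GPU kernels asynchronously, predict() returns before the work is actually done, which is why the wrapped number can look implausibly small:

  const t0 = performance.now();
  const out = model.predict(batched);                       // only queues the GPU work
  console.log('predict():', performance.now() - t0, 'ms');  // can report ~10ms

  const t1 = performance.now();
  await out.data();                                         // forces the GPU queue to drain
  console.log('with download:', performance.now() - t1, 'ms');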

I ran Chrome's perf tool instead to see where the time goes, and it seems to mainly lie in Model.predict; very little time actually seems to be spent on dumping data (not too surprising, because there is very little data being transferred). yolo_head also chews up a bit of time with its tensor ops, but there doesn't seem to be any inefficiency that would lead to 2x perf gains.

Relevant call tree: (screenshot attached)

There are other potential things to try to improve inference time though:

Nikhil has previously mentioned that separable conv2d layers would be something worth looking into, but considering that would involve re-training YOLO, I think that's something I'd have to find some time down the line to do :)

hiddentn commented 6 years ago

@MikeShi42 you can remove the BatchNorm layers from the model, as they are not needed for inference (I think). I managed to do it, but I am not very sure about the results. Here's a script on tiny-yolo COCO:

import os
import numpy as np
import keras
from keras.models import Sequential, load_model
from keras.layers import Conv2D, MaxPooling2D
from keras.layers.advanced_activations import LeakyReLU
import tensorflowjs as tfjs

model_path = "tinyv2.h5"

# Load the model that was exported by YAD2K.
model = load_model(model_path)
# model.summary()
model_nobn = Sequential()
model_nobn.add(Conv2D(16, (3, 3), padding="same", input_shape=(416, 416, 3)))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D())
model_nobn.add(Conv2D(32, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D())
model_nobn.add(Conv2D(64, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D())
model_nobn.add(Conv2D(128, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D())
model_nobn.add(Conv2D(256, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D())
model_nobn.add(Conv2D(512, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(MaxPooling2D(strides=(1, 1), padding="same"))
model_nobn.add(Conv2D(1024, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(Conv2D(512, (3, 3), padding="same"))
model_nobn.add(LeakyReLU(alpha=0.1))
model_nobn.add(Conv2D(425, (1, 1), padding="same", activation='linear'))
#model_nobn.summary()
def fold_batch_norm(conv_layer, bn_layer):
    """Fold the batch normalization parameters into the weights for 
       the previous layer."""
    conv_weights = conv_layer.get_weights()[0]

    # Keras stores the learnable weights for a BatchNormalization layer
    # as four separate arrays:
    #   0 = gamma (if scale == True)
    #   1 = beta (if center == True)
    #   2 = moving mean
    #   3 = moving variance
    bn_weights = bn_layer.get_weights()
    gamma = bn_weights[0]
    beta = bn_weights[1]
    mean = bn_weights[2]
    variance = bn_weights[3]

    epsilon = 1e-3
    new_weights = conv_weights * gamma / np.sqrt(variance + epsilon)
    new_bias = beta - mean * gamma / np.sqrt(variance + epsilon)
    return new_weights, new_bias

W_nobn = []
W_nobn.extend(fold_batch_norm(model.layers[1], model.layers[2]))
W_nobn.extend(fold_batch_norm(model.layers[5], model.layers[6]))
W_nobn.extend(fold_batch_norm(model.layers[9], model.layers[10]))
W_nobn.extend(fold_batch_norm(model.layers[13], model.layers[14]))
W_nobn.extend(fold_batch_norm(model.layers[17], model.layers[18]))
W_nobn.extend(fold_batch_norm(model.layers[21], model.layers[22]))
W_nobn.extend(fold_batch_norm(model.layers[25], model.layers[26]))
W_nobn.extend(fold_batch_norm(model.layers[28], model.layers[29]))
W_nobn.extend(model.layers[31].get_weights())
model_nobn.set_weights(W_nobn)

# Make a prediction using the original model and also using the model that
# has batch normalization removed, and check that the differences between
# the two predictions are small enough. They seem to be smaller than 1e-4.

print("Comparing models...")

image_data = np.random.random((1, 416, 416, 3)).astype('float32')
features = model.predict(image_data)
features_nobn = model_nobn.predict(image_data)

max_error = 0
for i in range(features.shape[1]):
    for j in range(features.shape[2]):
        for k in range(features.shape[3]):
            diff = np.abs(features[0, i, j, k] - features_nobn[0, i, j, k])
            max_error = max(max_error, diff)
            if diff > 1e-4:
                print(i, j, k, ":", features[0, i, j, k], features_nobn[0, i, j, k], diff)

print("Largest error:", max_error)
print("Converting...")
tfjs.converters.save_keras_model(model_nobn, "./Converted-NoBatch/")
print("yay")
print("Done!")

Also check this out, because I think he has a better way of post-processing.

MikeShi42 commented 6 years ago

@TheHidden1 do you have a set of tfjs weights that can be pointed to? :) Or do you need help converting the Keras model to tfjs format?

Also, I don't think that other method would be faster: the current method uses a good amount of parallelism (it's vectorized), whereas the one linked seems to use a triple for-loop, making it a serial operation.

hiddentn commented 6 years ago

https://github.com/TheHidden1/tiny-yolo-noBatch

I am not sure about the perf impact, but this definitely uses less memory.

Right now I think the post-processing is messy. From what I can understand, there are 3 main things to do in tensor-land to minimize the tensor.data() impact:

box filtering, NonMaxSuppression, and class-prob filtering

I think these 3 can be done with conv layers to obtain some kind of final indices tensor, then tf.gather the output (the 5 or 6 final detections) and only then call data() on it.
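A hedged sketch of that idea, assuming a tfjs version that ships tf.image.nonMaxSuppression (boxes as a [numBoxes, 4] tensor, scores as [numBoxes]); the variable names follow the snippets in this thread and nothing here is tested:

  // Box filtering, NMS and the class-prob threshold all stay in tensor-land;
  // .data() is only called once, on the handful of surviving detections.
  const keep = tf.image.nonMaxSuppression(
    boxes, box_class_scores, maxBoxes, iouThreshold, classProbThreshold);
  const final_boxes = boxes.gather(keep);
  const final_scores = box_class_scores.gather(keep);
  const final_classes = classes.gather(keep);
  const [b, s, c] = await Promise.all(
    [final_boxes.data(), final_scores.data(), final_classes.data()]);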

I will try to do it, but honestly I am still way too much of a noob in TensorFlow.

justadudewhohacks commented 6 years ago

Hey, interesting discussion! I am facing the exact same issue.

I implemented an SSD MobileNet architecture, but fetching the tensor data is the bottleneck here. I removed the post-processing layer and do non-max suppression and confidence filtering manually.

Initially my guess was that doing the post-processing (e.g. non-max suppression and confidence filtering) on the GPU would fix the bottleneck, because you would have to transfer less data, as @TheHidden1 suggested.

Now the weird thing is: whether I call scores.data() to get the entire tensor data, or reduce the tensor to a single entry with tf.slice beforehand, fetching the data from the GPU takes the same time for both tensors. Thus I suspect it doesn't really matter whether you do filtering and non-max suppression on the GPU or the CPU.

Anyway, I'm hitting a performance bottleneck fetching the tensor data (~100ms), whereas the actual forward time is ~30ms on my system.

hiddentn commented 6 years ago

@justadudewhohacks how exactly do you slice beforehand? And how do you do NMS on the GPU?

I managed ~20ms forward time + pre/post-processing on tiny-yolov2 and tiny-yolov3 on a GTX 850M 4GB.

Sadly, data() still takes about ~190ms, which is really frustrating.

justadudewhohacks commented 6 years ago

Basically: const sliced = tf.slice(scores[0], 0, 1), which gives me a tensor of shape [1]. Calling .data() on that one is ~100ms for some reason.

I think there must be some prefetching going on. For example if I do:

  await tensor1.data()
  await tensor2.data()

then fetching the data of tensor2 is ~10ms, while the first fetch is ~100ms. Otherwise fetching tensor2 would also take about 100ms.
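Measured roughly like this (a sketch of the experiment, not the exact code):

  let t = performance.now();
  await tensor1.data();                                    // ~100ms
  console.log('first fetch:', performance.now() - t, 'ms');
  t = performance.now();
  await tensor2.data();                                    // ~10ms
  console.log('second fetch:', performance.now() - t, 'ms');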

This is quite odd behavior in my opinion. I will have to investigate it.

Regarding NMS, I don't do it on the GPU. As I said, I do all post-processing on the CPU. The code for NMS is from my PR in tfjs-core; I just pass scores and maxOutputSize as a number array instead of tensors.

MikeShi42 commented 6 years ago

I'm not sure how y'all are measuring timing, but my suspicion (I forgot if I already said this somewhere above) is that because tensor.data() waits for the GPU to be free, if there are tensor operations running (inference), then the delay/wait you're seeing on the .data() call is most likely it waiting for the inference step to finish before downloading data to the CPU, not the download itself actually taking 100ms to transfer.

I haven't benchmarked the code in a while, but I think the GPU usage graph in Chrome could be a good telltale sign of whether the GPU is busy doing inference for only 20-30ms, or whether it's actually taking 120ms but, because it's GPU code, our JS burndowns don't capture the timing appropriately.

It'd be great if y'all shared burndown/GPU usage graphs that could probably help in documenting perf improvements :)

Just my 2cents

Also @TheHidden1, sorry I haven't had time to look at the non-BN model yet; I've unfortunately been very busy with other stuff :'(

justadudewhohacks commented 6 years ago

Ohhh well that of course explains the behavior. You probably just saved me a lot of time breaking down the issue @MikeShi42. Thanks and very well explained!

Also, sorry that my issue is not related to tfjs-yolo-tiny at all, but to SSD in general. This issue got linked from tfjs somewhere and I didn't notice I had ended up in a different repo.

Thanks and have a great weekend!

MikeShi42 commented 6 years ago

@justadudewhohacks no worries! This issue is one that plagues object detection models with similar post-processing, so it's totally relevant. What I said is just a hunch that I think I've tested before, but the profiling tools were a bit finicky in Chrome for some reason, so I can't confirm 100% that it's the root cause (it seems reasonable and explains certain behaviors). It's a good place for you to sanity check as well if you wish :)

Hope everything is working out well for you!

justadudewhohacks commented 6 years ago

Just gave it a shot by awaiting a small timeout before calling .data(), and indeed what you explained is what I was facing. When awaiting a timeout of 100ms, the actual fetching of the data is just a few ms for the entire tensor. So I guess 100-150ms is simply the time for inference on my GPU.
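In case anyone wants to reproduce the check, it was essentially this (a sketch; model and input names are assumptions):

  const out = model.predict(batched);
  await new Promise(resolve => setTimeout(resolve, 100));  // give inference time to finish
  const t0 = performance.now();
  await out.data();                                        // now just a few ms: pure transfer
  console.log('download:', performance.now() - t0, 'ms');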

Thanks again!

hiddentn commented 6 years ago

@justadudewhohacks @MikeShi42 I tried measuring the timings with tf.time() and I am still getting the same results (I think): downloadWaitMs is still the major bottleneck, while kernelMs is fine. https://groups.google.com/a/tensorflow.org/forum/#!topic/tfjs/YfH5_GTnx3E

(screenshot attached) P.S.: I made some changes to the post-processing algorithm.
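The measurement looked roughly like this (a sketch; the pipeline being timed is a placeholder, and downloadWaitMs is the WebGL-backend field from tf.time()'s result mentioned above):

  const info = await tf.time(() => {
    const out = model.predict(batched);
    out.dataSync();   // bring the result back so the download falls inside the timed region
  });
  console.log('kernelMs:', info.kernelMs,
              'downloadWaitMs:', info.downloadWaitMs,
              'wallMs:', info.wallMs);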

MikeShi42 commented 6 years ago

@TheHidden1 Is that timing the entire execution of the model + post-processing, or only post-processing?

If it's only post-processing, is there any chance you could try timing the entire model + post-processing? Mainly for sanity checking that the kernel timing is accurate (and not just the CPU time spent on inference, of which there would be very little).

hiddentn commented 6 years ago

@MikeShi42 the timings are for the whole inference pipeline (pre-processing + prediction + post-processing).

I know the kernelMs times look weird, but the docs say:

kernelMs: Kernel execution time, ignoring data transfer.

and on the tfjs Google group:

The kernel time is just the time taken to execute the current process on the tensorflowjs backend (CPU or GPU), so wallTime = kernelTime (current process time) + otherProcesses time (incl. data transfer time).

It may also very well be a bug in tf.time(), or maybe I'm simply a wizard who made such good post-processing 🤣 🤣 I think the former.

MikeShi42 commented 6 years ago

@TheHidden1 Hrmmm... is YOLO rated to run at ~500 FPS on your system? It's not insane but at the same time 1-2ms inference times sound hard to believe.

Though I agree the documentation, support, and test cases seem to support that meaning. I guess it also confuses me how justadude above experiences less delay on the download when awaiting before the tensor download?

hiddentn commented 6 years ago

@MikeShi42 can you try running the benchmark yourself to see if you get similar results?

hiddentn commented 6 years ago

This is on the full YOLOv3: (screenshot attached)

I think there is something wrong with tf.time(), or I am doing something wrong.

hiddentn commented 6 years ago

Hehehe: https://groups.google.com/a/tensorflow.org/forum/?utm_medium=email&utm_source=footer#!msg/tfjs/YfH5_GTnx3E/Xx95BT4SAgAJ

Right now tf.time() is totally busted. The GPU timer we use was turned off in chrome recently because of Spectre / Meltdown.

MikeShi42 commented 6 years ago

@TheHidden1 thanks for the follow up, makes sense!