Closed: viotemp1 closed this issue 9 months ago
Hello @viotemp1. DLTensorPythonFunction and PythonFunction are meant for prototyping and debugging and should not be used when looking for good performance.
To achieve good performance, it is advised to extend DALI with custom operators written in C++/CUDA. You can read more about how to do that here: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/extend/create_a_custom_operator.html
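For orientation, once such an operator is compiled into a shared library, it can be loaded and used from Python roughly like the sketch below (the library path and the CustomDummy operator name are illustrative placeholders, not a specific build):

```python
import nvidia.dali.ops as dali_ops
import nvidia.dali.plugin_manager as plugin_manager

# Load the compiled custom-operator plugin (path is a placeholder).
plugin_manager.load_library("./customdummy/build/libcustomdummy.so")

# After loading, the operator is exposed like any built-in DALI operator.
custom_dummy = dali_ops.CustomDummy(device="cpu")
```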
@banasraf - any additional suggestion?
Got it, thanks. I'll use DALIDataset for some augmentations and add the ones still missing in DALI on top of a tf.data dataset (on CPU, unfortunately).
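A minimal sketch of that combination, assuming the nvidia.dali.plugin.tf.DALIDataset wrapper; the TFRecordPipeline arguments, output shapes and dtypes are placeholders, and tf_mean_filter2d is the augmentation function shown further down the thread:

```python
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

# DALI runs the augmentations it already supports...
pipe = TFRecordPipeline(batch_size=32, num_threads=4, device_id=0)
ds = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=32,
    output_shapes=((32, 224, 224, 3), (32,)),   # placeholder shapes
    output_dtypes=(tf.float32, tf.int64),
    device_id=0)

# ...and the augmentations missing from DALI run on top via tf.data (CPU).
ds = ds.map(lambda img, label: (tf_mean_filter2d(img), label),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.prefetch(2)
```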
Hello @viotemp1. Would you be so kind as to share some details about the performance comparison you did? Which two pipelines did you compare that gave these results (600 ms/step vs 50 ms/step)? I would like to take a closer look at that. That said, given that all Python code called inside a pipeline runs synchronously on a single thread, a noticeable performance hit is definitely to be expected.
The difference is between using mean_filter2d (from tensorflow_addons.image) with DLTensorPythonFunction in DALI vs using it with tf.data's dataset.map. For now I have given up on using DALI for small datasets. When I need to work with big datasets and large images, I will try again.
```python
import numpy as np
import tensorflow as tf
from tensorflow_addons.image import mean_filter2d
import nvidia.dali.ops as dali_ops

@tf.function()
def tf_mean_filter2d(x: tf.Tensor, min_effect_size=0, max_effect_size=2) -> tf.Tensor:
    effect_size = np.random.randint(min_effect_size, max_effect_size)
    if effect_size > 0:
        xout = mean_filter2d(x, filter_shape=effect_size)  # padding='CONSTANT', constant_values=1
    else:
        xout = x
    return xout

# ...
class TFRecordPipeline(dali_Pipeline):
    # ...
    self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
        device="cpu", function=tf_mean_filter2d, synchronize_stream=True)
    # ...
```
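For comparison, the tf.data-only side of such a benchmark would look roughly like this (a sketch; filenames and parse_fn are hypothetical placeholders). The relevant difference is that map can run the function on several threads:

```python
import tensorflow as tf

# Hypothetical tf.data baseline: parse_fn yields (image, label) pairs.
ds = tf.data.TFRecordDataset(filenames).map(parse_fn)
ds = ds.batch(32)
ds = ds.map(lambda img, label: (tf_mean_filter2d(img), label),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)  # multi-threaded
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```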
Thanks for the reply @viotemp1. What comes to my mind about the performance difference is that running this transformation in a tf dataset utilizes more than one thread, and unfortunately that is not the case with DLTensorPythonFunction. We have had some discussions about multithreaded Python operators, but it does not seem easy for now.
The performance hit should not be as big for GPU (although it still might be visible), because the GPU does not need multiple CPU threads to be fast. The challenge with GPU Python operators, though, is device synchronization. I'm not sure how that should look when using tensorflow_addons.
And FYI: I've just extended the PythonFunction operator to support GPU execution, using CuPy arrays as the data format, and added TorchPythonFunction, which operates on PyTorch tensors on both CPU and GPU. In both of these operators the device synchronization is transparent to the user, so if you would eventually like to explore the topic again, you might find them useful.
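A minimal sketch of what the GPU PythonFunction variant could look like with CuPy (the gpu_flip function is purely illustrative; depending on the DALI version, the pipeline may also need to be built with exec_async=False and exec_pipelined=False):

```python
import cupy as cp
import nvidia.dali.ops as dali_ops

# Illustrative function: with device="gpu", DALI hands it CuPy arrays and
# takes care of stream synchronization around the call.
def gpu_flip(image):
    return cp.flip(image, axis=1)   # horizontal flip of an HWC image

# Inside a pipeline's __init__:
flip = dali_ops.PythonFunction(device="gpu", function=gpu_flip)
```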
Hi @viotemp1,
Thanks to enabling fully asynchronous execution for the Python functions, DLTensorPythonFunction should now impose less overhead on the whole pipeline. Would you mind checking it now?
Hello,
I'm trying to add some augmentations not yet available in DALI using DLTensorPythonFunction. I made some tests with tfa.image.mean_filter2d on CPU and it is more than 10x slower than using this function outside DALI (the profiler shows 600 ms/step vs 50 ms/step). I also tried PythonFunction, but that is even slower.
What can I do to make it faster?
Part of the code is:

```python
from tensorflow_addons.image import mean_filter2d
# ...
def tf_mean_filter2d(x: tf.Tensor) -> tf.Tensor:
    print("x", len(x), type(x), type(x[0]), x[0])  # x.shape
    # ...

# ...
self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
    device="cpu", function=tf_mean_filter2d, synchronize_stream=True)
# ...
```

On GPU I cannot make it work, most likely because of the mean_filter2d function. Thanks
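One possible direction for the GPU case (an untested sketch, not from this thread): DLTensorPythonFunction hands the function DLPack tensors, so a TF op such as mean_filter2d would need an explicit conversion, for example via tf.experimental.dlpack, inside the function:

```python
import tensorflow as tf
from tensorflow_addons.image import mean_filter2d

# Untested sketch: assuming the default batch_processing=True, the function
# receives a list of DLPack tensors (one per sample) and must return one.
def tf_mean_filter2d_dlpack(batch):
    out = []
    for capsule in batch:
        img = tf.experimental.dlpack.from_dlpack(capsule)   # DLPack -> tf.Tensor
        img = mean_filter2d(img, filter_shape=3)
        out.append(tf.experimental.dlpack.to_dlpack(img))   # tf.Tensor -> DLPack
    return out

# Hypothetical wiring, mirroring the CPU snippet above:
# self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
#     device="gpu", function=tf_mean_filter2d_dlpack, synchronize_stream=True)
```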