Closed: viotemp1 closed this issue 9 months ago
Hello @viotemp1. DLTensorPythonFunction and PythonFunction are meant for prototyping and debugging and should not be used when looking for good performance.
To achieve good performance, it is advised to extend DALI with custom operators written in C++/CUDA. You can read more about how to do that here: https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/examples/extend/create_a_custom_operator.html
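For orientation, once such an operator is compiled into a shared library, it can be loaded and used from Python roughly like the sketch below (the library path and the CustomDummy operator name are illustrative placeholders, not a specific build):

```python
import nvidia.dali.ops as dali_ops
import nvidia.dali.plugin_manager as plugin_manager

# Load the compiled custom-operator plugin (path is a placeholder).
plugin_manager.load_library("./customdummy/build/libcustomdummy.so")

# After loading, the operator is exposed like any built-in DALI operator.
custom_dummy = dali_ops.CustomDummy(device="cpu")
```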
@banasraf - any additional suggestion?
Got it, thanks. I'll use DALIDataset for some augmentations and add the ones still missing in DALI on top of a tf.data dataset (on CPU, unfortunately).
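A minimal sketch of that combination, assuming the nvidia.dali.plugin.tf.DALIDataset wrapper; the TFRecordPipeline arguments, output shapes and dtypes are placeholders, and tf_mean_filter2d is the augmentation function shown further down the thread:

```python
import tensorflow as tf
import nvidia.dali.plugin.tf as dali_tf

# DALI runs the augmentations it already supports...
pipe = TFRecordPipeline(batch_size=32, num_threads=4, device_id=0)
ds = dali_tf.DALIDataset(
    pipeline=pipe,
    batch_size=32,
    output_shapes=((32, 224, 224, 3), (32,)),   # placeholder shapes
    output_dtypes=(tf.float32, tf.int64),
    device_id=0)

# ...and the augmentations missing from DALI run on top via tf.data (CPU).
ds = ds.map(lambda img, label: (tf_mean_filter2d(img), label),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.prefetch(2)
```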
Hello @viotemp1. Would you be so kind as to share some details about the performance comparison you did? Which two pipelines did you compare that gave these results (600 ms/step vs 50 ms/step)? I would like to take a closer look at that. That said, given that all Python code called inside a pipeline runs synchronously on a single thread, a noticeable performance hit is definitely to be expected.
The difference is between using mean_filter2d (from tensorflow_addons.image) with DLTensorPythonFunction in DALI vs using it with tf.data's dataset.map. For now I have given up on using DALI for small datasets. When I need to work with big datasets and large images, I will try again.
```python
import numpy as np
import tensorflow as tf
from tensorflow_addons.image import mean_filter2d
import nvidia.dali.ops as dali_ops

@tf.function()
def tf_mean_filter2d(x: tf.Tensor, min_effect_size=0, max_effect_size=2) -> tf.Tensor:
    effect_size = np.random.randint(min_effect_size, max_effect_size)
    if effect_size > 0:
        xout = mean_filter2d(x, filter_shape=effect_size)  # padding='CONSTANT', constant_values=1
    else:
        xout = x
    return xout

# ...
class TFRecordPipeline(dali_Pipeline):
    # ...
    self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
        device="cpu", function=tf_mean_filter2d, synchronize_stream=True)
    # ...
```
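For comparison, the tf.data-only side of such a benchmark would look roughly like this (a sketch; filenames and parse_fn are hypothetical placeholders). The relevant difference is that map can run the function on several threads:

```python
import tensorflow as tf

# Hypothetical tf.data baseline: parse_fn yields (image, label) pairs.
ds = tf.data.TFRecordDataset(filenames).map(parse_fn)
ds = ds.batch(32)
ds = ds.map(lambda img, label: (tf_mean_filter2d(img), label),
            num_parallel_calls=tf.data.experimental.AUTOTUNE)  # multi-threaded
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
```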
Thanks for the reply @viotemp1. What comes to my mind about the performance difference is that running this transformation in a tf dataset utilizes more than one thread, and unfortunately that is not the case with DLTensorPythonFunction. We have had some discussions about multithreaded Python operators, but it does not seem easy for now.
The performance hit should not be as big for GPU (although it still might be visible), because the GPU does not need multiple CPU threads to be fast. The challenge with GPU Python operators, though, is device synchronization. I'm not sure how that should look when using tensorflow_addons.
And FYI: I've just extended the PythonFunction operator to support GPU execution, using CuPy arrays as the data format, and added TorchPythonFunction, which operates on PyTorch tensors on both CPU and GPU. In both of these operators the device synchronization is transparent to the user, so if you would eventually like to explore the topic again, you might find them useful.
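A minimal sketch of what the GPU PythonFunction variant could look like with CuPy (the gpu_flip function is purely illustrative; depending on the DALI version, the pipeline may also need to be built with exec_async=False and exec_pipelined=False):

```python
import cupy as cp
import nvidia.dali.ops as dali_ops

# Illustrative function: with device="gpu", DALI hands it CuPy arrays and
# takes care of stream synchronization around the call.
def gpu_flip(image):
    return cp.flip(image, axis=1)   # horizontal flip of an HWC image

# Inside a pipeline's __init__:
flip = dali_ops.PythonFunction(device="gpu", function=gpu_flip)
```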
Hi @viotemp1,
Thanks to enabling fully asynchronous execution for the Python functions, DLTensorPythonFunction should now impose less overhead on the whole pipeline. Would you mind checking it now?
Hello,
I'm trying to add some augmentations not yet available in DALI using DLTensorPythonFunction. I made some tests with tfa.image.mean_filter2d on CPU and it is more than 10x slower than using this function outside DALI (the profiler shows 600 ms/step vs 50 ms/step). I also tried PythonFunction, but that is even slower.
What can I do to make it faster?
Part of the code is:

```python
from tensorflow_addons.image import mean_filter2d
# ...
def tf_mean_filter2d(x: tf.Tensor) -> tf.Tensor:
    print("x", len(x), type(x), type(x[0]), x[0])  # x.shape
    # ...

# ...
self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
    device="cpu", function=tf_mean_filter2d, synchronize_stream=True)
# ...
```

On GPU I cannot make it work, most likely because of the mean_filter2d function. Thanks
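One possible direction for the GPU case (an untested sketch, not from this thread): DLTensorPythonFunction hands the function DLPack tensors, so a TF op such as mean_filter2d would need an explicit conversion, for example via tf.experimental.dlpack, inside the function:

```python
import tensorflow as tf
from tensorflow_addons.image import mean_filter2d

# Untested sketch: assuming the default batch_processing=True, the function
# receives a list of DLPack tensors (one per sample) and must return one.
def tf_mean_filter2d_dlpack(batch):
    out = []
    for capsule in batch:
        img = tf.experimental.dlpack.from_dlpack(capsule)   # DLPack -> tf.Tensor
        img = mean_filter2d(img, filter_shape=3)
        out.append(tf.experimental.dlpack.to_dlpack(img))   # tf.Tensor -> DLPack
    return out

# Hypothetical wiring, mirroring the CPU snippet above:
# self.tf_mean_filter2d = dali_ops.DLTensorPythonFunction(
#     device="gpu", function=tf_mean_filter2d_dlpack, synchronize_stream=True)
```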