aleju / imgaug

Image augmentation for machine learning experiments.
http://imgaug.readthedocs.io
MIT License

Multiprocessing and tensorflow 2.x #661

Open oak-tree opened 4 years ago

oak-tree commented 4 years ago

Hey,

Background

We use imgaug heavily for image augmentation, through a custom generator that we built to feed Keras's fit_generator function. For performance reasons we enable multiprocessing in fit_generator, which causes Keras to spawn a pool of workers to handle the generated items.

Issue

Since TF 2.x we have started getting the following warning from TensorFlow:

WARNING:tensorflow:multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
[2020-05-11 07:49:40,808] - WARNING: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.

Any idea how to overcome this? I recall that imgaug is not thread-safe.

aleju commented 4 years ago

imgaug should be thread-safe, I think. You only have to be careful with the seed that each child process uses, otherwise you risk that all workers use the same seed and generate the same transformations (just applied to different images). There is also the problem of ensuring reproducibility when you can't be sure which worker process will get which batch of data, so you might have to set the worker's seed on a per-batch basis, conditional on the batch's unique ID.
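A minimal sketch of the per-batch seeding idea, using only the standard library. The helper names `batch_seed` and `sample_params` are hypothetical, and `random.Random` stands in for the augmenter's random state; with imgaug itself, the derived seed would presumably be handed to the library's own seeding call (for example `imgaug.seed(...)`, depending on the installed version) before the batch is augmented.

```python
import hashlib
import random


def batch_seed(base_seed: int, batch_id: int) -> int:
    """Derive a 32-bit seed from the run's base seed and the batch's unique ID."""
    digest = hashlib.sha256(f"{base_seed}:{batch_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def sample_params(base_seed: int, batch_id: int) -> dict:
    """Sample illustrative augmentation parameters for one batch.

    The result depends only on (base_seed, batch_id), not on which worker
    process happens to handle the batch.
    """
    # Assumption: random.Random is a stand-in for the augmenter's RNG.
    rng = random.Random(batch_seed(base_seed, batch_id))
    return {
        "rotate_deg": rng.uniform(-25.0, 25.0),
        "flip_lr": rng.random() < 0.5,
    }
```

Because the seed is a function of the batch ID only, rerunning training with the same base seed reproduces the same augmentation for each batch regardless of worker scheduling.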

I'm not that familiar with tf.data, but as far as I know it basically comes down to generating the dataset of examples once and then (during train/eval) applying only tensorflow functions onto it. I.e. no numpy data is allowed. imgaug does not have such tensorflow implementations of its operations, and therefore cannot be used in this way. The only way to still use it is to apply imgaug during the dataset generation, e.g. by saving each image not once, but ten augmented times.

tnybny commented 4 years ago

You're right, imgaug is not thread-safe, so when you use use_multiprocessing=True, you have to use the "forkserver" start method to create new processes and avoid deadlocks. Set it with import multiprocessing as mp followed by mp.set_start_method("forkserver") at the start of your program. The TensorFlow warnings will still show up, but they can be safely ignored and you should not run into any deadlocks. When you do this, you also have to make sure that all objects in your generator are picklable.
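The setup described above could look roughly like the following sketch. The helpers `use_forkserver` and `find_unpicklable` and the `generator_state` dictionary are hypothetical names for illustration; `force=True` is added here only so that calling the helper more than once does not raise, and is not part of the original suggestion.

```python
import multiprocessing as mp
import pickle


def use_forkserver() -> bool:
    """Select the "forkserver" start method if the platform offers it."""
    if "forkserver" not in mp.get_all_start_methods():
        return False  # not available on this platform, e.g. Windows
    mp.set_start_method("forkserver", force=True)
    return True


def find_unpicklable(attributes: dict) -> list:
    """Return the names of the attributes that cannot be pickled."""
    bad = []
    for name, value in attributes.items():
        try:
            pickle.dumps(value)
        except Exception:
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Call this before any worker processes are created.
    use_forkserver()

    # Hypothetical generator attributes: the lambda cannot be pickled,
    # so it would need to become a module-level function.
    generator_state = {
        "paths": ["a.png", "b.png"],
        "batch_size": 32,
        "preprocess": lambda image: image,
    }
    print(find_unpicklable(generator_state))
```

Checking picklability up front matters because "forkserver" workers receive the generator by pickling it, so lambdas, open file handles, and similar objects stored on the generator fail only once the workers start.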

But, as @aleju alluded, you might have to be careful to set a different imgaug seed for each child process, so that the workers do not all perform the same augmentations in the same order on your data.
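To illustrate why this matters, here is a small single-process simulation, not imgaug code. It assumes that a child process starts with a copy of the parent's random state, which is mimicked with `copy.deepcopy` on a `random.Random` instance; the worker indices and `BASE_SEED` are hypothetical, and in a real Keras worker something like `os.getpid()` might have to serve as the distinguishing value.

```python
import copy
import random

BASE_SEED = 1234


def draw_angles(rng: random.Random, n: int = 3) -> list:
    """Draw n rotation angles, standing in for sampled augmentation parameters."""
    return [rng.uniform(-25.0, 25.0) for _ in range(n)]


parent_rng = random.Random(BASE_SEED)

# Simulated workers that inherit an identical copy of the parent's RNG state:
# both produce exactly the same sequence of "augmentations".
inherited = [copy.deepcopy(parent_rng) for _ in range(2)]
same = [draw_angles(rng) for rng in inherited]

# Simulated workers that are reseeded with a per-worker offset:
# each one now produces its own sequence.
reseeded = [random.Random(BASE_SEED + 1 + index) for index in range(2)]
different = [draw_angles(rng) for rng in reseeded]
```

Here `same[0]` and `same[1]` are identical lists, while `different[0]` and `different[1]` are not.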

anhtu812 commented 3 years ago

Can I turn off multiprocessing in imgaug, or use only multithreading?