After upgrading MLPrimitives to version 0.3, Orion benchmarking pipelines fail when run in parallel (whether using dask or multiprocessing). In this latest release, tensorflow dramatically changed the underlying computation and composition of models, which I believe is the reason for this breakage.
For reference, I am using the lstm_dynamic_threshold pipeline in Orion, which uses keras.Sequential.LSTMTimeSeriesRegressor. The output first generates warnings indicating there is excessive computation happening in the adapter, so I presume some tweaks need to be made for this to work properly:
```
WARNING:tensorflow:5 out of the last 12 calls to <function
Model.make_test_function.<locals>.test_function at 0x7f218041c4d0>
triggered tf.function retracing. Tracing is expensive and the excessive
number of tracings could be due to (1) creating @tf.function repeatedly
in a loop, (2) passing tensors with different shapes, (3) passing Python
objects instead of tensors. For (1), please define your @tf.function
outside of the loop. For (2), @tf.function has experimental_relax_shapes=True
option that relaxes argument shapes that can avoid unnecessary retracing.
For (3), please refer to
https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args
and https://www.tensorflow.org/api_docs/python/tf/function for more details.
```
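To make the warning concrete, here is a toy illustration of the retracing behavior that does not use TensorFlow itself: a "traced" function caches one compiled graph per input signature (stubbed here as the argument length, standing in for tensor shape), so every previously unseen signature triggers a fresh trace. This corresponds to cause (2) in the warning; the names below are purely illustrative.

```python
# Toy model of tf.function retracing: one cached "graph" per input
# signature; an unseen signature forces a new (expensive) trace.
trace_count = 0

def traced(fn):
    cache = {}
    def wrapper(args):
        global trace_count
        signature = len(args)  # stand-in for the tensor shape
        if signature not in cache:
            trace_count += 1   # a "retrace" happens here
            cache[signature] = fn
        return cache[signature](args)
    return wrapper

@traced
def total(args):
    return sum(args)

total([1, 2, 3])  # new signature -> trace 1
total([4, 5, 6])  # same signature -> cached, no retrace
total([1, 2])     # new signature -> trace 2
```

In a benchmarking loop that feeds each pipeline run differently shaped windows, every run can hit a new signature and retrace, which is exactly the excessive-tracing pattern the warning describes.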
After 21 pipeline runs, the code fails with the dreaded Segmentation fault (core dumped).
Note: this problem only occurs when attempting to execute pipelines in parallel. When benchmarking serially, there is no issue.
After investigating the system settings, the cause turned out to be a memory issue: memory allocated during one run is not released before the next begins. Using multiprocessing so that each run's memory is released after its computation finishes makes everything work fine.
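The workaround above can be sketched as follows, assuming each benchmark run is isolated in its own process so the OS reclaims all of its memory (including TensorFlow's allocations) when the process exits. The `run_pipeline` body and pipeline names are hypothetical placeholders for the real Orion benchmark call.

```python
import multiprocessing as mp

def run_pipeline(name, queue):
    # Placeholder for one benchmark run; in the real setup this would
    # build and fit the Keras model. Everything allocated here is
    # reclaimed by the OS when this child process exits.
    queue.put(f"{name}: ok")

def run_all(pipelines):
    results = []
    for name in pipelines:
        queue = mp.Queue()
        # One process per run, so memory is released between runs
        # instead of accumulating until a segfault.
        proc = mp.Process(target=run_pipeline, args=(name, queue))
        proc.start()
        results.append(queue.get())
        proc.join()
    return results
```

Running the pipelines sequentially inside short-lived child processes trades a little process-startup overhead for a bounded memory footprint, which avoids the crash after many consecutive runs.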