Migrate to TensorFlow 2.X

TjarkMiener commented 4 years ago

I was trying to update the CTLearn 'multitask' branch to TensorFlow 2.X and I faced some difficulties/concerns about the HeadAPI and Graph structure with Estimators, which I wanted to discuss. I pushed some (unfinished) code in the 'tf2' branch here in CTLearn. So far, I focused on the single_tel model. The code starts training (with reasonable loss values), but don't increase the steps (-> train forever without reaching the evaluation). I'm not sure yet what's causing this behavior.

What is the problem about the HeadAPI and Graph structure in TF 2.X?

As you all know, TensorFlow 2.X integrated the Keras API to make the framework more pythonic. Essentially, moving from the Graph structure with Estimators to the eager execution. Since the TF layers are deprecated, TF 2.X are forcing the users to use the Keras Layers and the Keras Model. However, the estimators and Graph structure are still "supported". It might be better to say they are still there than supported. This blog post sum up the situation pretty well (worth reading the conclusion at least). This post is pointing out that the Keras Models (with Keras Layers) are not interacting very nicely with a custom non-Keras way of calculating and minimizing the loss (as the HeadAPI does).

Besides that the current 'tf2' branch printing out some warnings that function will be deprecated in the future. 1) Due to our Graph structure:

WARNING:tensorflow:From /home/tjark/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:From /home/tjark/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_core/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:tensorflow:From ctlearn/run_model.py:240: DatasetV1.make_one_shot_iterator (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use 'for ... in dataset:' to iterate over a dataset. If using 'tf.estimator', return the 'Dataset' object directly from your input function. As a last resort, you can use 'tf.compat.v1.data.make_one_shot_iterator(dataset)'.

2) Due to the HeadAPI implementation:

WARNING:tensorflow:From /home/tjark/anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/head/binary_class_head.py:206: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.

Multitask Learning is also supported by the Keras API, so we could potentially switch to Keras, when updating to TF 2.X. Any thoughts?

STSpencer commented 2 years ago

Hi all,

Just to let you know, other projects are having serious trouble with older CUDA versions and the new A100/RHEL8 upgrade on Wilkes. We can ask them to create new modules, but it seems like there's genuine driver issues with getting RHEL8 to play nicely with CUDA 10.0 and tf1.x (see for example https://gist.github.com/sub-mod/a21a2d71026a9d25f8b687e369e9729e).

Apparently there's at least 3 branches of CTLearn regarding Tensorflow 2. I suggest you either merge these into the main branch (including updating the environment .yml files) or just use the tf.compat.v1 classes instead in Tensorflow 2.7.0 (if nothing else to stop getting bugged by this github advisory https://github.com/advisories/GHSA-57wx-m983-2f88).

All the best, Sam

TjarkMiener commented 2 years ago

Hi @STSpencer! Thanks for reactivating this important issue! Given the fact that most of the newest GPUs do not support CUDA10 we are forced to migrate the keras backend of TF2.X. This issue is the most pressing one.

ctlearn-project / ctlearn

Migrate to TensorFlow 2.X #127