CosmoStat / wf-psf

Data-driven wavefront-based PSF modelling framework.
MIT License

Update to recent versions of TensorFlow #90

Open jeipollack opened 9 months ago

jeipollack commented 9 months ago

WaveDiff implements the Rectified Adam Optimiser from the TensorFlow Addons library, which has stopped development (see details in the link); minimal maintenance releases will continue until May 2024. As a result, it is not compatible with the latest versions of TensorFlow (2.11+). Interestingly, 2.9+ is also affected, producing the error reported in Issue #88 when loading a saved checkpoint. The Rectified Adam Optimiser is not currently part of the TensorFlow 2 core library. The fix is to use the tf.keras.optimizers.legacy namespace (see here), which allows the old optimizers to work.
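
For illustration, the swap amounts to the following (a minimal sketch, not the WaveDiff code itself; the learning rate is a placeholder):

import tensorflow as tf

# In TF 2.11+, tf.keras.optimizers.Adam resolves to the new optimizer class,
# which cannot restore checkpoints written by the old implementation:
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)

# The legacy namespace keeps the old implementation available:
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=1e-2)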

This issue is to do a couple of tasks:

nadamoukaddem commented 9 months ago

After updating the optimizer to the legacy version, running the validation test on shape metrics resulted in the following error:

assert ratio_rmse_e1 < tol
E       assert 2.450450287128092e-09 < 1e-09

Relaxing the tolerance to 1e-8 then gives this error:

assert ratio_rel_rmse_e1 < tol
E       assert 1.647135320670401e-05 < 1e-08
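
For context, the failing checks compare a ratio against a fixed tolerance, roughly of this form (a sketch with illustrative numbers; the variable names and the ratio's exact definition are assumptions, not the actual test code):

# Hypothetical shape of the shape-metrics validation check.
tol = 1.0e-9
rmse_e1, rmse_e1_ref = 0.0234902504, 0.0234902503  # illustrative values only
ratio_rmse_e1 = abs(rmse_e1 - rmse_e1_ref) / rmse_e1_ref  # assumed definition
assert ratio_rmse_e1 < tol  # fails here, mirroring the reported error
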
nadamoukaddem commented 6 months ago

I evaluated the performance of the Adam Optimizer versus the Rectified Adam Optimizer; the results I obtained are attached (Adam_Vs_RectifiedAdam). I used these numbers of training epochs:

# Number of training epochs for training the parametric model parameters per cycle.
n_epochs_params: [15, 15]

# Number of training epochs for training the non-parametric model parameters per cycle.
n_epochs_non_params: [100, 50]
jeipollack commented 6 months ago

Thanks @nadamoukaddem for this update. Just to confirm: did you use the same random seed and learning rates for both runs? Perhaps you could share the complete training configuration file parameters.

nadamoukaddem commented 6 months ago

Yes, I used the same random seed and learning rates for both runs. Here is the training_config.yaml file:

training:
  # Run ID name
  id_name: -coherent_euclid_200stars

  # Name of Data Config file
  data_config: data_config.yaml

  # Metrics Config file - Enter file to run metrics evaluation else if empty run train only
  metrics_config: metrics_config.yaml

  # PSF model parameters
  model_params:
    # Model type. Options are: 'mccd', 'graph', 'poly', 'param', 'poly_physical'.
    model_name: poly

    # Number of wavelength bins to reconstruct polychromatic objects.
    n_bins_lda: 8

    # Downsampling rate to match the oversampled model to the specified telescope's sampling.
    output_Q: 3

    # Oversampling rate used for the OPD/WFE PSF model.
    oversampling_rate: 3

    # Dimension of the pixel PSF postage stamp
    output_dim: 32

    # Dimension of the OPD/Wavefront space.
    pupil_diameter: 256

    # Boolean to define if we use sample weights based on the noise standard deviation estimation
    use_sample_weights: True

    # Interpolation type for the physical poly model. Options are: 'none', 'all', 'top_K', 'independent_Zk'.
    interpolation_type: None

    # SED interpolation points per bin
    sed_interp_pts_per_bin: 0

    # SED extrapolate
    sed_extrapolate: True

    # SED interpolate kind
    sed_interp_kind: linear

    # Standard deviation of the multiplicative SED Gaussian noise.
    sed_sigma: 0

    # Limits of the PSF field coordinates for the x axis.
    x_lims: [0.0, 1.0e+3]

    # Limits of the PSF field coordinates for the y axis.
    y_lims: [0.0, 1.0e+3]

    # Hyperparameters for Parametric model
    param_hparams:
      # Set the random seed for TensorFlow initialization
      random_seed: 3877572

      # Set the parameter for the l2 loss function for the Optical Path Differences (OPD)/WFE
      l2_param: 0.

      # Zernike polynomial modes to use on the parametric part.
      n_zernikes: 15

      # Max polynomial degree of the parametric part.
      d_max: 2

      # Flag to save optimisation history for parametric model
      save_optim_history_param: true

    # Hyperparameters for non-parametric model
    nonparam_hparams:
      # Max polynomial degree of the non-parametric part.
      d_max_nonparam: 5

      # Number of graph features
      num_graph_features: 10

      # L1 regularisation parameter for the non-parametric part.
      l1_rate: 1.0e-8

      # Flag to enable Projected learning for DD_features to be used with `poly` or `semiparametric` model.
      project_dd_features: False

      # Flag to reset DD_features to be used with `poly` or `semiparametric` model
      reset_dd_features: False

      # Flag to save optimisation history for non-parametric model
      save_optim_history_nonparam: true

  # Training hyperparameters
  training_hparams:
    # Batch Size
    batch_size: 32

    # Multi-cyclic Parameters
    multi_cycle_params:
      # Total number of cycles to perform for training.
      total_cycles: 2

      # Train cycle definition. It can be: 'parametric', 'non-parametric', 'complete', 'only-non-parametric' and 'only-parametric'.
      cycle_def: complete

      # Flag to save all cycles. If "True", create a checkpoint at every cycle, else if "False" only save the checkpoint at the end of the training.
      save_all_cycles: True

      # Learning rates for training the parametric model parameters per cycle.
      learning_rate_params: [0.01, 0.004]

      # Learning rates for training the non-parametric model parameters per cycle.
      learning_rate_non_params: [0.1, 0.06]

      # Number of training epochs for training the parametric model parameters per cycle.
      n_epochs_params: [15, 15]

      # Number of training epochs for training the non-parametric model parameters per cycle.
      n_epochs_non_params: [100, 50]
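
For reference, the per-cycle lists in this file pair up by index: cycle i uses the i-th learning rate and epoch count. A minimal sketch of reading them (assuming PyYAML and the file name above):

import yaml

with open("training_config.yaml") as f:
    cfg = yaml.safe_load(f)

mc = cfg["training"]["training_hparams"]["multi_cycle_params"]
for i in range(mc["total_cycles"]):
    # Cycle i draws the i-th entry from each per-cycle list.
    print(
        f"cycle {i + 1}: lr_param={mc['learning_rate_params'][i]}, "
        f"lr_nonparam={mc['learning_rate_non_params'][i]}, "
        f"epochs_param={mc['n_epochs_params'][i]}, "
        f"epochs_nonparam={mc['n_epochs_non_params'][i]}"
    )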

nadamoukaddem commented 6 months ago

I remember that I used the same TensorFlow version (2.15.0) for both optimizers. It's been a while since I ran these tests, so I need to rerun them.

jeipollack commented 6 months ago

Okay. Could you run the tests 2-3 more times using different random numbers to see if the results are consistent? Make sure to use the same set of random numbers for both optimizers.

nadamoukaddem commented 6 months ago

I ran the tests twice with different random numbers, and the metrics changed slightly between runs. However, I noticed something strange: whether I use Adam or Rectified Adam, I get the same numbers for the metrics. I'm using TensorFlow 2.15.0.

jeipollack commented 6 months ago

I don't understand what you mean, as your first statement (the metrics differ) seems to contradict the second (the metrics are the same). Please explain more clearly and post the outputs.

nadamoukaddem commented 6 months ago

I have included the configurations I used in the attached document (configurations.odt): configs.yaml, training_config.yaml, and training_config_1.yaml. I didn't make any changes to the other configuration files. This is the output I get for the Adam and Rectified Adam optimizers for two random numbers:

Rectified Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

Adam

'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,

'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,

jeipollack commented 6 months ago

Can you confirm whether you rebuilt the package (`pip install .`) between changing optimisers? For added assurance, you can uninstall the wf-psf package, delete the associated build directories, and reinstall with the following steps:

cd wf-psf
pip uninstall wf-psf -y
rm -rf build
cd src
rm -rf wf_psf.egg-info
cd ..
pip install .

Otherwise, create two branches: one with Adam and the other with Rectified Adam.
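
To confirm that a reinstall actually took effect, a quick check is to read the installed distribution's recorded version (a generic Python sketch, not a WaveDiff command):

import importlib.metadata

# Reports the version of the wf_psf distribution visible to the active
# environment; a stale or duplicate install would show up here.
print(importlib.metadata.version("wf_psf"))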

nadamoukaddem commented 5 months ago

I ran WaveDiff with TensorFlow 2.11 and the Adam Optimizer. The code started the first cycle, then there was an error (see wf-psf_Adam.log). So I tried using the legacy Adam optimizer, but then at the end of the cycles there was another error (see wf-psf_legacy.log).

jeipollack commented 5 months ago

The first log looks like a GPU issue. I'm not sure how you're loading TensorFlow 2.11.

The second log states that you need to update the optimizer to be an instance of a TensorFlow 2.11+ compatible optimizer. This means replacing any legacy optimizers with their TensorFlow 2.11+ counterparts. So, what worked for 2.15 doesn't work for 2.11.

nadamoukaddem commented 5 months ago

If it were a GPU issue, I should have had the same problem when using the legacy optimizer.

jeipollack commented 5 months ago

True. I am too busy right now to assist you, but you can try looking online for some clues.

jeipollack commented 5 months ago

Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error

But sorry at the moment, this is as much as I can help right now.

nadamoukaddem commented 5 months ago

> Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error
>
> But sorry at the moment, this is as much as I can help right now.

I understand. Thank you. I'll take a look.

nadamoukaddem commented 5 months ago

These are the outputs of the validation tests for the two TensorFlow versions (2.9.1 and 2.11): psf_pytest_tf2.9.log, psf_pytest_tf2.11.log.

jeipollack commented 5 months ago

If you look carefully at your log, you will see that the validation tests did not run. They were skipped.

jeipollack commented 5 months ago

And it seems you solved your TensorFlow 2.11 issue, but didn't update this issue with how you solved it. It's really important that you share the solution to a reported problem in the issue.

nadamoukaddem commented 5 months ago

No, I haven't solved it yet. I read (link) that the following steps can fix the error:

 # Install NVCC
 conda install -c nvidia cuda-nvcc=11.3.58
 # Configure the XLA cuda directory
 mkdir -p $CONDA_PREFIX/etc/conda/activate.d
 printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
 # Copy libdevice file to the required path
 mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
 cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

I couldn't install it via Conda. Is this a good solution that I should keep working on?

jeipollack commented 5 months ago

If you didn't solve it, I don't understand how you were able to run the tests for TensorFlow 2.11 unless you ran it on a different system.

You can submit a ticket to Jean-Zay/Idris Support.

nadamoukaddem commented 5 months ago

Following the instructions of Idris support, the problem was resolved by adding these lines:

export CUDA_DIR=$CUDA_HOME
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME

nadamoukaddem commented 5 months ago

WaveDiff runs without error using TensorFlow 2.11 and the Adam optimizer, but with Rectified Adam and the legacy optimizer there is this error at the end of the cycles: ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, which can cause errors. Please update the optimizer referenced in your code to be an instance of

jeipollack commented 5 months ago

The message is cut off. Are you stuck here and unsure how to update the optimiser?

nadamoukaddem commented 5 months ago

Aren't we looking to compare the performance of the Adam optimizer with that of the Rectified Adam optimizer in this issue?

jeipollack commented 5 months ago

The complete error message was already reported in #88 by Ezequiel. I tried to reproduce it, but I didn't encounter this error when I launched some test runs. Below is my implementation:

pyproject.toml

[project]
name = "wf_psf"
requires-python = ">=3.9"
authors = [
    { "name" = "Tobias Liaudat", "email" = "tobiasliaudat@gmail.com"},
    { "name" = "Jennifer Pollack", "email" = "jennifer.pollack@cea.fr"},
]
maintainers = [
    { "name" = "Jennifer Pollack", "email" = "jennifer.pollack@cea.fr" },
]

description = 'A software framework to perform Differentiable wavefront-based PSF modelling.'
dependencies = [
    "numpy",
    "scipy",
    "keras",
    "tensorflow",
    "tensorflow-addons",
    "tensorflow-estimator",
    "zernike",
    "opencv-python",
    "pillow",
    "galsim",
    "astropy",
    "matplotlib",
    "seaborn",
]

I looked at the TensorFlow-Addons documentation and issues. Their optimizer points to tf.keras.optimizers.legacy.Optimizer, which is the legacy optimiser. So, the WaveDiff implementation of RectifiedAdam remains as is:

# Prepare the optimizers
param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_params[current_cycle - 1]
)
non_param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_non_params[current_cycle - 1]
)
logger.info("Starting cycle {}..".format(current_cycle))
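
A quick way to confirm this (our own check, not from the thread) is to verify that the TFA optimizer is built on the legacy base class:

import tensorflow as tf
import tensorflow_addons as tfa

opt = tfa.optimizers.RectifiedAdam(learning_rate=1e-3)
# On TF 2.11+ with a matching tensorflow-addons release, TFA optimizers
# subclass the legacy base class, which is what lets them restore old checkpoints.
print(isinstance(opt, tf.keras.optimizers.legacy.Optimizer))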

I went through and changed all tf.keras.optimizers.Adam -> tf.keras.optimizers.legacy.Adam.

I decided to use git diff to show the file changes.

diff --git a/src/wf_psf/psf_models/psf_models.py b/src/wf_psf/psf_models/psf_models.py
index 262faaa..f22de1a 100644
--- a/src/wf_psf/psf_models/psf_models.py
+++ b/src/wf_psf/psf_models/psf_models.py
@@ -164,7 +164,7 @@ def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):

     # Define optimizer function
     if optimizer is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
         )

and did these replacements (although I don't think they are used):

diff --git a/src/wf_psf/training/train_utils.py b/src/wf_psf/training/train_utils.py
index 8c1036a..49fc2c5 100644
--- a/src/wf_psf/training/train_utils.py
+++ b/src/wf_psf/training/train_utils.py
@@ -160,7 +160,7 @@ def general_train_cycle(

     # Define optimisers
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate_param,
             beta_1=0.9,
             beta_2=0.999,
@@ -289,7 +289,7 @@ def general_train_cycle(

         # Define optimiser
         if non_param_optim is None:
-            optimizer = tf.keras.optimizers.Adam(
+            optimizer = tf.keras.optimizers.legacy.Adam(
                 learning_rate=learning_rate_non_param,
                 beta_1=0.9,
                 beta_2=0.999,
@@ -364,7 +364,7 @@ def param_train_cycle(

     # Define optimiser
     if param_optim is None:
-        optimizer = tf.keras.optimizers.Adam(
+        optimizer = tf.keras.optimizers.legacy.Adam(
             learning_rate=learning_rate,
             beta_1=0.9,
             beta_2=0.999,

Could you try this and report an update?

nadamoukaddem commented 5 months ago

I don't have the build_PSF_model function in psf_models.py. Is it in the develop branch?

jeipollack commented 5 months ago

No, either way you can search for it in your branch and modify it there.

nadamoukaddem commented 5 months ago

This function doesn't exist in my branch. I can work on what you mentioned in yesterday's meeting, and we can discuss this issue later.

jeipollack commented 5 months ago

It does exist: https://github.com/CosmoStat/wf-psf/blob/87e0c8e9770199cd276f5f0551054cb4902d53bb/src/wf_psf/psf_models/tf_psf_field.py#L1214

nadamoukaddem commented 5 months ago

ok, so it's in tf_psf_field.py and not in psf_models.py.

jeipollack commented 5 months ago

Yes, I meant for you to search for the function in your branch to find which module it is in. Apologies if that wasn't clear.

I've done some refactoring in the branch where I tested and I moved the function to psf_models.py. That's related to the new task I asked you to work on yesterday. But, I decided I will work on it myself.

nadamoukaddem commented 5 months ago

Thank you. I tried the changes that you made in the tf_psf_field.py file, and I obtained the same results when comparing the two optimizers with TensorFlow 2.9 and 2.11. I extracted the metrics from the metrics-poly-coherent_euclid_200stars.npy file and used these training configuration parameters: training_config.log

Screenshot from 2024-03-28 12-20-48
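
As a side note, the metrics mentioned above can be pulled out of the saved file with NumPy (a sketch, assuming the .npy file stores a pickled dict, which is why allow_pickle is needed):

import numpy as np

# np.save wraps a dict in a 0-d object array; [()] unwraps it back to the dict.
metrics = np.load("metrics-poly-coherent_euclid_200stars.npy", allow_pickle=True)[()]
print(sorted(metrics.keys()))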

jeipollack commented 5 months ago

Thanks. Can you work on a Pull Request to update TensorFlow to 2.11 with the Rectified Adam optimiser, as well as the associated package dependencies, i.e. Keras, TensorFlow-Addons, etc.?

Make sure to run the validation tests on training and metrics; you have to run them locally with pytest, as they cannot run during CI.

nadamoukaddem commented 5 months ago

The train_test is failing even though I can run WaveDiff normally (see psf_pytest.log).

jeipollack commented 5 months ago

I am unable to reproduce your error, nor have I ever encountered it.

My steps were:

See the output here: psf_pytest1.txt

jeipollack commented 5 months ago

Hi Nada, I can open the PR since I was able to implement the needed change without an issue.

Could you work on #133, which is more urgently needed? Let Tobias and me know if you have questions, either by asking directly in #133 or in #sgs-sdc-fr-psf.

nadamoukaddem commented 5 months ago

Hi Jennifer, Ok.