Open jeipollack opened 9 months ago
After updating the optimizer to the legacy version, running the validation test on shape metrics resulted in the following error:
assert ratio_rmse_e1 < tol
E assert 2.450450287128092e-09 < 1e-09
Relaxing the tolerance to 1e-8 gives this error:
assert ratio_rel_rmse_e1 < tol
E assert 1.647135320670401e-05 < 1e-08
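For context, here is a minimal sketch of the kind of tolerance check that fails here. The ratio is defined as an absolute relative difference between a reference metric and a newly computed one. That definition is an assumption for illustration, and the actual WaveDiff test may compute it differently. The numeric values and the helper names `relative_ratio` and `passes` are hypothetical.

```python
def relative_ratio(reference: float, measured: float) -> float:
    """Absolute relative difference between a measured metric and its reference."""
    return abs(1.0 - measured / reference)


def passes(reference: float, measured: float, tol: float) -> bool:
    """True when the measured metric agrees with the reference to within tol."""
    return relative_ratio(reference, measured) < tol


# Hypothetical values: a reference metric and a run that differs at the 2.45e-9 level.
reference = 0.02
measured = reference * (1.0 + 2.45e-9)

# A difference of this size is rejected at tol=1e-9 but accepted at tol=1e-8.
print(passes(reference, measured, tol=1e-9))
print(passes(reference, measured, tol=1e-8))
```

A check like this is sensitive to any change in floating-point ordering, so a different optimizer implementation can trip it even when the trained model is practically unchanged.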
I compared the performance of the Adam optimizer with that of the Rectified Adam optimizer. Here are the results I obtained, using the following numbers of training epochs:
Number of training epochs for training the parametric model parameters per cycle.
n_epochs_params: [15, 15]
Number of training epochs for training the non-parametric model parameters per cycle.
n_epochs_non_params: [100, 50]
Thanks @nadamoukaddem for this update. Just to confirm, did you use the same random seed and learning rates for both runs? Perhaps you could share the complete set of parameters from the training configuration file.
Yes, I used the same random seed and learning rates for both runs. Here is the training_config.yaml file:
training:
  # Run ID name
  id_name: -coherent_euclid_200stars
  # Name of Data Config file
  data_config: data_config.yaml
  # Metrics Config file - Enter file to run metrics evaluation else if empty run train only
  metrics_config: metrics_config.yaml
  # PSF model parameters
  model_params:
    # Model type. Options are: 'mccd', 'graph', 'poly, 'param', 'poly_physical'."
    model_name: poly
    #Num of wavelength bins to reconstruct polychromatic objects.
    n_bins_lda: 8
    #Downsampling rate to match the oversampled model to the specified telescope's sampling.
    output_Q: 3
    #Oversampling rate used for the OPD/WFE PSF model.
    oversampling_rate: 3
    #Dimension of the pixel PSF postage stamp
    output_dim: 32
    #Dimension of the OPD/Wavefront space."
    pupil_diameter: 256
    #Boolean to define if we use sample weights based on the noise standard deviation estimation
    use_sample_weights: True
    #Interpolation type for the physical poly model. Options are: 'none', 'all', 'top_K', 'independent_Zk'."
    interpolation_type: None
    # SED intepolation points per bin
    sed_interp_pts_per_bin: 0
    # SED extrapolate
    sed_extrapolate: True
    # SED interpolate kind
    sed_interp_kind: linear
    # Standard deviation of the multiplicative SED Gaussian noise.
    sed_sigma: 0
    #Limits of the PSF field coordinates for the x axis.
    x_lims: [0.0, 1.0e+3]
    #Limits of the PSF field coordinates for the y axis.
    y_lims: [0.0, 1.0e+3]
    # Hyperparameters for Parametric model
    param_hparams:
      # Set the random seed for Tensor Flow Initialization
      random_seed: 3877572
      # Set the parameter for the l2 loss function for the Optical path differences (OPD)/WFE
      l2_param: 0.
      #Zernike polynomial modes to use on the parametric part.
      n_zernikes: 15
      #Max polynomial degree of the parametric part. m
      d_max: 2
      #Flag to save optimisation history for parametric model
      save_optim_history_param: true
    # Hyperparameters for non-parametric model
    nonparam_hparams:
      #Max polynomial degree of the non-parametric part. chg to max_deg_nonparam
      d_max_nonparam: 5
      # Number of graph features
      num_graph_features: 10
      #L1 regularisation parameter for the non-parametric part."
      l1_rate: 1.0e-8
      #Flag to enable Projected learning for DD_features to be used with `poly` or `semiparametric` model.
      project_dd_features: False
      #Flag to reset DD_features to be used with `poly` or `semiparametric` model
      reset_dd_features: False
      #Flag to save optimisation history for non-parametric model
      save_optim_history_nonparam: true
  # Training hyperparameters
  training_hparams:
    # Batch Size
    batch_size: 32
    # Multi-cyclic Parameters
    multi_cycle_params:
      # Total number of cycles to perform for training.
      total_cycles: 2
      # Train cycle definition. It can be: 'parametric', 'non-parametric', 'complete', 'only-non-parametric' and 'only-parametric'."
      cycle_def: complete
      # Flag to save all cycles. If "True", create a checkpoint at every cycle, else if "False" only save the checkpoint at the end of the training."
      save_all_cycles: True
      # Learning rates for training the parametric model parameters per cycle.
      learning_rate_params: [0.01, 0.004]
      # Learning rates for training the non-parametric model parameters per cycle.
      learning_rate_non_params: [0.1, 0.06]
      # Number of training epochs for training the parametric model parameters per cycle.
      n_epochs_params: [15, 15]
      # Number of training epochs for training the non-parametric model parameters per cycle.
      n_epochs_non_params: [100, 50]
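To make the per-cycle lists concrete, here is a small sketch of how the settings for a given cycle could be looked up, assuming each list is indexed by the 1-based cycle number minus one. The helper `cycle_settings` is hypothetical and not part of WaveDiff. The values are copied from the multi_cycle_params section of the configuration.

```python
# Values copied from the multi_cycle_params section of the configuration.
multi_cycle_params = {
    "total_cycles": 2,
    "learning_rate_params": [0.01, 0.004],
    "learning_rate_non_params": [0.1, 0.06],
    "n_epochs_params": [15, 15],
    "n_epochs_non_params": [100, 50],
}


def cycle_settings(params: dict, current_cycle: int) -> dict:
    """Return the per-cycle settings for a 1-based cycle number (hypothetical helper)."""
    total = params["total_cycles"]
    if not 1 <= current_cycle <= total:
        raise ValueError(f"cycle {current_cycle} is outside 1..{total}")
    idx = current_cycle - 1  # lists are 0-based, cycles are 1-based
    return {
        key: values[idx]
        for key, values in params.items()
        if isinstance(values, list)
    }


for cycle in range(1, multi_cycle_params["total_cycles"] + 1):
    print(cycle, cycle_settings(multi_cycle_params, cycle))
```

With this configuration the second cycle trains with smaller learning rates and fewer non-parametric epochs than the first.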
As far as I remember, I used the same TensorFlow version (2.15.0) for both optimizers. It's been a while since I ran these tests, so I need to rerun them.
Okay. Could you run the tests 2-3 more times using different random numbers to see if the results are consistent? Make sure to use the same set of random numbers for both optimizers.
I ran the tests twice with different random numbers, and the metrics changed slightly between the two runs. However, I noticed something strange: for a given random number, I get the same metric values whether I use Adam or Rectified Adam. I'm using TensorFlow 2.15.0.
I don't understand what you mean as your first statement (metrics differ) seems contrary to the second statement (metrics are the same). Try explaining more clearly and post the outputs.
I have included the configurations I used in the attached text document. These include configs.yaml, training_config.yaml, and training_config_1.yaml. I didn't make any changes to the other configuration files.
configurations.odt
This is the output I get for the Adam and Rectified Adam optimizers for two random seeds:
Rectified Adam
'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,
'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,
Adam
'rmse_e1': 0.023490250341625767, 'std_rmse_e1': 0.015843164999311765, 'rel_rmse_e1': 555.6691819599646, 'std_rel_rmse_e1': 544.2162301831149, 'rmse_e2': 0.009168282438918063, 'std_rmse_e2': 0.008922320388766475, 'rel_rmse_e2': 346.9258300561109, 'std_rel_rmse_e2': 346.38851755041367, 'rmse_R2_meanR2': 0.03669683267550114, 'std_rmse_R2_meanR2': 0.025587624399757175, 'pix_rmse': 9.126631e-05, 'pix_rmse_std': 2.2697419e-05, 'rel_pix_rmse': 6.096496433019638, 'rel_pix_rmse_std': 2.035357244312763,
'rmse_e1': 0.0241800061185133, 'std_rmse_e1': 0.014729668133574656, 'rel_rmse_e1': 550.9202217189359, 'std_rel_rmse_e1': 538.483941515418, 'rmse_e2': 0.00846779738001486, 'std_rmse_e2': 0.008098060249920496, 'rel_rmse_e2': 336.6814546254379, 'std_rel_rmse_e2': 332.86960115973426, 'rmse_R2_meanR2': 0.10814865342870499, 'std_rmse_R2_meanR2': 0.04149496242444061, 'pix_rmse': 9.000804e-05, 'pix_rmse_std': 2.2791739e-05, 'rel_pix_rmse': 6.022337824106216, 'rel_pix_rmse_std': 1.9995957612991333,
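One quick way to confirm that two runs are bit-identical, rather than merely close, is to compare the metric dictionaries key by key. The sketch below uses a subset of the first-seed values listed above. The helper `differing_keys` is hypothetical.

```python
import math

# Subset of the first-seed metrics reported for each optimizer.
rectified_adam = {
    "rmse_e1": 0.023490250341625767,
    "rmse_e2": 0.009168282438918063,
    "pix_rmse": 9.126631e-05,
}
adam = {
    "rmse_e1": 0.023490250341625767,
    "rmse_e2": 0.009168282438918063,
    "pix_rmse": 9.126631e-05,
}


def differing_keys(a: dict, b: dict, rel_tol: float = 0.0) -> list:
    """Return sorted keys that are missing on one side or whose values differ.

    With rel_tol=0.0 the comparison demands exactly equal floats.
    """
    return sorted(
        key
        for key in a.keys() | b.keys()
        if key not in a
        or key not in b
        or not math.isclose(a[key], b[key], rel_tol=rel_tol, abs_tol=0.0)
    )


print(differing_keys(rectified_adam, adam))
```

An empty list for two nominally different optimizers suggests that both runs executed the same code path, for example because an installed copy of the package was not rebuilt between runs.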
Can you confirm whether you rebuilt the package (pip install .) between changing optimisers?
For added assurance, you can uninstall the wf-psf package, delete associated build directories, and reinstall with the following steps:
cd wf-psf
pip uninstall wf-psf -y
rm -rf build
cd src
rm -rf wf_psf.egg-info
cd ..
pip install .
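After reinstalling, it can help to check which copy of the package Python actually imports. This is a minimal sketch using only the standard library. The import name `wf_psf` is assumed from the source tree layout, and `module_origin` is a hypothetical helper.

```python
import importlib.util
from typing import Optional


def module_origin(module_name: str) -> Optional[str]:
    """Return the file Python would import module_name from, or None if it is not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Prints a path inside site-packages for a regular install,
# or None if the package is not installed in this environment.
print(module_origin("wf_psf"))
```

If the printed path points at an old build directory or a different environment, the code change made between runs is not the code being executed.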
Otherwise, create two branches, one with Adam and the other with Rectified Adam.
wf-psf_Adam.log wf-psf_legacy.log
I ran WaveDiff with TensorFlow 2.11 and the Adam optimizer. The code started the first cycle, then there was an error (see log file). So I tried the legacy Adam optimizer, but then at the end of the cycles there was another error.
The first log looks like a GPU issue. I'm not sure how you're loading TensorFlow 2.11.
Second log states that you need to update the optimizer to be an instance of a TensorFlow 2.11+ compatible optimizer. This means replacing any legacy optimizers with their TensorFlow 2.11+ counterparts. And, what worked for 2.15 doesn't work for 2.11.
If it's a GPU issue, I should have the same problem when using the legacy optimizer.
True, I am too busy right now to assist you. You can try looking online for some clues.
Have a look here: https://stackoverflow.com/questions/71153492/invalid-argument-error-graph-execution-error
But sorry at the moment, this is as much as I can help right now.
I understand. Thank you. I'll take a look.
psf_pytest_tf2.9.log psf_pytest_tf2.11.log
These are the outputs of the validation tests for the two versions of TensorFlow (2.9.1 and 2.11).
If you look carefully at your log, you will see that the validation tests did not run. They were skipped.
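A small sketch for spotting this automatically: pytest ends its output with a summary line framed by `=` characters, such as `1 passed, 3 skipped in 0.42s`, and the counts can be pulled out with a regular expression. The sample log text below is hypothetical, as is the helper `summary_counts`. Running `pytest -rs` additionally prints the reason for each skipped test.

```python
import re

SUMMARY_PATTERN = re.compile(
    r"(\d+) (passed|failed|skipped|deselected|xfailed|xpassed|warnings?|errors?)"
)


def summary_counts(log_text: str) -> dict:
    """Collect outcome counts from pytest summary lines (lines framed by '=')."""
    counts = {}
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("=") and stripped.endswith("="):
            for number, outcome in SUMMARY_PATTERN.findall(stripped):
                counts[outcome] = int(number)
    return counts


# Hypothetical log excerpt, not taken from the attached files.
sample_log = (
    "============ test session starts ============\n"
    "collected 4 items\n"
    "========= 1 passed, 3 skipped in 0.42s ========="
)

counts = summary_counts(sample_log)
print(counts)
if counts.get("skipped"):
    print(f"{counts['skipped']} test(s) were skipped; rerun with 'pytest -rs' to see why")
```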
And it seems you solved your TensorFlow 2.11 issue, but didn't update this issue with how you solved it. It's really important that you share the solution to a reported problem in the issue.
No, I haven't solved it yet. I read (link) that the following steps can fix the error:
# Install NVCC
conda install -c nvidia cuda-nvcc=11.3.58
# Configure the XLA cuda directory
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Copy libdevice file to the required path
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/
I couldn't install it via Conda. Is this a reasonable approach, and should I keep working on it?
If you didn't solve it, I don't understand how you were able to run the tests for TensorFlow 2.11 unless you ran it on a different system.
You can submit a ticket to Jean-Zay/Idris Support.
Following the instructions of Idris support, the problem was resolved by adding these lines:
export CUDA_DIR=$CUDA_HOME
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME
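To check from Python that the setting is visible to the process, the flag can be read back from the environment. This is a sketch using only the standard library. The helper `xla_cuda_data_dir` is hypothetical, and the `nvvm/libdevice` subdirectory is assumed from the libdevice path used in the Conda instructions earlier in this thread.

```python
import os
from typing import Mapping, Optional


def xla_cuda_data_dir(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Extract the --xla_gpu_cuda_data_dir value from XLA_FLAGS, if present."""
    for flag in env.get("XLA_FLAGS", "").split():
        name, _, value = flag.partition("=")
        if name == "--xla_gpu_cuda_data_dir":
            return value
    return None


cuda_dir = xla_cuda_data_dir()
if cuda_dir is None:
    print("XLA_FLAGS does not set --xla_gpu_cuda_data_dir")
else:
    # Assumed layout: libdevice lives under <cuda_dir>/nvvm/libdevice.
    libdevice = os.path.join(cuda_dir, "nvvm", "libdevice")
    print(f"{libdevice} exists: {os.path.isdir(libdevice)}")
```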
WaveDiff runs without error using TensorFlow 2.11 and the Adam optimizer, but with Rectified Adam and the legacy optimizer, this error appears at the end of the cycles:
ValueError: You are trying to restore a checkpoint from a legacy Keras optimizer into a v2.11+ Optimizer, which can cause errors. Please update the optimizer referenced in your code to be an instance of
The message is cut off. Are you stuck here and unsure how to update the optimiser?
Aren't we looking to compare the performance of the Adam optimizer with that of the Rectified Adam optimizer in this issue?
The complete error message was already reported in #88 by Ezequiel. I tried to reproduce it, but I didn't encounter this error when I launched some test runs. Below is my implementation.
pyproject.toml
[project]
name = "wf_psf"
requires-python = ">=3.9"
authors = [
{ "name" = "Tobias Liaudat", "email" = "tobiasliaudat@gmail.com"},
{ "name" = "Jennifer Pollack", "email" = "jennifer.pollack@cea.fr"},
]
maintainers = [
{ "name" = "Jennifer Pollack", "email" = "jennifer.pollack@cea.fr" },
]
description = 'A software framework to perform Differentiable wavefront-based PSF modelling.'
dependencies = [
"numpy",
"scipy",
"keras",
"tensorflow",
"tensorflow-addons",
"tensorflow-estimator",
"zernike",
"opencv-python",
"pillow",
"galsim",
"astropy",
"matplotlib",
"seaborn",
]
I looked at the TensorFlow-Addons documentation and issues. Their optimizer points to tf.keras.optimizers.legacy.Optimizer, which is the legacy optimiser. So, the WaveDiff implementation of RectifiedAdam remains as is:
# Prepare the optimizers
param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_params[current_cycle - 1]
)
non_param_optim = tfa.optimizers.RectifiedAdam(
    learning_rate=training_handler.learning_rate_non_params[current_cycle - 1]
)
logger.info("Starting cycle {}..".format(current_cycle))
I went through and changed all tf.keras.optimizers.Adam -> tf.keras.optimizers.legacy.Adam. I decided to use git diff to show the file changes.
diff --git a/src/wf_psf/psf_models/psf_models.py b/src/wf_psf/psf_models/psf_models.py
index 262faaa..f22de1a 100644
--- a/src/wf_psf/psf_models/psf_models.py
+++ b/src/wf_psf/psf_models/psf_models.py
@@ -164,7 +164,7 @@ def build_PSF_model(model_inst, optimizer=None, loss=None, metrics=None):
# Define optimizer function
if optimizer is None:
- optimizer = tf.keras.optimizers.Adam(
+ optimizer = tf.keras.optimizers.legacy.Adam(
learning_rate=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False
)
and made these replacements (although I don't think these code paths are used):
diff --git a/src/wf_psf/training/train_utils.py b/src/wf_psf/training/train_utils.py
index 8c1036a..49fc2c5 100644
--- a/src/wf_psf/training/train_utils.py
+++ b/src/wf_psf/training/train_utils.py
@@ -160,7 +160,7 @@ def general_train_cycle(
# Define optimisers
if param_optim is None:
- optimizer = tf.keras.optimizers.Adam(
+ optimizer = tf.keras.optimizers.legacy.Adam(
learning_rate=learning_rate_param,
beta_1=0.9,
beta_2=0.999,
@@ -289,7 +289,7 @@ def general_train_cycle(
# Define optimiser
if non_param_optim is None:
- optimizer = tf.keras.optimizers.Adam(
+ optimizer = tf.keras.optimizers.legacy.Adam(
learning_rate=learning_rate_non_param,
beta_1=0.9,
beta_2=0.999,
@@ -364,7 +364,7 @@ def param_train_cycle(
# Define optimiser
if param_optim is None:
- optimizer = tf.keras.optimizers.Adam(
+ optimizer = tf.keras.optimizers.legacy.Adam(
learning_rate=learning_rate,
beta_1=0.9,
beta_2=0.999,
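As an alternative to hard-coding the namespace at each call site, the choice could be made in one place. The following is only a sketch of that idea, not what the branch does. The helper `resolve_adam` is hypothetical, and the stand-in classes exist solely so the example runs without TensorFlow installed. It makes no claim about which TensorFlow releases expose a `legacy` namespace, and it simply checks at runtime.

```python
from types import SimpleNamespace


def resolve_adam(optimizers_module):
    """Prefer optimizers_module.legacy.Adam when a legacy namespace exists (hypothetical helper)."""
    legacy = getattr(optimizers_module, "legacy", None)
    if legacy is not None and hasattr(legacy, "Adam"):
        return legacy.Adam
    return optimizers_module.Adam


# Stand-ins for tf.keras.optimizers so the sketch runs without TensorFlow.
class NewAdam:
    pass


class LegacyAdam:
    pass


with_legacy = SimpleNamespace(Adam=NewAdam, legacy=SimpleNamespace(Adam=LegacyAdam))
without_legacy = SimpleNamespace(Adam=NewAdam)

print(resolve_adam(with_legacy).__name__)
print(resolve_adam(without_legacy).__name__)
```

With TensorFlow available, the intended usage would be along the lines of `resolve_adam(tf.keras.optimizers)(learning_rate=1e-2)`, keeping the remaining keyword arguments as in the diffs above.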
Could you try this and report an update?
I don't have the build_PSF_model function in psf_models.py. Is it in the develop branch?
No, either way you can search for it in your branch and modify it there.
This function doesn't exist. I can work on what you mentioned yesterday in the meeting so we can discuss this issue later.
OK, so it's in tf_psf_field.py and not in psf_models.py.
yes, I meant for you to search for the function in your branch to find what module it is in. Apologies if that wasn't clear to you.
I've done some refactoring in the branch where I tested, and I moved the function to psf_models.py. That's related to the new task I asked you to work on yesterday. But I decided I will work on it myself.
Thank you. I tried the changes that you made in the tf_psf_field.py file, and I obtained the same results when comparing the two optimizers with TensorFlow 2.9 and 2.11. I extracted the metrics from the metrics-poly-coherent_euclid_200stars.npy file and used these training configuration parameters: training_config.log
Thanks. Can you work on a Pull Request to update TensorFlow to 2.11 with the Rectified Adam optimiser, along with the associated package dependencies, i.e. Keras, TensorFlow-Addons, etc.?
Make sure to run the validation tests on training and metrics, which you have to run locally with pytest as they cannot run during CI.
The train_test is failing even though I can run WaveDiff normally.
psf_pytest.log
I am unable to reproduce your error nor have I ever encountered it.
My steps were:
develop
See the output here: psf_pytest1.txt
Hi Nada, I can open the PR since I was able to implement the needed change without an issue.
Could you work on #133 which is more urgently needed? Let Tobias and me know if you have questions by either asking directly in #133 or in #sgs-sdc-fr-psf.
Hi Jennifer, Ok.
WaveDiff implements the Rectified Adam Optimiser from the TensorFlow Addons library, which has stopped development (see details in link). Minimal maintenance releases will continue until May 2024. As a result, it is not compatible with TensorFlow 2.11+. Interestingly, 2.9+ is also affected, producing the error reported in Issue #88 when loading a saved checkpoint. The Rectified Adam Optimiser is currently not part of the core library of TensorFlow 2. The fix is to use the tf.keras.optimizers.legacy namespace (see here), which allows the old optimizers to work. This issue is to do a couple of tasks: