About the last statement:
I believe WaveDiff correctly loads the pre-trained model and this is just an issue of overwritten log files. Perhaps `physical_val-22_no_est` finished training last, so its output appears in the training part of the log file, while `physical_val-20_no_est` finished the metrics evaluation last and therefore wrote the final version of the log file, leaving log files that mix both models.
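One quick way to check which weights each job actually used (a minimal sketch, not part of WaveDiff; it assumes the other two jobs write `.out` files analogous to the one shown below) is to grep each job's own Slurm output file, which is never shared, for the weights-loading line:

```python
# Minimal check (not part of WaveDiff): each Slurm job writes its own .out file,
# so the "Loading PSF model weights from ..." line found there is unambiguous,
# unlike the shared wf-psf_*.log file that the parallel jobs overwrite.
for out_file in ["physical_val-20_no_est.out",
                 "physical_val-21_no_est.out",
                 "physical_val-22_no_est.out"]:
    with open(out_file) as f:
        for line in f:
            if "Loading PSF model weights from" in line:
                print(f"{out_file}: {line.strip()}")
```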
Here is the Jean Zay output file for the job `physical_val-22_no_est`, where it can be seen that the correct model is loaded for computing the metrics.
[usr@jean-zay3: jobs]$ cat physical_val-22_no_est.out
2024-04-23 18:37:48,144 - wavediff - INFO - #
2024-04-23 18:37:48,147 - wavediff - INFO - # Entering wavediff mainMethod()
2024-04-23 18:37:48,147 - wavediff - INFO - #
--->2024-04-23 18:37:48,148 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/training_config_22_no_est.yaml
--->2024-04-23 18:37:48,179 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
2024-04-23 18:37:55,274 - wavediff - INFO - <wf_psf.utils.configs_handler.TrainingConfigHandler object at 0x153074666fb0>
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - PSF Model class: `poly` initialized...
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - Preparing Keras model callback...
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - Preparing Keras model callback...
2024-04-23 18:37:55,695 - wf_psf.training.train - INFO - Starting cycle 1..
2024-04-23 18:38:00,484 - wf_psf.training.train_utils - INFO - Starting parametric update..
Epoch 1/20
Epoch 1: mean_squared_error improved from inf to 0.00003, saving model to /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/checkpoint/checkpoint_callback_polyphysical_val-22_no_est_cycle1
63/63 - 166s - loss: 3.3731e-05 - mean_squared_error: 3.2803e-05 - val_loss: 2.7561e-05 - val_mean_squared_error: 2.7561e-05 - 166s/epoch - 3s/step
Epoch 2/20
...
Epoch 118: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8170e-05 - mean_squared_error: 3.0134e-05 - val_loss: 2.6701e-05 - val_mean_squared_error: 2.6701e-05 - 159s/epoch - 3s/step
Epoch 119/120
Epoch 119: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8556e-05 - mean_squared_error: 3.0338e-05 - val_loss: 2.7635e-05 - val_mean_squared_error: 2.7635e-05 - 159s/epoch - 3s/step
Epoch 120/120
Epoch 120: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8431e-05 - mean_squared_error: 3.0429e-05 - val_loss: 2.6650e-05 - val_mean_squared_error: 2.6650e-05 - 159s/epoch - 3s/step
2024-04-24 06:05:07,577 - wf_psf.training.train - INFO - Cycle2 elapsed time: 22212.329714536667
2024-04-24 06:05:07,727 - wf_psf.training.train - INFO -
Total elapsed time: 41232.445971
2024-04-24 06:05:07,728 - wf_psf.training.train - INFO -
Training complete..
--->2024-04-24 06:05:07,769 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/metrics_config_no_est.yaml
2024-04-24 06:05:07,780 - wf_psf.utils.configs_handler - INFO - Running metrics evaluation on psf model: /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/psf_model/psf_model_polyphysical_val-22_no_est_cycle2
--->2024-04-24 06:05:07,780 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
--->2024-04-24 06:05:11,733 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
2024-04-24 06:05:15,826 - wf_psf.metrics.metrics_interface - INFO - Fetching and preprocessing training and test data...
--->2024-04-24 06:05:15,828 - wf_psf.metrics.metrics_interface - INFO - Loading PSF model weights from /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/psf_model/psf_model_polyphysical_val-22_no_est_cycle2
2024-04-24 06:05:15,905 - wf_psf.metrics.metrics_interface - INFO -
***
Metric evaluation on the test dataset
***
2024-04-24 06:05:15,905 - wf_psf.metrics.metrics_interface - INFO - Computing polychromatic metrics at low resolution.
38/38 [==============================] - 24s 628ms/step
2024-04-24 06:05:41,379 - wf_psf.metrics.metrics - INFO - Using Ground Truth stars from dataset.
2024-04-24 06:05:41,414 - wf_psf.metrics.metrics - INFO - Absolute RMSE: 4.7764e-03 +/- 1.9585e-03
2024-04-24 06:05:41,414 - wf_psf.metrics.metrics - INFO - Relative RMSE: 4.8935e+01 % +/- 1.4269e+01 %
2024-04-24 06:05:41,416 - wf_psf.metrics.metrics_interface - INFO - Computing monochromatic metrics.
...
When running (training + metrics) for multiple models in parallel (by submitting several jobs on HPC systems), the log files are overwritten by the last model. If multiple jobs start at the same time, their outputs will be stored in the same `wf-outputs-YYYYMMDDHHmm` folder, every log file will have the same name (`wf-psf_YYYYMMDDHHmm.log`), and they will all be saved by WaveDiff in the same `log-files` directory.
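For illustration, a minimal sketch (not WaveDiff's actual code; the job-id suffix at the end is only a hypothetical workaround) of why a minute-resolution timestamp gives identical output paths to every job that starts in the same minute:

```python
import os
from datetime import datetime

# Jobs starting within the same minute compute exactly the same names.
timestamp = datetime.now().strftime("%Y%m%d%H%M")            # e.g. 202404231837
output_dir = f"wf-outputs/wf-outputs-{timestamp}"
log_file = os.path.join(output_dir, "log-files", f"wf-psf_{timestamp}.log")

# Hypothetical workaround: append a job-unique identifier (e.g. the Slurm job id)
# so that parallel jobs cannot overwrite each other's outputs and logs.
run_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))
unique_output_dir = f"{output_dir}-{run_id}"
unique_log_file = os.path.join(
    unique_output_dir, "log-files", f"wf-psf_{timestamp}_{run_id}.log"
)
```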
Moreover, when computing the metrics, WaveDiff will look for a trained model in the `wf-outputs/wf-outputs-YYYYMMDDHHmm/psf_model` directory, and it might pick the first one in alphabetical order that matches the `metrics_config.yaml` pre-trained model parameters (`model_save_path`, `saved_training_cycle`, etc.).
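As a purely hypothetical illustration of that suspected lookup (this is not WaveDiff's actual implementation), an alphabetically ordered glob over a shared `psf_model` directory would always return `physical_val-20_no_est` first, using the directory layout from this run:

```python
from glob import glob

# All three jobs started in the same minute, so their trained models end up in
# one shared directory (layout as in this run).
psf_model_dir = "wf-outputs/wf-outputs-202404231837/psf_model"

# Hypothetical lookup: match the metrics_config.yaml pre-trained model parameters
# (model name "poly", saved training cycle 2) and take the first match alphabetically.
candidates = sorted(glob(f"{psf_model_dir}/psf_model_poly*_cycle2"))
# ['.../psf_model_polyphysical_val-20_no_est_cycle2',
#  '.../psf_model_polyphysical_val-21_no_est_cycle2',
#  '.../psf_model_polyphysical_val-22_no_est_cycle2']
selected = candidates[0]   # every job would load physical_val-20_no_est's weights
```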
I am not entirely sure about this last statement, but here is the log file I got when running three different models (ids: `physical_val-20_no_est`, `physical_val-21_no_est`, `physical_val-22_no_est`) on Jean Zay. All three jobs started at the same time due to queueing. I have marked with an arrow (`--->`) the lines where the config files and the trained model are loaded.