About the last statement:
I believe WaveDiff correctly loads the pre-trained model and this is just an issue of overwritten log files. Perhaps `physical_val-22_no_est` finished training last, so its output appears in the training part of the log file, while `physical_val-20_no_est` finished the metrics evaluation last and therefore wrote the final version of the log file, leaving log files that mix both models.
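One quick way to check which weights each job actually used (a minimal sketch, not part of WaveDiff; it assumes the other two jobs write `.out` files analogous to the one shown below) is to grep each job's own Slurm output file, which is never shared, for the weights-loading line:

```python
# Minimal check (not part of WaveDiff): each Slurm job writes its own .out file,
# so the "Loading PSF model weights from ..." line found there is unambiguous,
# unlike the shared wf-psf_*.log file that the parallel jobs overwrite.
for out_file in ["physical_val-20_no_est.out",
                 "physical_val-21_no_est.out",
                 "physical_val-22_no_est.out"]:
    with open(out_file) as f:
        for line in f:
            if "Loading PSF model weights from" in line:
                print(f"{out_file}: {line.strip()}")
```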
Here is the Jean Zay output file for the job `physical_val-22_no_est`, where it can be seen that the correct model is loaded for computing the metrics.
[usr@jean-zay3: jobs]$ cat physical_val-22_no_est.out
2024-04-23 18:37:48,144 - wavediff - INFO - #
2024-04-23 18:37:48,147 - wavediff - INFO - # Entering wavediff mainMethod()
2024-04-23 18:37:48,147 - wavediff - INFO - #
--->2024-04-23 18:37:48,148 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/training_config_22_no_est.yaml
--->2024-04-23 18:37:48,179 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
2024-04-23 18:37:55,274 - wavediff - INFO - <wf_psf.utils.configs_handler.TrainingConfigHandler object at 0x153074666fb0>
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - PSF Model class: `poly` initialized...
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - Preparing Keras model callback...
2024-04-23 18:37:55,694 - wf_psf.training.train - INFO - Preparing Keras model callback...
2024-04-23 18:37:55,695 - wf_psf.training.train - INFO - Starting cycle 1..
2024-04-23 18:38:00,484 - wf_psf.training.train_utils - INFO - Starting parametric update..
Epoch 1/20
Epoch 1: mean_squared_error improved from inf to 0.00003, saving model to /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/checkpoint/checkpoint_callback_polyphysical_val-22_no_est_cycle1
63/63 - 166s - loss: 3.3731e-05 - mean_squared_error: 3.2803e-05 - val_loss: 2.7561e-05 - val_mean_squared_error: 2.7561e-05 - 166s/epoch - 3s/step
Epoch 2/20
...
Epoch 118: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8170e-05 - mean_squared_error: 3.0134e-05 - val_loss: 2.6701e-05 - val_mean_squared_error: 2.6701e-05 - 159s/epoch - 3s/step
Epoch 119/120
Epoch 119: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8556e-05 - mean_squared_error: 3.0338e-05 - val_loss: 2.7635e-05 - val_mean_squared_error: 2.7635e-05 - 159s/epoch - 3s/step
Epoch 120/120
Epoch 120: mean_squared_error did not improve from 0.00003
63/63 - 159s - loss: 2.8431e-05 - mean_squared_error: 3.0429e-05 - val_loss: 2.6650e-05 - val_mean_squared_error: 2.6650e-05 - 159s/epoch - 3s/step
2024-04-24 06:05:07,577 - wf_psf.training.train - INFO - Cycle2 elapsed time: 22212.329714536667
2024-04-24 06:05:07,727 - wf_psf.training.train - INFO -
Total elapsed time: 41232.445971
2024-04-24 06:05:07,728 - wf_psf.training.train - INFO -
Training complete..
--->2024-04-24 06:05:07,769 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/metrics_config_no_est.yaml
2024-04-24 06:05:07,780 - wf_psf.utils.configs_handler - INFO - Running metrics evaluation on psf model: /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/psf_model/psf_model_polyphysical_val-22_no_est_cycle2
--->2024-04-24 06:05:07,780 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
--->2024-04-24 06:05:11,733 - wf_psf.utils.read_config - INFO - Loading.../gpfswork/rech/ynx/uds36vp/repos/physical_layer/configfiles/data_config_22.yaml
2024-04-24 06:05:15,826 - wf_psf.metrics.metrics_interface - INFO - Fetching and preprocessing training and test data...
--->2024-04-24 06:05:15,828 - wf_psf.metrics.metrics_interface - INFO - Loading PSF model weights from /gpfswork/rech/ynx/uds36vp/repos/physical_layer/output/wf-outputs/wf-outputs-202404231837/psf_model/psf_model_polyphysical_val-22_no_est_cycle2
2024-04-24 06:05:15,905 - wf_psf.metrics.metrics_interface - INFO -
***
Metric evaluation on the test dataset
***
2024-04-24 06:05:15,905 - wf_psf.metrics.metrics_interface - INFO - Computing polychromatic metrics at low resolution.
38/38 [==============================] - 24s 628ms/step
2024-04-24 06:05:41,379 - wf_psf.metrics.metrics - INFO - Using Ground Truth stars from dataset.
2024-04-24 06:05:41,414 - wf_psf.metrics.metrics - INFO - Absolute RMSE: 4.7764e-03 +/- 1.9585e-03
2024-04-24 06:05:41,414 - wf_psf.metrics.metrics - INFO - Relative RMSE: 4.8935e+01 % +/- 1.4269e+01 %
2024-04-24 06:05:41,416 - wf_psf.metrics.metrics_interface - INFO - Computing monochromatic metrics.
...
When running (training + metrics) for multiple models in parallel (by submitting several jobs on HPC systems), the log files are overwritten by the last model. If multiple jobs start at the same time, their outputs will be stored in the same `wf-outputs-YYYYMMDDHHmm` folder, every log file will have the same name (`wf-psf_YYYYMMDDHHmm.log`), and they will all be saved by WaveDiff in the same `log-files` directory.
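For illustration, a minimal sketch (not WaveDiff's actual code; the job-id suffix at the end is only a hypothetical workaround) of why a minute-resolution timestamp gives identical output paths to every job that starts in the same minute:

```python
import os
from datetime import datetime

# Jobs starting within the same minute compute exactly the same names.
timestamp = datetime.now().strftime("%Y%m%d%H%M")            # e.g. 202404231837
output_dir = f"wf-outputs/wf-outputs-{timestamp}"
log_file = os.path.join(output_dir, "log-files", f"wf-psf_{timestamp}.log")

# Hypothetical workaround: append a job-unique identifier (e.g. the Slurm job id)
# so that parallel jobs cannot overwrite each other's outputs and logs.
run_id = os.environ.get("SLURM_JOB_ID", str(os.getpid()))
unique_output_dir = f"{output_dir}-{run_id}"
unique_log_file = os.path.join(
    unique_output_dir, "log-files", f"wf-psf_{timestamp}_{run_id}.log"
)
```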
Moreover, when computing the metrics, WaveDiff will look for a trained model in the `wf-outputs/wf-outputs-YYYYMMDDHHmm/psf_model` directory, and it might pick the first one in alphabetical order that matches the `metrics_config.yaml` pre-trained model parameters (`model_save_path`, `saved_training_cycle`, etc.).
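As a purely hypothetical illustration of that suspected lookup (this is not WaveDiff's actual implementation), an alphabetically ordered glob over a shared `psf_model` directory would always return `physical_val-20_no_est` first, using the directory layout from this run:

```python
from glob import glob

# All three jobs started in the same minute, so their trained models end up in
# one shared directory (layout as in this run).
psf_model_dir = "wf-outputs/wf-outputs-202404231837/psf_model"

# Hypothetical lookup: match the metrics_config.yaml pre-trained model parameters
# (model name "poly", saved training cycle 2) and take the first match alphabetically.
candidates = sorted(glob(f"{psf_model_dir}/psf_model_poly*_cycle2"))
# ['.../psf_model_polyphysical_val-20_no_est_cycle2',
#  '.../psf_model_polyphysical_val-21_no_est_cycle2',
#  '.../psf_model_polyphysical_val-22_no_est_cycle2']
selected = candidates[0]   # every job would load physical_val-20_no_est's weights
```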
I am not entirely sure about this last statement, but here is the log file I got when running three different models (ids: `physical_val-20_no_est`, `physical_val-21_no_est`, `physical_val-22_no_est`) on Jean Zay. All three jobs started at the same time due to queueing. I have marked with an arrow (`--->`) the lines where the config files and the trained model are loaded.