davidADSP / GDL_code

The official code repository for examples in the O'Reilly book 'Generative Deep Learning'
GNU General Public License v3.0

03_03_vae_digits_train: TypeError: unsupported format string passed to numpy.ndarray.__format__ #73

Open jdinkla opened 4 years ago

jdinkla commented 4 years ago

I am running on Ubuntu 18.04 with Python 3.6.9, and when running 03_03_vae_digits_train I encounter the error below from this call:

```python
vae.train(
    x_train
    , batch_size = BATCH_SIZE
    , epochs = EPOCHS
    , run_folder = RUN_FOLDER
    , print_every_n_batches = PRINT_EVERY_N_BATCHES
    , initial_epoch = INITIAL_EPOCH
)
```

I installed with the newest pip using `pip install -r requirements.txt`; no errors occurred, and I only had to additionally install graphviz.

BTW, numpy is 1.17.2 as required:

```
$ pip freeze | grep numpy
numpy==1.17.2
```

```log
Epoch 1/200
1874/1875 [============================>.] - ETA: 0s - loss: 58.4866 - reconstruction_loss: 55.2065 - kl_loss: 3.2801
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-a0cdb3ff19b5> in <module>
      5     , run_folder = RUN_FOLDER
      6     , print_every_n_batches = PRINT_EVERY_N_BATCHES
----> 7     , initial_epoch = INITIAL_EPOCH
      8 )

~/GDL_code/models/VAE.py in train(self, x_train, batch_size, epochs, run_folder, print_every_n_batches, initial_epoch, lr_decay)
    224             , epochs = epochs
    225             , initial_epoch = initial_epoch
--> 226             , callbacks = callbacks_list
    227         )
    228 

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
     64   def _method_wrapper(self, *args, **kwargs):
     65     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66       return method(self, *args, **kwargs)
     67 
     68     # Running inside `run_distribute_coordinator` already.

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    874           epoch_logs.update(val_logs)
    875 
--> 876         callbacks.on_epoch_end(epoch, epoch_logs)
    877         if self.stop_training:
    878           break

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in on_epoch_end(self, epoch, logs)
    363     logs = self._process_logs(logs)
    364     for callback in self.callbacks:
--> 365       callback.on_epoch_end(epoch, logs)
    366 
    367   def on_train_batch_begin(self, batch, logs=None):

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in on_epoch_end(self, epoch, logs)
   1175           self._save_model(epoch=epoch, logs=logs)
   1176       else:
-> 1177         self._save_model(epoch=epoch, logs=logs)
   1178     if self.model._in_multi_worker_mode():
   1179       # For multi-worker training, back up the weights and current training

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in _save_model(self, epoch, logs)
   1194                   int) or self.epochs_since_last_save >= self.period:
   1195       self.epochs_since_last_save = 0
-> 1196       filepath = self._get_file_path(epoch, logs)
   1197 
   1198       try:

~/GDL_code/env/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py in _get_file_path(self, epoch, logs)
   1242         # `{mape:.2f}`. A mismatch between logged metrics and the path's
   1243         # placeholders can cause formatting to fail.
-> 1244         return self.filepath.format(epoch=epoch + 1, **logs)
   1245       except KeyError as e:
   1246         raise KeyError('Failed to format this callback filepath: "{}". '

TypeError: unsupported format string passed to numpy.ndarray.__format__
```

jdinkla commented 4 years ago

On the tensorflow_2 branch.

jdinkla commented 4 years ago

It works on the master branch!

karaage0703 commented 4 years ago

@jdinkla

I changed this line as shown below:

```diff
- checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
+ checkpoint_filepath=os.path.join(run_folder, "weights/weights.h5")
```

Then I could run 03_03_vae_digits_train with no error.

I created a Google Colab notebook based on 03_03_vae_digits_train.

I hope this notebook helps you.

MarkusMiller commented 4 years ago

Considering the code around this line:

```python
checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
checkpoint1 = ModelCheckpoint(checkpoint_filepath, save_weights_only = True, verbose=1)
checkpoint2 = ModelCheckpoint(os.path.join(run_folder, 'weights/weights.h5'), save_weights_only = True, verbose=1)
```

Replacing "weights/weights-{epoch:03d}-{loss:.2f}.h5" with "weights/weights.h5" is sort of pointless, because checkpoint1 and checkpoint2 would then be exactly the same...

I tried to figure out what exactly caused the problem, but I'm quite unfamiliar with string formatting. I have a rough idea of what {epoch:03d}-{loss:.2f} does (it inserts a variable epoch, zero-padded to three digits, and a variable loss, rounded to two decimal places, into the string), but not why it fails here. I'm having the same issue and would be very grateful for a fix. Also on the tensorflow_2 branch.
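
For reference, a quick illustration of what those placeholders do when given plain Python scalars (the values here are made up):

```python
# str.format with format specs: epoch is zero-padded to 3 digits,
# loss is rounded to 2 decimal places.
path = "weights/weights-{epoch:03d}-{loss:.2f}.h5".format(epoch=7, loss=58.4866)
print(path)  # weights/weights-007-58.49.h5
```

Keras passes the logged metrics into exactly this kind of format call (see `self.filepath.format(epoch=epoch + 1, **logs)` in the traceback above); on the tensorflow_2 branch the logged loss is apparently a NumPy array rather than a scalar, which is what the format call chokes on (see olegboev's comment below).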

rk-ka commented 3 years ago

I faced the same problem. As far as I can tell, the error occurs because the return value of the loss function was rewritten in the form of a dictionary. To avoid the error, you can remove the last {loss:.2f} placeholder. In my case:

```python
checkpoint_filepath = os.path.join(run_folder, "weights/weights-{epoch:02d}.h5")
```

However, in the module 03_04_vae_digits_analysis I found that the weights saved as .h5 would not load into the model, so I save the weights in the .ckpt format instead.

Working on the TF2 branch https://github.com/kubokoHappy/GDL_code_kuboko, using TF 2.3 with GPU.
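
A minimal, self-contained sketch of that .ckpt workaround (the toy model and paths here are illustrative stand-ins, not the book's VAE):

```python
import os
import tensorflow as tf

# Tiny stand-in model; only the save/load mechanics matter here.
def build_model():
    return tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])

model = build_model()
os.makedirs("run/weights", exist_ok=True)

# Without an .h5 suffix, save_weights defaults to the TF checkpoint format.
model.save_weights("run/weights/weights.ckpt")

# Restore into a freshly built model with the same architecture.
restored = build_model()
restored.load_weights("run/weights/weights.ckpt")
```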

olegboev commented 2 years ago

The problem is that the loss value is a vector of batch-size length, so its mean has to be computed before it is logged. This fragment:

```python
return {
    "loss": total_loss,
    "reconstruction_loss": reconstruction_loss,
    "kl_loss": kl_loss,
}
```

should be replaced by this:

```python
return {
    "loss": tf.reduce_mean(total_loss),
    "reconstruction_loss": tf.reduce_mean(reconstruction_loss),
    "kl_loss": tf.reduce_mean(kl_loss),
}
```
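
A minimal repro of the underlying failure in plain NumPy (the loss values are invented): formatting a multi-element array with a format spec raises exactly this TypeError, while a scalar formats fine.

```python
import numpy as np

per_sample_loss = np.array([58.49, 55.21, 3.28])  # un-reduced: one value per sample

# This mirrors the filepath.format call shown in the traceback:
try:
    "weights-{epoch:03d}-{loss:.2f}.h5".format(epoch=1, loss=per_sample_loss)
except TypeError as e:
    print(e)  # unsupported format string passed to numpy.ndarray.__format__

# Reducing to a scalar first makes the placeholder work:
print("weights-{epoch:03d}-{loss:.2f}.h5".format(epoch=1, loss=per_sample_loss.mean()))
# weights-001-38.99.h5
```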