Open jdinkla opened 4 years ago
On the tensorflow_2 branch.
It works on the master branch!
@jdinkla
I changed this line as shown below.
- checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
+ checkpoint_filepath=os.path.join(run_folder, "weights/weights.h5")
Then I can run 03_03_vae_digits_train with no error.
I created a Google Colab notebook based on 03_03_vae_digits_train.
I hope this notebook helps you.
Considering the code around this line:
checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:03d}-{loss:.2f}.h5")
checkpoint1 = ModelCheckpoint(checkpoint_filepath, save_weights_only = True, verbose=1)
checkpoint2 = ModelCheckpoint(os.path.join(run_folder, 'weights/weights.h5'), save_weights_only = True, verbose=1)
replacing "weights/weights-{epoch:03d}-{loss:.2f}.h5" with "weights/weights.h5" is sort of pointless, because checkpoint1 and checkpoint2 would be exactly the same...
I tried to figure out what exactly caused the problem, but I'm quite unfamiliar with string formatting, so I have some idea of what {epoch:03d}-{loss:.2f} does (it puts a variable 'epoch', zero-padded to 3 digits, and a variable 'loss', with 2 decimal places, into the string?) but not why. I'm having the same issue and would be very grateful for a fix. Also on branch tensorflow_2.
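To illustrate what the placeholders do, here is a minimal sketch. It assumes the checkpoint callback fills the file name template with Python's str.format, passing the epoch number and the logged metrics as keyword arguments; the values below are made up.

```python
# Template from the notebook; the values passed in are made up for illustration.
template = "weights/weights-{epoch:03d}-{loss:.2f}.h5"

# {epoch:03d}: an integer, zero-padded to a width of 3.
# {loss:.2f}: a float, rounded to 2 decimal places.
name = template.format(epoch=7, loss=0.12345)
print(name)  # weights/weights-007-0.12.h5

# A path without placeholders comes back unchanged, whatever is passed in,
# so "weights/weights.h5" cannot trigger a formatting error.
fixed = "weights/weights.h5".format(epoch=7, loss=0.12345)
print(fixed)  # weights/weights.h5
```

With the placeholders, each epoch gets its own file; without them, the same file is overwritten every epoch.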
I faced the same problem. As far as I can tell, the error occurs because the return value of the loss function has been rewritten as a dictionary. To avoid the error, you can remove the trailing {loss:.2f}.
In my case:
checkpoint_filepath=os.path.join(run_folder, "weights/weights-{epoch:02d}.h5")
However, in the module "03_04_vae_digits_analysis" I found that the weights saved in .h5 format are not loaded into the model. Therefore, I save the weights in .ckpt format instead.
Working on the TF2 branch: https://github.com/kubokoHappy/GDL_code_kuboko. Using TF 2.3 with GPU.
The problem is that the loss value is a vector whose length is the batch size, so its mean has to be computed before it is returned. This fragment:
return {
    "loss": total_loss,
    "reconstruction_loss": reconstruction_loss,
    "kl_loss": kl_loss,
}
should be replaced by this:
return {
    "loss": tf.reduce_mean(total_loss),
    "reconstruction_loss": tf.reduce_mean(reconstruction_loss),
    "kl_loss": tf.reduce_mean(kl_loss),
}
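Why the mean matters can be sketched without TensorFlow. This assumes the callback formats the file name with the logged loss value. A plain Python list stands in for the per-sample loss vector, and the numbers are made up.

```python
template = "weights/weights-{epoch:03d}-{loss:.2f}.h5"

# Stand-in for a per-sample loss vector (one made-up value per batch element).
per_sample_loss = [0.52, 0.48, 0.50, 0.54]

# A sequence does not accept the float format spec ".2f", so this raises.
try:
    template.format(epoch=1, loss=per_sample_loss)
    vector_ok = True
except TypeError:
    vector_ok = False

# Reducing to a scalar first (the role tf.reduce_mean plays in the fix) works.
mean_loss = sum(per_sample_loss) / len(per_sample_loss)
scalar_name = template.format(epoch=1, loss=mean_loss)

print(vector_ok)    # False
print(scalar_name)  # weights/weights-001-0.51.h5
```

Returning scalars from the training step keeps the original checkpoint file name pattern usable, so the loss can stay in the file name.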
I am running on Ubuntu 18.04 with Python 3.6.9 and when running 03_03_vae_digits_train I encounter the following error:
I installed using the newest pip with
pip install -r requirements.txt
and no errors occurred; I also had to install graphviz. BTW, numpy is 1.17.2 as required.