affinelayer / pix2pix-tensorflow

Tensorflow port of Image-to-Image Translation with Conditional Adversarial Nets https://phillipi.github.io/pix2pix/
MIT License

Training loss doesn't decrease #96

Open groot-1313 opened 6 years ago

groot-1313 commented 6 years ago

I have completed 2 epochs of training on a dataset which contains 21000 images, but my training loss has not decreased at all.

A small snippet:

progress  epoch 3  step 3704  image/sec 4.8  remaining 14834m
discrim_loss 0.82492
gen_loss_GAN 2.00074
gen_loss_L1 0.148627
recording summary
progress  epoch 3  step 3754  image/sec 4.8  remaining 14840m
discrim_loss 0.831237
gen_loss_GAN 2.02428
gen_loss_L1 0.147841
progress  epoch 3  step 3804  image/sec 4.8  remaining 14834m
discrim_loss 0.797859
gen_loss_GAN 1.99995
gen_loss_L1 0.144412

The first epoch had similar losses. I know I should train for a few more epochs, but in each epoch the network is trained on all 21000 images, which I believe should have caused a decrease in the loss by now. Any input on how to proceed will be very much appreciated!

julien2512 commented 6 years ago

I experienced problems when I expanded a dataset 10 times: I needed to retrain the network completely! My added data was not of the same nature as my initial dataset.

Maybe your dataset is sorted in a particular order? I mean, if you train on A A A A A A A A A ... 10000 times and then B B B B B B B B B ... 10000 times, it won't work as well as learning A B A B A B A B A B ...
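
A toy illustration of that ordering point (the file names here are hypothetical, not from pix2pix):

a_paths = ["a_%04d.png" % i for i in range(10000)]  # frames of scene type A
b_paths = ["b_%04d.png" % i for i in range(10000)]  # frames of scene type B

# Sorted order: the network sees only A for a long time, then only B.
sequential = a_paths + b_paths

# Interleaved order: the network keeps seeing both kinds of data.
interleaved = [p for pair in zip(a_paths, b_paths) for p in pair]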

Otherwise, on my own datasets I always see the training loss drop within the first few steps (because we start from random weights, it is really easy for training to find something better).

You can try TensorBoard and what is called the embeddings projector if you want to visualise your dataset.
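
For what it's worth, here is a minimal sketch of that embeddings idea with the TF 1.x projector API; the log directory, variable name, and the random stand-in data are assumptions, not part of pix2pix:

import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

# Stand-in for your flattened dataset frames: 100 vectors of length 784.
images = np.random.rand(100, 28 * 28).astype(np.float32)
embedding_var = tf.Variable(images, name="dataset_embedding")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Save the variable so TensorBoard can load it.
    tf.train.Saver([embedding_var]).save(sess, "logs/embedding.ckpt")
    # Point the projector plugin at the saved tensor.
    config = projector.ProjectorConfig()
    config.embeddings.add().tensor_name = embedding_var.name
    projector.visualize_embeddings(tf.summary.FileWriter("logs"), config)

Then run tensorboard --logdir logs and open the Projector tab.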

groot-1313 commented 6 years ago

My dataset is a video file, so there is gradual change between scenes. There should be an option to shuffle the dataset while training, no?

julien2512 commented 6 years ago

For now it is sorted by name:

# if the image names are numbers, sort by the value rather than asciibetically
# having sorted inputs means that the outputs are sorted in test mode
if all(get_name(path).isdigit() for path in input_paths):
    input_paths = sorted(input_paths, key=lambda path: int(get_name(path)))
else:
    input_paths = sorted(input_paths)

You can try random.shuffle instead!
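
For example, a minimal sketch of that change in load_examples() (replacing the sorting above; the seed is an optional addition for reproducibility):

import random

random.seed(0)               # optional: reproducible ordering
random.shuffle(input_paths)  # shuffle in place instead of sorting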

julien2512 commented 6 years ago

I am confused, shuffle is already enabled here:

with tf.name_scope("load_images"):
    path_queue = tf.train.string_input_producer(input_paths, shuffle=a.mode == "train")
    reader = tf.WholeFileReader()
    paths, contents = reader.read(path_queue)
    raw_input = decode(contents)
    raw_input = tf.image.convert_image_dtype(raw_input, dtype=tf.float32)

since a.mode == "train" during training.

I have no more ideas other than a bad distribution of your source data. Do more epochs change anything?

groot-1313 commented 6 years ago

Yes, I am unsure why!

dustyYMelody7 commented 6 years ago

I also want to know why, please.

julien2512 commented 6 years ago

The gradient descent algorithm corrects the generator's answers by only a small step at a time.

With a step that is too large, you may miss the winning ticket.

Within one epoch, the answer for each image is corrected only once, with a small step.

So it is better to learn over more epochs, step by step ;)
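
A toy illustration of that step-size point (nothing pix2pix-specific: plain gradient descent on f(x) = x**2, whose gradient is 2x):

def descend(lr, steps=20, x=5.0):
    # Repeatedly step against the gradient of f(x) = x**2.
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(lr=0.1))  # ~0.06: small steps converge towards the minimum
print(descend(lr=1.1))  # ~192:  too large a step overshoots and diverges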