algorithmiaio / deepfilter-training

AMI + Script for training your own DeepFilter filters.

Result always comes out plain color #1

Open dotKokott opened 6 years ago

dotKokott commented 6 years ago

I'm having trouble getting this to run correctly, and I've tried it on multiple machines.

I ran it exactly as suggested on the AWS machine, on my own local Ubuntu installation, and on a few other bare-metal cloud servers. Even after your recent changes the results are no better. (I wasn't able to compile the specific Torch version you check out and had to switch it to master; could that be the problem?)

All I get after running the algo on my model is a plain black (or other solid-colored) image. I've tested with different styles, input images, and parameters.
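For reference, one quick way to confirm the output really is a single flat color (rather than, say, a very dark but structured image) is to check the per-channel standard deviation. This is just a minimal sketch using Pillow and NumPy, not something from this repo, and `output.png` is a placeholder path:

```python
import sys

import numpy as np
from PIL import Image


def is_flat_color(path, tol=1.0):
    """Return True if the image is (almost) a single solid color."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    # A genuinely flat image has ~0 standard deviation in every channel.
    per_channel_std = pixels.std(axis=(0, 1))
    print("mean RGB:", pixels.mean(axis=(0, 1)), "std RGB:", per_channel_std)
    return bool((per_channel_std < tol).all())


if __name__ == "__main__":
    print(is_flat_color(sys.argv[1] if len(sys.argv) > 1 else "output.png"))
```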

Any clue?

EDIT: One thing I noticed is that the loss seems to either decrease or climb continuously, more or less at random between runs. Should it be going down steadily?

besirkurtulmus commented 6 years ago

I've been trying to identify the issue. I tried a few things (hence the commits), but they haven't fixed the underlying problem yet. Still investigating.

Regarding the loss: based on the image I took a year ago, it's supposed to decrease gradually over time. When I train it now, I get the following:

#it:    1       loss:   1341921.9316406
#it:    2       loss:   1306356.3671875
#it:    3       loss:   1.7821605174124e+26
#it:    4       loss:   7.4171430364599e+26
#it:    5       loss:   4.6799621452352e+26
#it:    6       loss:   3.7766420358899e+26
#it:    7       loss:   1.1445348689613e+26
#it:    8       loss:   3.9928882446935e+26
#it:    9       loss:   2.2470980196544e+26
#it:    10      loss:   5.4632667864736e+26
#it:    11      loss:   3.0395338640645e+26
#it:    12      loss:   7.6261395163511e+25
#it:    13      loss:   3.1009426859745e+26
#it:    14      loss:   2.8894353736666e+26
#it:    15      loss:   1.9068393512413e+26
#it:    16      loss:   1.4979229966606e+26
#it:    17      loss:   2.7150609824805e+26
#it:    18      loss:   4.7865869056433e+26
#it:    19      loss:   1.3812571258155e+26
#it:    20      loss:   3.4940852441089e+26
#it:    21      loss:   2.0146548633907e+26
#it:    22      loss:   2.5888171585897e+26
#it:    23      loss:   2.3111547491787e+26
#it:    24      loss:   4.7020759579157e+26
#it:    25      loss:   3.6367000633728e+26
#it:    26      loss:   1.933889274846e+26

At the 3rd iteration the loss blows up by roughly 20 orders of magnitude. After that it just floats between about 1e26 and 1e27. There's definitely something wrong with the training.
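As an aside, here is a minimal sketch (not part of this repo) that scans a training log in the `#it: ... loss: ...` format above and flags this kind of blow-up; the 100x jump threshold is an arbitrary assumption:

```python
import re
import sys

LINE = re.compile(r"#it:\s*(\d+)\s+loss:\s*([0-9.eE+-]+)")


def check_losses(lines, blowup_factor=100.0):
    """Parse '#it: N loss: X' lines and report obvious divergence."""
    losses = [(int(m.group(1)), float(m.group(2)))
              for m in (LINE.search(line) for line in lines) if m]
    # Flag any iteration where the loss jumps by more than blowup_factor.
    for (_, prev), (it, cur) in zip(losses, losses[1:]):
        if cur > prev * blowup_factor:
            print(f"loss blew up at iteration {it}: {prev:.3e} -> {cur:.3e}")
    if losses and losses[-1][1] >= losses[0][1]:
        print("loss is not decreasing overall -- training looks broken")
    elif losses:
        print("loss is decreasing overall")


if __name__ == "__main__":
    check_losses(sys.stdin if len(sys.argv) == 1 else open(sys.argv[1]))
```

Run over the log above, it reports the jump at iteration 3 and that the loss never recovers.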

dotKokott commented 6 years ago

Ah, that's great to know. I've been burning weeks and credits on this, thinking the problem was on my side 😄

I will try the TensorFlow implementation of the same algorithm now and see if that still works. Will report back!

EDIT: Turns out the TensorFlow implementation I found doesn't implement the training part. Anyhow, I'll keep experimenting.

dotKokott commented 6 years ago

Progress!

I managed to get something working using a Docker image I found via the deep-photo-styletransfer repo: https://github.com/martinbenson/deep-photo-styletransfer

It seems to pin a Torch configuration close to what was working at the time. I've only trained to 10k iterations so far, but if I plug that checkpoint model into Algorithmia I get a working output, and the loss goes down steadily.

It's probably a good idea to clone (and clean up) that Docker image and use it as the basis for the environment you supply.

besirkurtulmus commented 6 years ago

@dotKokott I'll look into switching out the base image if the training process is not fixable.