fzliu / style-transfer

An implementation of "A Neural Algorithm of Artistic Style" by L. Gatys, A. Ecker, and M. Bethge. http://arxiv.org/abs/1508.06576.

Just a question #16

Open ghost opened 8 years ago

ghost commented 8 years ago

I have a quick question. I was thinking of writing a wrapper for https://github.com/jwetzl/CudaLBFGS to avoid using scipy. Do you think that would help speed up the optimizations? I have large batches of images to process, so unfortunately waiting is not an option. If you have any advice or ideas, I'd appreciate it.

fzliu commented 8 years ago

Thanks for your interest in the code! I've actually been looking into optimizing the pipeline as much as possible recently, and L-BFGS was one of the first things I looked at. I did some benchmarks using a max length of 500 for input images - it turns out that around 90% of the time is spent in the forward-backward steps for VGG. You'll probably be able to squeeze out a bit of extra performance by moving the loss minimization step to the GPU, but my guess is that it'll be negligible unless you're working with large images.
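
Roughly speaking, the measurement amounts to timing the forward-backward loop, something like the minimal pycaffe sketch below (the prototxt/caffemodel paths are placeholders, and this only covers the forward-backward portion, not the full pipeline):

```python
import time
import caffe

# Placeholder paths - substitute the actual VGG deploy files.
net = caffe.Net("vgg16.prototxt", "vgg16.caffemodel", caffe.TEST)

n_runs = 50
start = time.time()
for _ in range(n_runs):
    net.forward()   # forward pass through all VGG layers
    net.backward()  # backward pass to get gradients w.r.t. the input
print("mean forward-backward time: %.3fs" % ((time.time() - start) / n_runs))
```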

With that being said, I'm about to push some code to the develop branch that starts with a small image and progressively scales it up, using the output of each smaller pass to initialize the next minimization pass. The total runtime here is around 10 minutes on my CPU, and the results are comparable to just doing a standard 500-iteration pass (which takes over 2 hours).
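
Conceptually, the progressive pass looks something like the sketch below; `minimize_loss` and the scale/iteration schedule are stand-ins for illustration, not the actual code on develop:

```python
from skimage.transform import resize

def progressive_transfer(content_img, style_img,
                         scales=(0.25, 0.5, 1.0),   # illustrative schedule
                         iters=(500, 125, 30)):
    """Coarse-to-fine style transfer: seed each pass with the upscaled
    output of the previous, smaller pass."""
    result = None
    for scale, n_iter in zip(scales, iters):
        h = int(content_img.shape[0] * scale)
        w = int(content_img.shape[1] * scale)
        content = resize(content_img, (h, w))
        # first pass starts from the downscaled content image
        init = content.copy() if result is None else resize(result, (h, w))
        result = minimize_loss(init, content, style_img, n_iter)  # stand-in
    return result
```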

I haven't tried it on GPU because my poor 750M doesn't have enough memory to support the VGG model. If you have a K40/K80 or Titan X/Z, feel free to drop some performance numbers here - I'm curious to see how fast it will run on a state-of-the-art GPU.

ghost commented 8 years ago

Yeah, totally - I have a Titan X. Update the develop branch and I'll run some tests tomorrow. Also, I just updated my master branch and noticed that the "-g 0" flag isn't working anymore, even though the older version of the code runs fine on the GPU. That's strange - I just thought I'd give you a heads-up. I'll try to debug that tomorrow too.

fzliu commented 8 years ago

Just pushed to develop - results might require some tuning but they should be decent for now. And yup, there are unfortunately some issues with the master branch right now due to a merge from a couple of days ago, but I'm holding off on any further changes to master until the runtime gets into a more reasonable range.

ghost commented 8 years ago

I ran some tests.

At 1920 x 1080 pixels, using around 9 GB of GPU memory:

```
style.py::09:43:53.369 -- Starting style transfer.
style.py::09:43:54.560 -- Running net on GPU 0.
style.py::09:43:55.576 -- Successfully loaded images.
style.py::09:44:00.524 -- Successfully loaded model vgg.
style.py::09:44:00.524 -- Minimization pass 1 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:07:45
style.py::09:51:46.582 -- Ran 513 iterations in 466s.
style.py::09:51:46.587 -- Minimization pass 2 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:07:14
style.py::09:59:03.202 -- Ran 129 iterations in 437s.
style.py::09:59:03.222 -- Minimization pass 3 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:08:13
style.py::10:07:24.604 -- Ran 33 iterations in 501s.
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:111: UserWarning: Possible precision loss when converting from float32 to uint8
  "%s to %s" % (dtypeobj_in, dtypeobj))
23 minutes and 45 seconds elapsed.
```

At 524 pixels:

```
style.py::10:26:58.633 -- Starting style transfer.
style.py::10:26:58.805 -- Running net on GPU 0.
style.py::10:26:59.710 -- Successfully loaded images.
style.py::10:27:00.893 -- Successfully loaded model vgg.
style.py::10:27:00.893 -- Minimization pass 1 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:01:56
style.py::10:28:57.870 -- Ran 513 iterations in 117s.
style.py::10:28:57.871 -- Minimization pass 2 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:00:49
style.py::10:29:47.980 -- Ran 129 iterations in 50s.
style.py::10:29:47.982 -- Minimization pass 3 of 3.
Optimizing: 100% ||||||||||||||||||||||||||||||||||||||||||||||||| Time: 0:00:39
style.py::10:30:28.202 -- Ran 33 iterations in 40s.
/usr/local/lib/python2.7/dist-packages/skimage/util/dtype.py:111: UserWarning: Possible precision loss when converting from float32 to uint8
  "%s to %s" % (dtypeobj_in, dtypeobj))
3 minutes and 30 seconds elapsed.
```


Before the change, a 524px image took 10 minutes; now it's down to 3:30. And a 1920px image took around 2 hours; now it's down to 23 minutes. It's an interesting trick you came up with. The results are close, but some areas are messed up. Are the torch implementations doing the same thing? And what exactly are you doing - running a number of iterations at a smaller scale, then scaling up and running another pass?

fzliu commented 8 years ago

Yeah - I do an initial set of updates at a smaller scale, then scale the image up before running more iterations. I'd expect the results for large images to be a bit messed up without some tuning, though.

I'm surprised that it still takes so long on the GPU; I expected runtimes of under a minute for ~500px images. My guess is that this is due to the synchronization between the CPU and GPU for each of the 6 VGG layers being used. I knew there would be overhead here, but I didn't realize it would be that much. A torch-based implementation wouldn't need this synchronization step, since you can add custom loss layers on the fly and do the entire forward-backward pass on the GPU. I'm sure this could be done in Caffe as well, but it would require injecting some custom C++ code for the style loss and building separate "datasets" for each pair of input content and style images, among other things. With that being said, the only two torch implementations I'm familiar with are Kai Sheng's and Justin's. I don't think either of those implementations is doing what I'm doing here, but I'd expect both to run faster with this progressive upscaling.
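
To make the sync point concrete: with the loss computed host-side, each layer's contribution looks roughly like the sketch below (the normalization is illustrative, not the exact form in the repo), and every `.data`/`.diff` access copies an entire blob between the GPU and host memory once per L-BFGS iteration:

```python
import numpy as np

def style_grad(net, layer, target_gram, weight):
    """Illustrative per-layer style loss/gradient, computed on the CPU."""
    act = net.blobs[layer].data[0].copy()        # GPU -> CPU copy
    n_ch = act.shape[0]
    size = act.shape[1] * act.shape[2]
    F = act.reshape(n_ch, size)
    G = F.dot(F.T) / size                        # Gram matrix, on the CPU
    diff = G - target_gram
    loss = weight * (diff ** 2).sum() / 4.0
    grad = weight * diff.dot(F) / size           # gradient w.r.t. the activations
    net.blobs[layer].diff[0] = grad.reshape(act.shape)  # CPU -> GPU copy
    return loss
```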

Could you post some example outputs? For reference, I just tried one of Justin's images and got these results for progressive upscaling:

[image: golden_gate-rain_princess-vgg-content-1e5-512]
[image: golden_gate-starry_night-vgg-content-1e5-512]

versus standard:

[image: rain_princess-golden_gate-vgg-content-1e5-500]
[image: starry_night-golden_gate-vgg-content-1e5-500]

ghost commented 8 years ago

Sometimes the results are even better.

This is with no progressive up-scaling:

[image: result]

This is with progressive up-scaling:

[image: test]

Both are with the same settings though. Playing with -r does make a difference. I'll run a big image and post it as soon as I can.

ghost commented 8 years ago

Also, have you tried the normalized VGG model?

ghost commented 8 years ago

Here's an HD image!

Progressive up-scaling @ 21 minutes:

[image: test2]

No progressive up-scaling @ a painful 138 minutes:

[image: result]


I don't have a super fast CPU (Intel® Core™ i7-3820 @ 3.60GHz), so the sync is probably slowing me down. The progressive up-scaling is a great idea though - very clever. Whenever you need GPU benchmarks, specify the test and I'll run it for you. I also think you should add a form of tv_loss like in Justin's or Kai's implementations; it would greatly improve image quality.
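
For concreteness, a squared TV term and its gradient can be as simple as the numpy sketch below (the exact weighting/normalization in Justin's and Kai's code may differ); the loss and gradient just get added to the content/style terms inside the objective:

```python
import numpy as np

def tv_loss(img, weight):
    """Squared total-variation loss and gradient for a (C, H, W) image."""
    dx = img[:, :, 1:] - img[:, :, :-1]   # horizontal neighbor differences
    dy = img[:, 1:, :] - img[:, :-1, :]   # vertical neighbor differences
    loss = weight * ((dx ** 2).sum() + (dy ** 2).sum())
    grad = np.zeros_like(img)
    grad[:, :, 1:] += 2.0 * weight * dx
    grad[:, :, :-1] -= 2.0 * weight * dx
    grad[:, 1:, :] += 2.0 * weight * dy
    grad[:, :-1, :] -= 2.0 * weight * dy
    return loss, grad
```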

fzliu commented 8 years ago

Awesome results - thanks for generating them! TV denoising (or some other type of smoothing) is on my TODO list. I haven't tried the normalized model, but I think the standard VGG models should be able to generate better results.

As always, feel free to suggest further improvements and/or optimizations.

ghost commented 8 years ago

Hi again. I just ran your old develop branch code on Fedora. Do you know of any reason why it wouldn't multi-thread? It's slow on Fedora.

fzliu commented 8 years ago

Sorry for the late reply (been really busy at work lately). I don't think the code is multi-threaded by default unless you have MKL installed. You're running it on the CPU, correct?

dpaiton commented 8 years ago

Just an update: I tested the GPU flag and it seems to be working fine on my end - not sure if you're still having this issue. Also, as far as the normalized VGG model goes, I tried it but wasn't able to get it to work as well as I wanted. Rescaling the ratio parameter doesn't seem to have an effect; it might be an issue with my model.

ghost commented 8 years ago

I had messed up my environment variables. The code is multi-threaded in CPU mode if you set the environment variables up correctly for MKL or OpenBLAS. Weirdly enough, in GPU mode (-g 0) it's also multi-threaded for me, but only with OpenBLAS, and only on Ubuntu. I don't know why.
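
For anyone who hits the same thing: the thread counts are read when the BLAS library loads, so the variables have to be set before numpy/caffe are imported. Which variables matter depends on your BLAS build; these are the common ones:

```python
import os

# Set before importing numpy/caffe - BLAS reads these at load time.
os.environ["OPENBLAS_NUM_THREADS"] = "8"  # OpenBLAS builds
os.environ["MKL_NUM_THREADS"] = "8"       # MKL builds
os.environ["OMP_NUM_THREADS"] = "8"       # generic OpenMP fallback

import numpy as np  # import only after the variables are set
import caffe
```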

fzliu commented 8 years ago

Hmm, that's strange, but it seems more related to Caffe than anything else. Which Caffe commit hash are you using?

ghost commented 8 years ago

I don't know the hash - I didn't download it through git. All I know is that it's from Aug 27-28, 2015. I can send you a zip if you want.

TomArrow commented 6 years ago

I wanted to try this progressive upscaling, but I don't see a develop branch anymore - only the gram one. Is it still possible to get this?