JCBrouwer / OptimalTextures

An implementation of "Optimal Textures: Fast and Robust Texture Synthesis and Style Transfer through Optimal Transport"

Code not optimized for GPU #4

Open RahulBhalley opened 1 year ago

RahulBhalley commented 1 year ago

Hi @JCBrouwer, I've been playing with your code. It's really good!

The only issue is that it doesn't seem to be optimized for the GPU: average GPU utilization is only ~40%.

Do you have any suggestions regarding optimizing it for running on GPU?

Regards, Rahul Bhalley

JCBrouwer commented 1 year ago

Hi Rahul, thanks for your interest in the code! I've updated the PyTorch version and swapped over to torch.linalg.eigh like you suggested in #3.

In terms of the performance issues, I believe it's mainly due to the fact that the data is relatively small and so can't saturate the GPU. When using the multi-scale mode the image is first optimized at a smaller resolution and progressively upscaled to the final desired size.

This leads me to believe that the low utilization is primarily due to overhead of repeatedly launching many small CUDA kernels. To me this sounds like an ideal setting for torch's CUDA graphs API.
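
For reference, the basic capture/replay pattern of that API looks roughly like this (a minimal sketch only; `step` is a hypothetical stand-in for one histogram-matching iteration, not the repo's actual code):

```python
import torch

# Hypothetical stand-in for one histogram-matching iteration.
def step(x, w):
    return torch.relu(x @ w)

device = "cuda"
x = torch.randn(1024, 512, device=device)
w = torch.randn(512, 512, device=device)

# Warm up on a side stream so allocations/workspaces exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = step(x, w)
torch.cuda.current_stream().wait_stream(s)

# Capture the step once, then replay it without per-kernel launch overhead.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = step(x, w)

for _ in range(100):
    x.copy_(torch.randn_like(x))  # refresh the static input in place
    g.replay()                    # y now holds step(x, w) for the new x
```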

It might require a bit more detailed profiling to be sure that this is the issue though.

JCBrouwer commented 1 year ago

Alright, I did some quick profiling; it looks like it's not kernel launch overhead, but host operations in general...

[Plots: temporal breakdown, idle time breakdown]

RahulBhalley commented 1 year ago

Woah! A ~90% speedup will make it really fast! I have a few questions:

JCBrouwer commented 1 year ago

The plots are from Holistic Trace Analysis. 'host_wait' is indeed the GPU waiting for the CPU to give it work.
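
For anyone wanting to reproduce those breakdowns, they come from HTA's analyzer; roughly like this (a sketch; the trace directory path is hypothetical):

```python
from hta.trace_analysis import TraceAnalysis

# Point HTA at a directory of PyTorch profiler (Kineto) trace files.
analyzer = TraceAnalysis(trace_dir="./traces")  # hypothetical path

# These two analyses correspond to the plots above.
temporal = analyzer.get_temporal_breakdown()
idle = analyzer.get_idle_time_breakdown()
```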

Looking a bit closer at the actual traces shows that drawing the random rotation dominates the time of each histogram matching iteration. Just replacing the .item() call in there with a .clone() helps a little as it saves a round-trip to host memory, but overall utilization still isn't great. I also tried decorating the function with @torch.jit.script but it didn't help that much either. The trace of this function is still pre-dominantly CPU operations even though the device is correctly specified as 'cuda' as far as I can tell. I wonder if there's some way to vectorize this operation?
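
One option (just a sketch of a possible replacement, not what the repo does today) is to draw the whole rotation matrix on the GPU in one shot via a QR decomposition, so nothing round-trips to the host:

```python
import torch

def random_rotation_gpu(n: int, device: str = "cuda") -> torch.Tensor:
    """Sketch: draw an n x n rotation matrix entirely on the GPU.

    QR of a Gaussian matrix gives a Haar-distributed orthogonal matrix
    once the signs of R's diagonal are absorbed into Q's columns.
    """
    a = torch.randn(n, n, device=device)
    q, r = torch.linalg.qr(a)
    q = q * torch.sign(torch.diagonal(r))          # uniform over O(n)
    q[:, 0] = q[:, 0] * torch.sign(torch.det(q))   # force det == +1 without a host sync
    return q
```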

Another small improvement is using the 'chol' histogram matching method instead of 'pca'. Doing a cholesky decomposition is quite a bit faster than running the eigenvalue solver.
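
Roughly, the two decompositions being compared look like this (an illustrative sketch with made-up sizes, not the repo's histmatch.py code):

```python
import torch

feat = torch.randn(4096, 256, device="cuda")  # illustrative (N, C) feature matrix
cov = feat.T @ feat / feat.shape[0]           # (C, C) covariance

# 'chol': a single triangular factor with L @ L.T == cov -- comparatively cheap.
L = torch.linalg.cholesky(cov)

# 'pca' / 'sym': a full eigendecomposition to build the symmetric square root.
eva, eve = torch.linalg.eigh(cov)
cov_sqrt = eve @ torch.diag(eva.clamp_min(0).sqrt()) @ eve.T
```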

One last thing that helped quite a bit for me is to set torch.backends.cudnn.benchmark = False. This is because the implementation repeatedly cycles through forward passes at different resolutions which requires the cudnn autotuner to re-run every time for just a single forward pass.
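
In code this is just one line up front; the point is that the autotuner never gets to amortize its benchmarking cost when every pass runs at a new resolution:

```python
import torch

# Each pass presents new input shapes, so the cuDNN autotuner would
# re-benchmark for a single forward pass -- pure overhead in this setting.
torch.backends.cudnn.benchmark = False
```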

I also tried cutting out some of the encode/decode steps which are happening at the beginning and end of each pass, but it seems like the feature inverters are actually separately trained for each depth they invert from, so this ruins the quality of results.

You can see some of the things I tried in this branch.

JCBrouwer commented 1 year ago

Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.

RahulBhalley commented 1 year ago

Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.

I deeply apologize for not replying. I got a little sick right after opening this issue. I'll surely test it out & let you know. Thank you for doing all this. :)

RahulBhalley commented 1 year ago

Not sure how much you changed the code, but my first script run fails to converge. I used the same arguments as before and also tried changing the seed. Now I'll just start from where you started (profiling the previous code) and then make changes to the code slowly.

Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Traceback (most recent call last):
  File "[/workspace/OptimalTextures/optex.py](https://file+.vscode-resource.vscode-cdn.net/workspace/OptimalTextures/optex.py)", line 283, in <module>
    pastiche = texturizer.forward(pastiche, styles, content, verbose=True)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "[/workspace/OptimalTextures/optex.py](https://file+.vscode-resource.vscode-cdn.net/workspace/OptimalTextures/optex.py)", line 112, in forward

                for _ in range(self.iters_per_pass_and_layer[p][l - 1]):
                    pastiche_feature = optimal_transport(pastiche_feature, style_features[l], self.hist_mode)
                                       ~~~~~~~~~~~~~~~~~ <--- HERE

                    if len(content_features) > 0 and l >= 2:  # apply content matching step
  File "[/workspace/OptimalTextures/optex.py](https://file+.vscode-resource.vscode-cdn.net/workspace/OptimalTextures/optex.py)", line 168, in optimal_transport
    rotated_style = style_feature @ rotation

    matched_pastiche = hist_match(rotated_pastiche, rotated_style, mode=hist_mode)
                       ~~~~~~~~~~ <--- HERE

    pastiche_feature = matched_pastiche @ rotation.T  # rotate back to normal
  File "[/workspace/OptimalTextures/histmatch.py](https://file+.vscode-resource.vscode-cdn.net/workspace/OptimalTextures/histmatch.py)", line 37, in hist_match

        else:  # mode == "sym"
            eva_t, eve_t = torch.linalg.eigh(cov_t, UPLO="U")
                           ~~~~~~~~~~~~~~~~~ <--- HERE
            Qt = eve_t @ torch.sqrt(torch.diag(eva_t)) @ eve_t.T
            Qt_Cs_Qt = Qt @ cov_s @ Qt
RuntimeError: linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 301).

JCBrouwer commented 1 year ago

Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)

One thing you can do to help with convergence is to increase the eps argument of hist_match(). I've just pushed another small update which is even a little faster on my machine (with eps bumped up quite a bit).
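
Concretely, that kind of eps typically just pads the covariance's diagonal before the decomposition, which keeps linalg.eigh (and cholesky) away from the ill-conditioned regime in the traceback above. A sketch of the idea (not the repo's exact hist_match code; the value is illustrative):

```python
import torch

eps = 1e-2  # illustrative; raise it if eigh/cholesky still fail to converge
cov_t = torch.randn(64, 64, device="cuda")
cov_t = cov_t @ cov_t.T / 64                              # stand-in covariance
cov_t = cov_t + eps * torch.eye(64, device=cov_t.device)  # diagonal padding

eva_t, eve_t = torch.linalg.eigh(cov_t, UPLO="U")         # now well-conditioned
Qt = eve_t @ torch.diag(eva_t.clamp_min(0).sqrt()) @ eve_t.T
```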

Profiler is still showing random_rotation() as the bottleneck, but I'm just not sure how to make that more efficient.

RahulBhalley commented 1 year ago

Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)

Okay, did try that. But the results are now inferior to your previous code (before I pinged you).

I used the same command for style transfer: python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.2 --hist chol --seed 0.

Synthesis with the previous source code: [image: lava-small_rocket_strength0.2_cholhist_512]

Synthesis with the current modification you made: [image: lava-small_rocket_strength0.2_cholhist_512]

I am still reading the paper (started today) so I am far from understanding the code. But I will, soon.

RahulBhalley commented 1 year ago

One thing you can do to help with convergence is to increase the eps argument of hist_match(). I've just pushed another small update which is even a little faster on my machine (with eps bumped up quite a bit).

How much time does it take, and at what resolution? For me, these took 36 s (previous code) and 34.5 s (current code). I didn't do multiple runs, so it's not an average.

JCBrouwer commented 1 year ago

My bad, I missed swapping the if statement's condition when I reversed the for loop's direction.

For me the original code was taking about 30 seconds for the simple texture synthesis case and now is around 11 seconds on a 1080 ti.

I haven't been testing the style transfer case though (as is apparent by the error you just encountered). I guess I should write a little test suite...

RahulBhalley commented 1 year ago

I haven't been testing the style transfer case though (as is apparent by the error you just encountered). I guess I should write a little test suite...

Interesting, even on my side the texture was synthesized correctly.

OMG, the texture is very heavy & large scale now.

[image: lava-small_rocket_strength0.2_cholhist_512]

RahulBhalley commented 1 year ago

Now, I'm also unable to push the resolution above 1024.

Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 1, size 512
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 2, size 768
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 3, size 1024
Layer: relu5_1
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /workspace/OptimalTextures/optex.py:283  │
│ in <module>                                                                                      │
│                                                                                                  │
│   280 │   │   from time import time                                                              │
│   281 │   │                                                                                      │
│   282 │   │   t = time()                                                                         │
│ ❱ 283 │   │   pastiche = texturizer.forward(pastiche, styles, content, verbose=True)             │
│   284 │   │   print("Took:", time() - t)                                                         │
│   285 │                                                                                          │
│   286 │   save_image(pastiche, args)                                                             │
│ /workspace/OptimalTextures/optex.py:116  │
│ in forward                                                                                       │
│                                                                                                  │
│   113 │   │   │   │   │                                                                          │
│   114 │   │   │   │   │   if len(content_features) > 0 and l <= 2:  # apply content matching s   │
│   115 │   │   │   │   │   │   strength = self.content_strength / 2 ** (4 - l)  # 1, 2, or 4 de   │
│ ❱ 116 │   │   │   │   │   │   pastiche_feature += strength * (content_features[l] - pastiche_f   │
│   117 │   │   │   │                                                                              │
│   118 │   │   │   │   if self.use_pca:                                                           │
│   119 │   │   │   │   │   pastiche_feature = pastiche_feature @ style_eigvs[l].T  # reverse pr   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (80) must match the size of tensor b (64) at non-singleton dimension 2

JCBrouwer commented 1 year ago

Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image?

For me the following is working fine on the current main branch.

python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448

[image: lava-small_rocket_strength0.5_cholhist_1448]

RahulBhalley commented 1 year ago

Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image?

For me the following is working fine on the current main branch.


python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448

[image: lava-small_rocket_strength0.5_cholhist_1448]

It could be something wrong on my end if yours is working fine. I won't ping you again until I understand the whole paper and your code; I don't want to take up your time, and you might be busy elsewhere. :) Thanks for your help, by the way.