NVlabs / stylegan2-ada-pytorch

StyleGAN2-ADA - Official PyTorch implementation
https://arxiv.org/abs/2006.06676

CUDA error: misaligned address #264

Closed: neilthefrobot closed this issue 2 years ago

neilthefrobot commented 2 years ago

Describe the bug
When projecting an image to latent space, I get "CUDA error: misaligned address". Generating images works fine.
Edit: This also happens during training, but at random.

To Reproduce
Steps to reproduce the behavior: run the command
python projector.py --outdir=out --target=img.jpg --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

Loading networks from "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl"...
projector.py:172: DeprecationWarning: LANCZOS is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.LANCZOS instead.
  target_pil = target_pil.resize((G.img_resolution, G.img_resolution), PIL.Image.LANCZOS)
Computing W midpoint and stddev using 10000 samples...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.
step 1/1000: dist 0.68 loss 24567.47
step 2/1000: dist 0.74 loss 27640.90
step 3/1000: dist 0.66 loss 27166.95
step 4/1000: dist 0.81 loss 26253.69
step 5/1000: dist 0.70 loss 24957.32
step 6/1000: dist 0.67 loss 23353.60
step 7/1000: dist 0.69 loss 21512.90
step 8/1000: dist 0.68 loss 19486.20
step 9/1000: dist 0.67 loss 17338.49
step 10/1000: dist 0.69 loss 15140.59
step 11/1000: dist 0.64 loss 12948.86
step 12/1000: dist 0.67 loss 10819.47
step 13/1000: dist 0.66 loss 8801.30
step 14/1000: dist 0.67 loss 6947.03
step 15/1000: dist 0.64 loss 5315.52
step 16/1000: dist 0.66 loss 3970.58
step 17/1000: dist 0.65 loss 2943.32
step 18/1000: dist 0.65 loss 2214.79
step 19/1000: dist 0.65 loss 1760.58
step 20/1000: dist 0.63 loss 1566.97
step 21/1000: dist 0.63 loss 1602.20
step 22/1000: dist 0.65 loss 1787.60
step 23/1000: dist 0.65 loss 2053.56
step 24/1000: dist 0.62 loss 2326.88
step 25/1000: dist 0.64 loss 2537.69
step 26/1000: dist 0.65 loss 2638.91
step 27/1000: dist 0.63 loss 2605.35
step 28/1000: dist 0.63 loss 2477.48
step 29/1000: dist 0.62 loss 2316.19
step 30/1000: dist 0.63 loss 2121.47
step 31/1000: dist 0.63 loss 1883.73
step 32/1000: dist 0.62 loss 1626.40
step 33/1000: dist 0.61 loss 1384.59
step 34/1000: dist 0.62 loss 1182.10
step 35/1000: dist 0.62 loss 1026.42
step 36/1000: dist 0.62 loss 905.88
step 37/1000: dist 0.63 loss 825.34
step 38/1000: dist 0.62 loss 808.28
step 39/1000: dist 0.61 loss 818.23
step 40/1000: dist 0.61 loss 831.69
step 41/1000: dist 0.61 loss 828.77
step 42/1000: dist 0.61 loss 768.90
step 43/1000: dist 0.61 loss 654.59
step 44/1000: dist 0.61 loss 528.88
step 45/1000: dist 0.60 loss 410.70
step 46/1000: dist 0.61 loss 321.64
step 47/1000: dist 0.61 loss 294.12
step 48/1000: dist 0.60 loss 280.07
step 49/1000: dist 0.60 loss 263.28
step 50/1000: dist 0.61 loss 344.50
step 51/1000: dist 0.59 loss 375.47
step 52/1000: dist 0.61 loss 404.35
step 53/1000: dist 0.60 loss 396.30
step 54/1000: dist 0.61 loss 347.72
step 55/1000: dist 0.61 loss 279.30
step 56/1000: dist 0.60 loss 200.56
step 57/1000: dist 0.60 loss 128.34
step 58/1000: dist 0.60 loss 73.80
step 59/1000: dist 0.60 loss 61.11
step 60/1000: dist 0.59 loss 47.86
step 61/1000: dist 0.60 loss 65.66
step 62/1000: dist 0.59 loss 84.21
step 63/1000: dist 0.59 loss 114.19
Traceback (most recent call last):
  File "projector.py", line 210, in <module>
    run_projection() # pylint: disable=no-value-for-parameter
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "projector.py", line 182, in run_projection
    verbose=True
  File "projector.py", line 120, in project
    logprint(f'step {step+1:>4d}/{num_steps}: dist {dist:<4.2f} loss {float(loss):<5.2f}')
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\torch\tensor.py", line 534, in __format__
    return self.item().__format__(format_spec)
RuntimeError: CUDA error: misaligned address

And here is the error when transfer learning on a new dataset:

tick 0 kimg 0.0 time 1m 10s sec/tick 4.9 sec/kimg 1213.28 maintenance 65.1 cpumem 3.03 gpumem 7.46 augment 0.000
Traceback (most recent call last):
  File "train.py", line 538, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 531, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 383, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\training\training_loop.py", line 284, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\training\loss.py", line 98, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(gen_z, gen_c, sync=False)
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\training\loss.py", line 40, in run_G
    ws = self.G_mapping(z, c)
  File "C:\Users\imsog\Anaconda3\envs\torchenv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\training\networks.py", line 220, in forward
    x = normalize_2nd_moment(z.to(torch.float32))
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\torch_utils\misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "C:\Users\imsog\PycharmProjects\pyTorch\stylegan2-ada-pytorch-main\training\networks.py", line 22, in normalize_2nd_moment
    return x * (x.square().mean(dim=dim, keepdim=True) + eps).rsqrt()
RuntimeError: CUDA error: misaligned address
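
Both tracebacks end on ordinary Python lines (formatting the loss value, a simple normalization), which is common with CUDA errors: kernels are launched asynchronously, so the failure tends to be reported at a later synchronization point rather than at the kernel that actually faulted. A minimal sketch of re-running the projector command from this report with the CUDA_LAUNCH_BLOCKING environment variable set, so the traceback is more likely to point at the failing call. The helper names blocking_env and run_projector_blocking are hypothetical and not part of the repository, and launching through subprocess is only one way to set the variable.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch wait for completion,
    # so errors are raised near the call that caused them (at a speed cost).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Command copied from the report above; adjust paths to your own checkout.
PROJECTOR_CMD = [
    sys.executable, "projector.py",
    "--outdir=out",
    "--target=img.jpg",
    "--network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl",
]


def run_projector_blocking():
    """Run projector.py with blocking launches and return its exit code."""
    return subprocess.run(PROJECTOR_CMD, env=blocking_env()).returncode
```

Calling run_projector_blocking() from the repository root would repeat the failing run. This only changes where the error is reported; it does not fix the underlying fault.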


neilthefrobot commented 2 years ago

I lowered my overclock settings and now it works fine. This is really strange to me, since I've used the same OC settings for years of gaming and deep learning with no issues, but StyleGAN simply won't run until I lower them.
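
Since lowering the overclock resolved it, one rough way to check whether a given setting is stable for compute work is to repeat an identical GPU computation many times and compare the results. A marginal overclock may show up as differing results or as a CUDA error partway through. The sketch below assumes PyTorch built with CUDA and NumPy installed; find_mismatches and gpu_matmul_digest are hypothetical names, and the matrix size and run count are arbitrary. Repeated runs of the same matmul on the same device usually give bitwise-identical output, though that is not strictly guaranteed, and a clean pass does not prove the hardware is stable.

```python
import hashlib


def find_mismatches(compute, runs=20):
    """Call compute() `runs` times; return indices of runs that differ from the first."""
    reference = compute()
    return [i for i in range(1, runs) if compute() != reference]


def gpu_matmul_digest(size=4096, seed=0):
    """Hash the result of a fixed, seeded matmul on the GPU (assumes PyTorch with CUDA)."""
    import torch  # imported lazily so find_mismatches works without PyTorch

    gen = torch.Generator(device="cuda").manual_seed(seed)
    a = torch.randn(size, size, device="cuda", generator=gen)
    b = torch.randn(size, size, device="cuda", generator=gen)
    out = a @ b
    torch.cuda.synchronize()  # surface any asynchronous CUDA error here
    # Compare exact bytes on the CPU so that any flipped bit is detected.
    return hashlib.sha256(out.cpu().numpy().tobytes()).hexdigest()
```

For example, find_mismatches(gpu_matmul_digest, runs=50) returning an empty list means all 50 runs agreed; a non-empty list or a RuntimeError would be a reason to suspect the clock or memory settings.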