SHI-Labs / StyleNAT

New flexible and efficient image generation framework that sets new SOTA on FFHQ-256 with FID 2.05, 2022
MIT License

python main.py type=inference #7

Closed minuenergy closed 1 year ago

minuenergy commented 1 year ago

My server runs Linux 18.04 with CUDA toolkit 11.7 and a 2080 Ti.

I use a Docker container (Linux 18.04, CUDA toolkit 11.6, Python 3.10). When I run python main.py type=inference

[screenshot of the error when running inference]

these issues came up. How can I fix this?

stevenwalton commented 1 year ago

These issues can be a bit tricky to solve sometimes (you'll see plenty of similar questions on the StyleGAN repos). In my experience there are three things to check first. I'll try to be as complete as possible.

1. Make sure to clear out the pytorch build cache.

This is under ~/.cache/torch_extensions/ for *nix machines. Here's what mine looks like

(py310) λ ~/ ls .cache/torch_extensions 
bias_act_plugin  fused  py37_cu113  upfirdn2d_plugin
(py310) λ ~/ ls .cache/torch_extensions/py37_cu113 
fused  nattenav_cuda  nattenqkrpb_cuda  upfirdn2d

You can safely delete things in that folder. What's shown was built specifically by StyleNAT (upfirdn2d, bias_act_plugin, and fused come from StyleGAN, and anything with natten comes from neighborhood attention).
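If you'd rather clear it programmatically, here's a minimal sketch (assuming the default cache location; TORCH_EXTENSIONS_DIR overrides it if you've set that variable):

```python
import os
import shutil

# Default PyTorch extension build cache; TORCH_EXTENSIONS_DIR overrides it if set.
cache_dir = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)

# Remove the whole cache so every custom op rebuilds from scratch on the next run.
shutil.rmtree(cache_dir, ignore_errors=True)
print(f"Cleared {cache_dir}")
```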

While we have an option for natten to be built before runtime, I don't think Karras provided a clean way to do this for the StyleGAN ops. You can do it manually by calling the module's init function. Here's the section for bias_act (which appears to be the op failing to build). Just in case, here's the original version, which has a fallback (you'd need to do the same for the other custom ops like fma and upfirdn2d, but this is not preferable). I believe that fallback was taken out in StyleGAN3.
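As a rough sketch of that manual pre-build (the module paths follow the StyleGAN2-ADA layout and may differ in this repo, so treat them as assumptions):

```python
# Hypothetical warm-up script: import the custom-op modules and trigger their
# builds once, before training/inference, so compilation errors surface early.
# Module paths follow the StyleGAN2-ADA layout and may differ here.
from torch_utils.ops import bias_act, upfirdn2d

bias_act._init()    # compiles/loads bias_act_plugin if it isn't cached yet
upfirdn2d._init()   # compiles/loads upfirdn2d_plugin
```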

2. Sometimes pytorch ships a bad cudatoolkit.

Do you know how you installed everything? The currently listed command is conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia. This is the same command I used, but their recommended version changes over time (in fact, it has changed since this code was released). I always rely on torch's official website rather than any other source. Sometimes a fresh install helps (rebuild without the cache; sometimes rebuild the conda environment). Can you also try using CUDA 11.7?
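A quick way to sanity-check what your installed build actually ships (these are standard torch attributes, nothing repo-specific):

```python
import torch

# The CUDA version torch was built against should match the toolkit/nvcc you
# expect (e.g. 11.7); a mismatch here often explains failed extension builds.
print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```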

3. nvcc is frequently an issue.

Do you have multiple versions on your machine? When you install the cudatoolkit you get an nvcc version as well, and that one should probably take priority. I suggest checking which nvcc is found first without your conda environment loaded, and then with it. See my example here.

(base) λ ~/ which nvcc
/usr/bin/nvcc
(base) λ ~/ conda activate py310
(py310) λ ~/ which nvcc
/home/users/swalton2/.anaconda3/envs/py310/bin/nvcc

You can also check what your system has. You could do this all at once by searching /, but I suggest breaking it into multiple commands because that can be slow (searching just these two locations should usually be sufficient, though it might not be; 2> /dev/null throws away errors such as not being able to access a location since you're not sudo):

 $ find /usr -name nvcc 2> /dev/null 
 $ find /opt -name nvcc 2> /dev/null

To prioritize an nvcc version you can reorder your PATH variable. For example I use export PATH="${HOME}/.anaconda3/bin:$PATH" (note that my conda location is different from yours, which is in /opt). This makes your system check anaconda for programs before it checks elsewhere (such as /usr/bin!). Verify with echo $PATH (or by looking at all your environment variables with env). This only lasts for the current terminal session, so it's best to place it in your shell's rc file (like ~/.bashrc or ~/.zshrc). The reason my nvcc location changes in the which output above is because of this export.

Hopefully this fixes it! If not, there are still some things to check. You can import os into some of the torch_utils files and check that the proper version of nvcc is being picked up, or it could be another environment variable.
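A minimal sketch of that check (dropped temporarily into one of the torch_utils op files, or run on its own; CUDA_HOME is just another variable worth inspecting):

```python
import os
import shutil
import subprocess

# Which nvcc does the current PATH resolve to? This is what the extension
# build will use, so it should point at your conda environment's toolkit.
nvcc = shutil.which("nvcc")
print("nvcc on PATH:", nvcc)
print("CUDA_HOME:", os.environ.get("CUDA_HOME"))

if nvcc is not None:
    # Print the compiler version to confirm it matches torch.version.cuda.
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```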

If these things don't work, let me know and we'll dig in a bit further. It may also be useful to look through the issue pages on StyleGAN2 and StyleGAN3, as there will likely be users hitting similar problems.

stevenwalton commented 1 year ago

This is being closed for now due to lack of activity.

Note that I just pushed some changes that may make the inference a bit easier.