Creating environment results in not being able to train

PDillis / stylegan3-fun

Modifications of the official PyTorch implementation of StyleGAN3. Let's easily generate images and videos with StyleGAN2/2-ADA/3!

Creating environment results in not being able to train #7

Closed ZibbeZabbe closed 2 years ago

ZibbeZabbe commented 2 years ago

Describe the bug: Creating the environment results in the CPU build of PyTorch being downloaded. The addition of OpenAI's CLIP results in torch 1.7.1 being downloaded; I am unsure whether that is what caused the CPU version to be installed.

To Reproduce: I ran "conda clean -a" and "pip cache purge", then attempted to build the environment. With the resulting environment I could not train using "python train.py --outdir=C:\AI\output\stylegan --cfg=stylegan3-r --data=C:\AI\data\data-512x512.zip --gpus=1 --batch=12 --gamma=8.2 --mirror=1" or similar commands.

Expected behavior: Running train.py does not error out.

Desktop (please complete the following information):

OS: Windows 11
PyTorch version: 1.7.1
CUDA toolkit version: 11.3
NVIDIA driver version: 511.79
GPU: RTX 3090
Docker: no
Anaconda: Miniconda

ZibbeZabbe commented 2 years ago

Issue addressed in PR #8

PDillis commented 2 years ago

Thanks for letting me know of this issue! I'll try with your fix, and see if all the code is correctly executed, but it looks like everything should still work.

PDillis commented 2 years ago

I've found a solution that worked on both my Windows 10 and Ubuntu 18.04 machines: only use the default channels and specify the channel for each dependency, like so:

name: stylegan3
channels:
  - defaults
dependencies:
  - python >= 3.8
  - pip
  - numpy>=1.20
  - click>=8.0
  - pillow=8.3.1
  - scipy=1.7.1
  - pytorch::pytorch>=1.9.1
  - nvidia::cudatoolkit>=11.1  # PR #116 by @edstoica
  - requests=2.26.0
  - tqdm=4.62.2
  - ninja=1.10.2
  - matplotlib=3.4.2
  - imageio=2.9.0
  - pip:
    - imgui==1.3.0
    - glfw==2.2.0
    - pyopengl==3.1.5
    - imageio-ffmpeg==0.4.3
    - pyspng
    - psutil  # PR #125 by @fastflair / #111 by @siddharthksah
    - tensorboard  # PR #125 by @fastflair
    - moviepy==1.0.3
    - ffmpeg-python==0.2.0
    - scikit-video==1.1.11
    - setuptools==59.5.0

Test it out and let me know if it works for you. If it does, you can change it on your PR and I'll accept it. Thanks again for pointing out the bad environment creation!

ZibbeZabbe commented 2 years ago

Using pytorch::pytorch>=1.9.1 resulted in version 1.11.0 for me, which has the issue described in #145 in the NVlabs repository.

Specifying <=1.10.2 should solve that issue. Unfortunately, with the defaults channel I still do not get the CUDA build of PyTorch (this only shows up when attempting to train).

As such, specifying pytorch=1.10.2=py3.9_cuda11.3_cudnn8_0 has been the only reliable way I have found to ensure that the CUDA-compiled PyTorch is installed. It's not a perfect solution, as it may not work with other versions of Python, but it is functional.
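Since the missing CUDA build only shows up when training starts, a quick check right after creating the environment can catch it earlier. The following is a minimal sketch, not part of the repository: the helper name describe_torch_build is hypothetical, and it relies only on torch.__version__, torch.version.cuda (which is None for CPU-only builds) and torch.cuda.is_available().

```python
import importlib


def describe_torch_build():
    """Report whether the installed PyTorch was compiled with CUDA support.

    Hypothetical helper for illustration; it is not part of stylegan3-fun.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in the active environment at all.
        return {
            "installed": False,
            "version": None,
            "cuda_build": None,
            "cuda_available": False,
        }

    cuda_build = torch.version.cuda  # None for a CPU-only build
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": cuda_build,
        # A CUDA build can still report False here, e.g. with a driver problem.
        "cuda_available": cuda_build is not None and torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_build().items():
        print(f"{key}: {value}")
```

If cuda_build comes back as None inside the activated environment, conda most likely resolved the CPU-only package, which matches the behaviour described above.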

PDillis commented 2 years ago

Fixed in: 2d0a7c2

In short: thanks to the latest fix in the NVlabs repository for NVlabs#145, I have also set cudatoolkit=11.1 in environment.yml, and the environment is now created correctly on both Windows and Ubuntu 18.04. I've tested the code and we can generate images/videos, as well as train with it, so let me know if there's anything else to fix!

nuclearsugar commented 1 year ago

The environment will not build when starting from a clean slate.

Within environment.yml, changing nvidia::cudatoolkit=11.3 to cudatoolkit=11.3 allowed conda to build the environment.
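To make the change above concrete, here is a small sketch of how the "channel::" prefix relates to the rest of a dependency string. The helpers split_channel and drop_channel are hypothetical and only do plain string handling; they are not conda's own parser and assume the simple "channel::name..." form used in environment.yml.

```python
import re


def split_channel(spec):
    """Split 'channel::package...' into (channel, rest); channel is None if absent."""
    spec = spec.strip()
    channel, sep, rest = spec.partition("::")
    if not sep:
        return None, spec
    return channel, rest


def drop_channel(spec, package):
    """Remove the channel prefix from spec, but only if it names the given package."""
    _, rest = split_channel(spec)
    # The package name ends at the first version operator or space.
    name = re.split(r"[=<>! ]", rest, maxsplit=1)[0]
    return rest if name == package else spec.strip()


if __name__ == "__main__":
    print(drop_channel("nvidia::cudatoolkit=11.3", "cudatoolkit"))
    print(drop_channel("pytorch::pytorch>=1.9.1", "cudatoolkit"))
```

Without the prefix, conda is free to resolve cudatoolkit from whichever channels are listed under channels: in the file.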