Open ProGamerGov opened 2 years ago
I got a bit further by doing:
huggingface-cli login
accelerate launch train_dreambooth.py --save_sample_prompt "a photo of sks <concept>" --pretrained_model_name_or_path "v1-5-pruned-emaonly.ckpt" --instance_data_dir "training_images" --class_data_dir "<concept>" --output_dir "text-inversion-model" --with_prior_preservation --prior_loss_weight 1.0 --instance_prompt "photo of sks <concept>" --class_prompt "<concept>" --seed 1337 --resolution 512 --train_batch_size 1 --train_text_encoder --mixed_precision "no" --gradient_accumulation_steps 1 --learning_rate 1e-6 --lr_scheduler "constant" --lr_warmup_steps 0 --num_class_images 2000 --sample_batch_size 4 --max_train_steps 15000 --save_interval 500 --pretrained_vae_name_or_path "vae-ft-ema-560000-ema-pruned.ckpt"
But it still ends up doing nothing, with no indication of what's wrong.
The script just hangs, without any indication of errors or progress:
user@instance-1:~$ accelerate launch train_dreambooth.py --save_sample_prompt "a photo of sks <concept>" --pretrained_model_name_or_path "runwayml/stable-diffusion-v1-5" --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-ema" --instance_data_dir "training_images" --class_data_dir "<concept>" --output_dir "text-inversion-model" --with_prior_preservation --prior_loss_weight 1.0 --instance_prompt "photo of sks <concept>" --class_prompt "<concept>" --seed 1337 --resolution 512 --train_batch_size 1 --train_text_encoder --mixed_precision "no" --gradient_accumulation_steps 1 --learning_rate 1e-6 --lr_scheduler "constant" --lr_warmup_steps 0 --num_class_images 2000 --sample_batch_size 4 --max_train_steps 15000 --save_interval 500
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `6` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[!] Not using xformers memory efficient attention.
/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py:179: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
This is what I have installed on the instance:
Collecting environment information...
PyTorch version: 1.11.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: version 3.24.1
Libc version: glibc-2.10
Python version: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.19.0-20-cloud-amd64-x86_64-with-debian-10.13
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to:
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] torch==1.11.0
[pip3] torch-xla==1.11
[pip3] torchvision==0.12.0+cu113
[conda] blas 2.115 mkl conda-forge
[conda] blas-devel 3.9.0 15_linux64_mkl conda-forge
[conda] cudatoolkit 11.3.1 ha36c431_9 nvidia
[conda] dlenv-pytorch-1-11-gpu 1.0.20220630 py37hc1c1d6d_0 file:///tmp/conda-pkgs
[conda] libblas 3.9.0 15_linux64_mkl conda-forge
[conda] libcblas 3.9.0 15_linux64_mkl conda-forge
[conda] liblapack 3.9.0 15_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 15_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.19.5 py37h3e96413_3 conda-forge
[conda] pytorch 1.11.0 py3.7_cuda11.3_cudnn8.2.0_0 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchvision 0.12.0+cu113 pypi_0 pypi
pip list output:
Package Version
------------------------------------- -----------------
accelerate 0.12.0
aiohttp 3.8.1
aiosignal 1.2.0
ansiwrap 0.8.4
anyio 3.6.1
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.2
asn1crypto 1.5.1
async-timeout 4.0.2
asynctest 0.13.0
attrs 21.4.0
Babel 2.10.3
backcall 0.2.0
backports.functools-lru-cache 1.6.4
bcrypt 4.0.1
beatrix-jupyterlab 3.1.7
beautifulsoup4 4.11.1
binaryornot 0.4.4
bitsandbytes 0.35.0
black 22.6.0
bleach 5.0.1
blinker 1.4
brotlipy 0.7.0
cachetools 5.0.0
certifi 2022.6.15
cffi 1.15.0
chardet 5.0.0
charset-normalizer 2.1.0
click 8.1.3
cloudpickle 2.1.0
cmake 3.24.1.1
colorama 0.4.5
conda 4.13.0
conda-package-handling 1.8.1
cookiecutter 2.1.1
cryptography 37.0.2
cycler 0.11.0
dataclasses 0.8
debugpy 1.6.0
decorator 5.1.1
defusedxml 0.7.1
diffusers 0.7.0.dev0
docker 5.0.3
docker-pycreds 0.4.0
entrypoints 0.4
fastapi 0.85.1
fastjsonschema 2.15.3
ffmpy 0.3.0
filelock 3.8.0
flit_core 3.7.1
fonttools 4.33.3
frozenlist 1.3.0
fsspec 2022.5.0
ftfy 6.1.1
gcsfs 2022.5.0
gitdb 4.0.9
GitPython 3.1.27
google-api-core 2.8.1
google-api-python-client 2.52.0
google-auth 2.9.0
google-auth-httplib2 0.1.0
google-auth-oauthlib 0.5.2
google-cloud-aiplatform 1.15.0
google-cloud-appengine-logging 1.1.2
google-cloud-audit-log 0.2.2
google-cloud-bigquery 2.34.4
google-cloud-bigquery-storage 2.13.2
google-cloud-bigtable 2.10.1
google-cloud-core 2.3.1
google-cloud-dataproc 4.0.3
google-cloud-datastore 2.7.1
google-cloud-firestore 2.5.3
google-cloud-kms 2.11.2
google-cloud-language 2.4.3
google-cloud-logging 3.1.2
google-cloud-monitoring 2.9.2
google-cloud-pubsub 1.7.0
google-cloud-resource-manager 1.5.1
google-cloud-scheduler 2.6.4
google-cloud-spanner 3.15.1
google-cloud-speech 2.14.1
google-cloud-storage 2.4.0
google-cloud-tasks 2.9.1
google-cloud-translate 3.7.4
google-cloud-videointelligence 2.7.1
google-cloud-vision 2.7.3
google-crc32c 1.1.2
google-resumable-media 2.3.3
googleapis-common-protos 1.56.3
gradio 3.6
greenlet 1.1.2
grpc-google-iam-v1 0.12.4
grpcio 1.47.0
grpcio-gcp 0.2.2
grpcio-status 1.47.0
h11 0.12.0
htmlmin 0.1.12
httpcore 0.15.0
httplib2 0.20.4
httpx 0.23.0
huggingface-hub 0.10.1
idna 3.3
ImageHash 4.2.1
importlib-metadata 4.11.4
importlib-resources 5.8.0
ipykernel 6.15.0
ipython 7.33.0
ipython-genutils 0.2.0
ipython-sql 0.3.9
jedi 0.18.1
jeepney 0.8.0
Jinja2 3.1.2
jinja2-time 0.2.0
joblib 1.1.0
json5 0.9.5
jsonschema 4.6.1
jupyter-client 7.3.4
jupyter-core 4.10.0
jupyter-http-over-ws 0.0.8
jupyter-server 1.18.0
jupyter-server-mathjax 0.2.5
jupyter-server-proxy 3.2.1
jupyterlab 3.2.9
jupyterlab-git 0.37.1
jupyterlab-pygments 0.2.2
jupyterlab-server 2.14.0
jupytext 1.13.8
keyring 23.6.0
keyrings.google-artifactregistry-auth 1.0.0
kiwisolver 1.4.3
kubernetes 24.2.0
linkify-it-py 1.0.3
llvmlite 0.38.1
Markdown 3.3.7
markdown-it-py 2.1.0
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.3
mdit-py-plugins 0.3.0
mdurl 0.1.0
missingno 0.4.2
mistune 0.8.4
multidict 6.0.2
multimethod 1.4
munkres 1.1.4
mypy-extensions 0.4.3
nb-conda 2.2.1
nb-conda-kernels 2.3.1
nbclassic 0.3.7
nbclient 0.6.5
nbconvert 6.5.0
nbdime 3.1.1
nbformat 5.4.0
nest-asyncio 1.5.5
networkx 2.7.1
notebook 6.4.12
notebook-executor 0.2
notebook-shim 0.1.0
numba 0.55.2
numpy 1.19.5
oauthlib 3.2.0
orjson 3.8.0
packaging 21.3
pandas 1.3.5
pandas-profiling 3.2.0
pandocfilters 1.5.0
papermill 2.3.4
paramiko 2.11.0
parso 0.8.3
pathspec 0.9.0
patsy 0.5.2
pexpect 4.8.0
phik 0.12.2
pickleshare 0.7.5
Pillow 9.1.1
pip 22.1.2
platformdirs 2.5.1
pluggy 1.0.0
prettytable 3.3.0
prometheus-client 0.14.1
prompt-toolkit 3.0.30
proto-plus 1.20.6
protobuf 3.20.1
psutil 5.9.1
ptyprocess 0.7.0
pyarrow 8.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.7
pycosat 0.6.3
pycparser 2.21
pycryptodome 3.15.0
pydantic 1.9.1
pydub 0.25.1
Pygments 2.12.0
PyJWT 2.4.0
PyNaCl 1.5.0
pyOpenSSL 22.0.0
pyparsing 3.0.9
pyrsistent 0.18.1
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.5
python-slugify 6.1.2
pytz 2022.1
pyu2f 0.1.5
PyWavelets 1.3.0
PyYAML 6.0
pyzmq 23.2.0
regex 2022.9.13
requests 2.28.1
requests-oauthlib 1.3.1
retrying 1.3.3
rfc3986 1.5.0
rsa 4.8
ruamel-yaml-conda 0.15.100
scikit-learn 1.0.2
scipy 1.7.3
seaborn 0.11.2
SecretStorage 3.3.2
Send2Trash 1.8.0
setuptools 59.8.0
simpervisor 0.4
six 1.16.0
smmap 3.0.5
sniffio 1.2.0
soupsieve 2.3.1
SQLAlchemy 1.4.39
sqlparse 0.4.2
starlette 0.20.4
statsmodels 0.13.2
tangled-up-in-unicode 0.2.0
tenacity 8.0.1
terminado 0.15.0
text-unidecode 1.3
textwrap3 0.9.2
threadpoolctl 3.1.0
tinycss2 1.1.1
tokenizers 0.13.1
toml 0.10.2
tomli 2.0.1
torch 1.11.0
torch-xla 1.11
torchvision 0.12.0+cu113
tornado 6.1
tqdm 4.64.0
traitlets 5.3.0
transformers 4.23.1
triton 2.0.0.dev20221014
typed-ast 1.5.4
typing_extensions 4.2.0
uc-micro-py 1.0.1
ujson 5.3.0
unicodedata2 14.0.0
Unidecode 1.3.4
uritemplate 4.1.1
urllib3 1.26.9
uvicorn 0.19.0
visions 0.7.4
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.3.3
websockets 10.3
wheel 0.37.1
wrapt 1.14.1
yarl 1.7.2
zipp 3.8.0
@ShivamShrirao Any ideas on why it doesn't work?
Can't say. Btw you should install xformers. To check where script is hanging, press ctrl+C. The traceback will show where it was stuck.
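If pressing Ctrl+C is not practical (for example on an unattended run), Python's standard-library faulthandler module can dump the stack of every thread once a timeout expires. The following is only a minimal sketch and is not part of train_dreambooth.py; run_with_hang_watchdog and slow_step are hypothetical names, and the very short timeout is just for demonstration:

```python
import faulthandler
import tempfile
import time


def run_with_hang_watchdog(fn, timeout_s, log_file):
    """Run fn(); while it is running, dump all thread stacks to log_file
    every timeout_s seconds so a hang can be located afterwards."""
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    try:
        return fn()
    finally:
        # Stop the watchdog once the call has returned (or raised).
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Stand-in for a call that appears to hang, e.g. set_seed(...)
    time.sleep(0.5)
    return "done"


# faulthandler needs a real file descriptor, so use a temporary file.
with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_hang_watchdog(slow_step, timeout_s=0.1, log_file=log)
    log.seek(0)
    dump = log.read()

print(result)
print(dump.splitlines()[0])  # first line of the first stack dump
```

In a real run one would probably use a timeout of a few minutes and leave the file argument at its default (sys.stderr), so the dump shows up in the terminal next to the other output.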
@ShivamShrirao I did some more testing, and it looks like it hangs on the following lines for some reason:
if args.seed is not None:
    set_seed(args.seed)
When I omitted the seed parameter, everything worked.
Without using the seed parameter, it makes it up to this line before it stops working again:
accelerator.backward(loss)
I tried looking for similar issues:
https://github.com/huggingface/accelerate/issues/287
https://github.com/huggingface/accelerate/issues/191
But I'm not sure why it's hanging on this line for this repo.
These are the parameters that I'm using:
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export VAE_NAME="stabilityai/sd-vae-ft-mse"
export INSTANCE_DIR="concept_images"
export CLASS_DIR="class_reg_images"
export OUTPUT_DIR="path-to-save-model"
accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_name_or_path=$VAE_NAME \
--instance_data_dir=$INSTANCE_DIR \
--class_data_dir=$CLASS_DIR \
--output_dir=$OUTPUT_DIR \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="a photo of sks <concept>" \
--class_prompt="<concept>" \
--save_sample_prompt="photo of sks <concept>" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--num_class_images=1290 \
--save_interval=500 \
--max_train_steps=25000 \
--train_text_encoder \
--mixed_precision="no" \
--not_cache_latents
Edit:
This issue may be related? https://github.com/pytorch/pytorch/issues/85841
Regarding installing xformers: I need to find a pre-compiled xformers binary for the A100 40GB card first.
Edit:
I just tried using this version of xformers and got the same issue:
pip install -q https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl
user@instance-1:~$ sh launch.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `6` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
WARNING:root:A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
Downloading: 100%|██████████| 1.06M/1.06M [00:00<00:00, 2.89MB/s]
Downloading: 100%|██████████| 525k/525k [00:00<00:00, 2.18MB/s]
Downloading: 100%|██████████| 472/472 [00:00<00:00, 471kB/s]
Downloading: 100%|██████████| 806/806 [00:00<00:00, 848kB/s]
Downloading: 100%|██████████| 617/617 [00:00<00:00, 607kB/s]
Downloading: 100%|██████████| 492M/492M [00:06<00:00, 72.7MB/s]
Downloading: 100%|██████████| 335M/335M [00:04<00:00, 73.6MB/s]
Downloading: 100%|██████████| 547/547 [00:00<00:00, 522kB/s]
Downloading: 100%|██████████| 3.44G/3.44G [00:48<00:00, 70.2MB/s]
Downloading: 100%|██████████| 743/743 [00:00<00:00, 721kB/s]
Steps: 0%| | 0/25000 [00:00<?, ?it/s]
I was able to get it working!
I created a file called environment.yaml and put this inside:
name: ldm
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8.10
  - pip=20.3
  - cudatoolkit=11.3
  - pip:
    - git+https://github.com/ShivamShrirao/diffusers.git
    - accelerate==0.12.0
    - torchvision
    - transformers>=4.21.0
    - ftfy
    - tensorboard
    - modelcards
Next I ran:
conda env create -f environment.yaml
Followed by:
conda activate ldm
After running the dreambooth script, it finally gave me an error:
NVIDIA A100-SXM4-40GB with CUDA capability sm_80 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A100-SXM4-40GB GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
So I ran the following command, and now the dreambooth script seems to work!
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
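For anyone hitting the same sm_80 message, the mismatch can be checked before launching the script. The sketch below is an illustration only: device_arch_supported is a hypothetical helper, and its matching rule (same major version, built minor version no newer than the device's) is a simplifying assumption that ignores PTX forward compatibility. The torch.cuda calls in the last part only run if PyTorch and a GPU are present:

```python
def device_arch_supported(capability, arch_list):
    """Return True if a GPU with the given (major, minor) compute capability
    can run kernels from a build that ships the given 'sm_XY' architectures.

    Simplified rule (an assumption of this sketch): the build must contain
    code for the same major version with a minor version that is not newer
    than the device's. 'compute_XY' (PTX) entries are ignored.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue
        digits = "".join(ch for ch in arch[3:] if ch.isdigit())
        if len(digits) < 2:
            continue
        built_major, built_minor = int(digits[:-1]), int(digits[-1])
        if built_major == major and built_minor <= minor:
            return True
    return False


# Values taken from the error message above: the A100 reports sm_80.
a100 = (8, 0)
old_build = ["sm_37", "sm_50", "sm_60", "sm_70"]
print(device_arch_supported(a100, old_build))              # False
print(device_arch_supported(a100, old_build + ["sm_80"]))  # True

# Optional: query the real installation when PyTorch and a GPU are available.
try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print(cap, archs, device_arch_supported(cap, archs))
```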
I'm having trouble repeating my above success, even when using the exact same commands:
wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
conda env create -f conda.yaml
conda activate ldm
huggingface-cli login
pip install -q https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl
accelerate config
pip install triton
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
sh launch.sh
sh launch.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `6` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
libc10_cuda.so: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libc10_cuda.so: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
Downloading: 100%|██████████| 1.06M/1.06M [00:00<00:00, 2.89MB/s]
Downloading: 100%|██████████| 525k/525k [00:00<00:00, 2.17MB/s]
Downloading: 100%|██████████| 472/472 [00:00<00:00, 494kB/s]
Downloading: 100%|██████████| 806/806 [00:00<00:00, 856kB/s]
Downloading: 100%|██████████| 617/617 [00:00<00:00, 619kB/s]
Downloading: 100%|██████████| 492M/492M [00:04<00:00, 100MB/s]
Downloading: 100%|██████████| 335M/335M [00:03<00:00, 100MB/s]
Downloading: 100%|██████████| 547/547 [00:00<00:00, 530kB/s]
Downloading: 100%|██████████| 3.44G/3.44G [00:35<00:00, 98.2MB/s]
Downloading: 100%|██████████| 743/743 [00:00<00:00, 678kB/s]
Steps: 0%| | 0/25000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_dreambooth.py", line 765, in <module>
    main()
  File "train_dreambooth.py", line 712, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/diffusers/models/unet_2d_condition.py", line 296, in forward
    sample, res_samples = downsample_block(
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/diffusers/models/unet_blocks.py", line 563, in forward
    hidden_states = attn(hidden_states, context=encoder_hidden_states)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/diffusers/models/attention.py", line 169, in forward
    hidden_states = block(hidden_states, context=context)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/diffusers/models/attention.py", line 218, in forward
    hidden_states = self.attn1(self.norm1(hidden_states)) + hidden_states
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/diffusers/models/attention.py", line 291, in forward
    hidden_states = xformers.ops.memory_efficient_attention(query, key, value)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/xformers/ops.py", line 617, in memory_efficient_attention
    op = AttentionOpDispatch.from_arguments(
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/xformers/ops.py", line 580, in op
    raise NotImplementedError(f"No operator found for this attention: {self}")
NotImplementedError: No operator found for this attention: AttentionOpDispatch(dtype=torch.float32, device=device(type='cpu'), k=40, has_dropout=False, attn_bias_type=<class 'NoneType'>, kv_len=4096, q_len=4096)
Steps: 0%| | 0/25000 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/ldm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/opt/conda/envs/ldm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/ldm/bin/python', 'train_dreambooth.py',
Same error reported here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/1975
Seems like it's a PyTorch version issue: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/576#issuecomment-1250136231
The new error I have was reported here as well: https://github.com/ShivamShrirao/diffusers/issues/26
Looking at the log from when I succeeded, I see the following PyTorch / CUDA versions:
torchvision pytorch/linux-64::torchvision-0.13.1-py38_cu113 None
pytorch pytorch/linux-64::pytorch-1.12.1-py3.8_cuda11.3_cudnn8.3.2_0 None
pytorch-1.12.1
torchvision-0.13.1
So, maybe the versions are somehow getting messed up?
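One way to compare a working environment with a broken one is to print the installed versions of the packages involved. This is a small sketch using only the standard library (importlib.metadata, available from Python 3.8); installed_version and report_versions are hypothetical helper names, and the package list is just the set mentioned in this thread:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


# Package names as pip sees them (PyTorch is distributed as "torch").
packages = ["torch", "torchvision", "xformers", "diffusers", "accelerate", "triton"]
for name, version in report_versions(packages).items():
    print(f"{name:12s} {version or 'not installed'}")
```

Running this in both environments and diffing the output should show whether torch or torchvision changed between the run that worked and the one that did not.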
I think that it may have been the PyTorch version. I tried using this environment.yaml file:
name: ldm
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8.10
  - pip=20.3
  - cudatoolkit=11.3
  - pytorch=1.12.1
  - torchvision=0.13.1
  - pip:
    - git+https://github.com/ShivamShrirao/diffusers.git
    - triton
    - accelerate==0.12.0
    - torchvision
    - transformers>=4.21.0
    - ftfy
    - tensorboard
    - modelcards
And I used it as part of these commands:
wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
conda env create -f environment.yaml
conda activate ldm
pip install -q https://github.com/TheLastBen/fast-stable-diffusion/raw/main/precompiled/A100/xformers-0.0.13.dev0-py3-none-any.whl
huggingface-cli login
And it worked!
@ProGamerGov what base docker image did you use in that case?
Describe the bug
I tried running the code earlier, and nothing seemed to happen after I ran the script via cmd:
I literally start the instance, upload my images, download the models and run the following code:
I tried again without the accelerate stuff:
It'd be helpful if there were some sort of indication of what is happening behind the scenes.
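Until the script reports progress itself, one workaround is to wrap the slow stages in a small logging helper in a local copy of the script. The following is a hedged sketch only; stage and the logger name are hypothetical and nothing like this exists in train_dreambooth.py, and the sleeps stand in for the real calls:

```python
import contextlib
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("dreambooth-debug")  # hypothetical logger name
log.setLevel(logging.INFO)


@contextlib.contextmanager
def stage(name):
    """Log when a named stage starts and how long it took to finish."""
    log.info("start: %s", name)
    started = time.perf_counter()
    try:
        yield
    finally:
        log.info("done:  %s (%.2fs)", name, time.perf_counter() - started)


with stage("seeding"):
    time.sleep(0.01)  # stand-in for set_seed(args.seed)

with stage("loading models"):
    time.sleep(0.01)  # stand-in for the from_pretrained(...) calls
```

If a "start" line appears without a matching "done" line, the hang is inside that stage.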
Reproduction
No response
Logs
No response
System Info
Debian Instance on GCP with an A100 40GB graphics card.