Closed Bigfield77 closed 3 months ago
It seems to be old code that was left in. I removed the .SA and .Dreambooth imports (using
RUN perl -pi -e 's/from .SA import SAM\n|from .Dreambooth import DreamBooth\n//g' /pixart-sigma/diffusion/data/datasets/__init__.py
in my Dockerfile),
but there are other errors.
I removed both the .SA and .Dreambooth lines, and it now complains about a missing came_pytorch module:
Traceback (most recent call last):
File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 32, in
Installing came_pytorch with pip leads to the following when I try to run again:
[W socket.cpp:697] [c10d] The client socket has failed to connect to [Rook]:12345 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 306, in
Just install the module. Then take a look at my other issue: I was able to train on a single 4090 for 10 epochs, but inference doesn't work (out of memory). I'll check tomorrow whether I can push it further.
Ah, sorry, it seems you installed it already; I didn't see your error. Maybe you need to add an extra index for the NVIDIA repo when installing the requirements and the missing module. I'll check tomorrow which one, if you need it.
Cheers! It seems to be due to NCCL which seems to be useful for distributed training. I will try to see if I can disable that since I don't have multiple gpus
Try this
python train_scripts/train.py \
configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py \
--work-dir output/your_first_exp \
--debug \
--pipeline_load_from /pixart-sigma/output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512
You will also need the VAE model.
ah, it goes further without asking for the training to be distributed with: python -m torch.distributed.launch --nproc_per_node=1 --master_port=12345 :)
I found a Windows-specific bug: train.py line 443 shouldn't have colons in the timestamp, because Windows doesn't allow them in filenames. The original line is timestamp = time.strftime("%Y-%m-%d%H:%M:%S", time.localtime()). I replaced it with timestamp = time.strftime("%Y-%m-%d%H%M%S", time.localtime()) and it goes further.
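For anyone who wants to check a timestamp format before using it in a path, here is a minimal sketch. The helper name safe_timestamp and the constant WINDOWS_FORBIDDEN are made up for illustration and are not part of train.py; the format string is the replacement quoted above.

```python
import time

# Characters that Windows does not allow in file or directory names.
WINDOWS_FORBIDDEN = '<>:"/\\|?*'


def safe_timestamp(seconds=None):
    """Format a timestamp without colons (hypothetical helper, not from train.py)."""
    # Same format as the replacement above: the ':' separators are dropped.
    return time.strftime("%Y-%m-%d%H%M%S", time.localtime(seconds))


timestamp = safe_timestamp()
# Guard against accidentally reintroducing a forbidden character later.
assert not any(ch in WINDOWS_FORBIDDEN for ch in timestamp)
print(timestamp)
```

The same format works unchanged on Linux, so the fix does not need a platform check.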
Now it complains about not being able to pickle a local file.
It goes further if I set num_workers to 0.
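Some background on why num_workers=0 helps, as far as I understand it: on Windows, DataLoader worker processes are started with spawn, so the dataset and collate function have to be pickled and sent to each worker. Anything defined inside a function cannot be pickled. With num_workers=0 the data is loaded in the main process and nothing needs pickling. The sketch below only illustrates the pickling part with the standard library; make_local_collate and can_pickle are made-up names and not from the PixArt code.

```python
import math
import pickle


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # Local objects fail here; the exact exception type varies by Python version.
        return False


def make_local_collate():
    # A function defined inside another function is a "local object".
    def collate(batch):
        return list(batch)
    return collate


local_fn = make_local_collate()
print("local function picklable:", can_pickle(local_fn))
# math.floor is importable by name, so pickle can store a reference to it.
print("importable function picklable:", can_pickle(math.floor))
```

If you want to keep num_workers above 0 on Windows, the usual approach is to move whatever is being pickled to module level so it can be imported by name in the worker.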
2024-04-01 00:39:09,458 - PixArt - INFO - Dataset InternalDataMSSigma constructed. time: 0.00 s, length (use/ori): 88/88
2024-04-01 00:39:09,458 - PixArt - WARNING - Using valid_num=0 in config file. Available 40 aspect_ratios: ['0.25', '0.26', '0.27', '0.28', '0.32', '0.33', '0.35', '0.4', '0.42', '0.48', '0.5', '0.52', '0.57', '0.6', '0.68', '0.72', '0.78', '0.82', '0.88', '0.94', '1.0', '1.07', '1.13', '1.21', '1.29', '1.38', '1.46', '1.67', '1.75', '2.0', '2.09', '2.4', '2.5', '2.89', '3.0', '3.11', '3.62', '3.75', '3.88', '4.0']
2024-04-01 00:39:09,458 - PixArt - INFO - Automatically adapt lr to 0.00000 (using sqrt scaling rule).
2024-04-01 00:39:09,536 - PixArt - INFO - CAMEWrapper Optimizer: total 435 param groups, 435 are learnable, 0 are fix. Lr group: 435 params with lr 0.00000; Weight decay group: 435 params with weight decay 0.0.
2024-04-01 00:39:09,536 - PixArt - INFO - Lr schedule: constant, num_warmup_steps:1000.
Traceback (most recent call last):
File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 482, in
The module in model.module.h is only for multi-GPU training. Replace it with s:({model.h}, {model.w}) if you train with a single GPU. BTW, the SA and other bugs mentioned in this issue are fixed. Thanks for noticing.
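If you want the same script to run in both setups, one option is to unwrap the model before reading h and w. This is only a sketch under assumptions, not the fix applied in the repository: the multi-GPU wrapper (for example DistributedDataParallel) exposes the original model as .module, and the unwrapped model has no attribute of its own called module. The classes below are plain stand-ins so the snippet runs without torch.

```python
class CoreModel:
    """Stand-in for the real network; only carries the h and w attributes."""

    def __init__(self, h, w):
        self.h = h
        self.w = w


class DistributedWrapper:
    """Stand-in for a multi-GPU wrapper that keeps the model under .module."""

    def __init__(self, module):
        self.module = module


def unwrap(model):
    # Use .module when it exists (multi-GPU), otherwise the model itself (single GPU).
    return getattr(model, "module", model)


single = CoreModel(h=64, w=64)
multi = DistributedWrapper(CoreModel(h=64, w=64))

for model in (single, multi):
    core = unwrap(model)
    print(f"s:({core.h}, {core.w})")
```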
Thanks! I was able to train by applying the change :)
Hello, thanks for your work on Sigma!
I get the following error when trying to launch training on the toy dataset:
Traceback (most recent call last):
File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 24, in <module>
from diffusion.data.builder import build_dataset, build_dataloader, set_data_root
File "D:\stable\pixartSigma\PixArt-sigma\diffusion\data\__init__.py", line 1, in <module>
from .datasets import *
File "D:\stable\pixartSigma\PixArt-sigma\diffusion\data\datasets\__init__.py", line 1, in <module>
from .SA import SAM
ModuleNotFoundError: No module named 'diffusion.data.datasets.SA'