PixArt-alpha / PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
https://pixart-alpha.github.io/PixArt-sigma-project/
GNU Affero General Public License v3.0

Missing diffusion.data.datasets.SA #11

Closed · Bigfield77 closed this 3 months ago

Bigfield77 commented 3 months ago

Hello, thanks for your work on Sigma!

I get the following error when trying to launch training on the toy dataset:

    Traceback (most recent call last):
      File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 24, in <module>
        from diffusion.data.builder import build_dataset, build_dataloader, set_data_root
      File "D:\stable\pixartSigma\PixArt-sigma\diffusion\data\__init__.py", line 1, in <module>
        from .datasets import *
      File "D:\stable\pixartSigma\PixArt-sigma\diffusion\data\datasets\__init__.py", line 1, in <module>
        from .SA import SAM
    ModuleNotFoundError: No module named 'diffusion.data.datasets.SA'

zba commented 3 months ago

Seems it's old code left over in there. I removed the .SA and .Dreambooth imports (using

RUN perl -pi -e 's/from .SA import SAM\n|from .Dreambooth import DreamBooth\n//g' /pixart-sigma/diffusion/data/datasets/__init__.py 

in my Dockerfile), but there are other errors after that.
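
For reference, a minimal sketch of what diffusion/data/datasets/__init__.py could look like with the stale imports dropped (the remaining module names are my assumption, based on the dataset classes that show up later in this thread, e.g. InternalDataMSSigma):

    # diffusion/data/datasets/__init__.py (hypothetical cleaned-up version;
    # keep whichever dataset modules actually exist in your checkout)
    from .InternalData import InternalData
    from .InternalData_ms import InternalDataMS
    # removed: from .SA import SAM
    # removed: from .Dreambooth import DreamBooth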

Bigfield77 commented 3 months ago

I removed both the .SA and .Dreambooth lines; it then complained about a missing came_pytorch module:

    Traceback (most recent call last):
      File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 32, in <module>
        from diffusion.utils.optimizer import build_optimizer, auto_scale_lr
      File "D:\stable\pixartSigma\PixArt-sigma\diffusion\utils\optimizer.py", line 15, in <module>
        from came_pytorch import CAME
    ModuleNotFoundError: No module named 'came_pytorch'
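
For anyone else hitting this: the CAME optimizer is a separate package, and if I'm not mistaken, the PyPI distribution name uses a hyphen while the import name uses an underscore:

    pip install came-pytorch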

Installing came_pytorch from pip leads to the following when trying to run again:

    [W socket.cpp:697] [c10d] The client socket has failed to connect to [Rook]:12345 (system error: 10049 - The requested address is not valid in its context.).
    Traceback (most recent call last):
      File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 306, in <module>
        accelerator = Accelerator(
      File "D:\stable\pixartSigma\venv\lib\site-packages\accelerate\accelerator.py", line 371, in __init__
        self.state = AcceleratorState(
      File "D:\stable\pixartSigma\venv\lib\site-packages\accelerate\state.py", line 777, in __init__
        PartialState(cpu, **kwargs)
      File "D:\stable\pixartSigma\venv\lib\site-packages\accelerate\state.py", line 227, in __init__
        torch.distributed.init_process_group(backend=self.backend, **kwargs)
      File "D:\stable\pixartSigma\venv\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
        func_return = func(*args, **kwargs)
      File "D:\stable\pixartSigma\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
        default_pg, _ = _new_process_group_helper(
      File "D:\stable\pixartSigma\venv\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
        raise RuntimeError("Distributed package doesn't have NCCL built in")
    RuntimeError: Distributed package doesn't have NCCL built in

zba commented 3 months ago

Just install the module. Next, have a look at my other issue: I was able to train on a single 4090 for 10 epochs, but inference doesn't work (out of memory). I'll check tomorrow whether I can push it further.

Ah, sorry, it seems you already installed it; I didn't see your error. You may need to add the extra NVIDIA index when installing the requirements and the missing module. I'll check tomorrow which one it is, if you need it.
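
In case it helps, by "extra index" I mean something along these lines (the exact URL is a guess on my part; the CUDA builds of torch, for instance, come from PyTorch's own wheel index):

    # Hypothetical: also search PyTorch's CUDA wheel index while installing
    pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118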

Bigfield77 commented 3 months ago

Cheers! It seems to be due to NCCL, which is used for distributed training. I'll see whether I can disable it, since I don't have multiple GPUs.
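
One thing I may try: the Windows builds of PyTorch ship without NCCL but do include the gloo backend, so a possible workaround (just a sketch, assuming a single process) is to bring the process group up on gloo myself:

    import os
    import torch.distributed as dist

    # Hypothetical workaround: Windows wheels of PyTorch have no NCCL,
    # so initialize a single-process group on the gloo backend instead.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12345")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)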

zba commented 3 months ago

> Cheers! It seems to be due to NCCL, which is used for distributed training. I'll see whether I can disable it, since I don't have multiple GPUs.

Try this

python train_scripts/train.py \
          configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py \
          --work-dir output/your_first_exp \
          --debug \
          --pipeline_load_from /pixart-sigma/output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512

You will also need the VAE model.
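
If the weights aren't on disk yet, something like this should fetch them (the Hugging Face repo id and target folder are my assumption, chosen to match the --pipeline_load_from path above):

    from huggingface_hub import snapshot_download

    # Hypothetical: download the 512px PixArt-alpha pipeline (VAE included)
    # into the folder the training command above points at.
    snapshot_download(
        repo_id="PixArt-alpha/PixArt-XL-2-512x512",
        local_dir="output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512",
    )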

Bigfield77 commented 3 months ago

Ah, it goes further when the training isn't launched in distributed mode via python -m torch.distributed.launch --nproc_per_node=1 --master_port=12345 :)

I found a Windows-specific bug: train.py line 443 shouldn't have colons in the timestamp, because Windows doesn't allow them in filenames:

    timestamp = time.strftime("%Y-%m-%d%H:%M:%S", time.localtime())

I replaced it with

    timestamp = time.strftime("%Y-%m-%d%H%M%S", time.localtime())

and it goes further.

Now it complains about not being able to pickle a local object.

Bigfield77 commented 3 months ago

It goes further if I set num_workers to 0.
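
(On Windows, DataLoader workers are spawned rather than forked, so their arguments get pickled, and anything locally defined in the data pipeline fails with num_workers > 0. The line below is my assumption about where this is set in the training config:)

    # Hypothetical config excerpt: load data in the main process on Windows
    # to avoid pickling locally defined objects for worker processes.
    num_workers = 0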

    2024-04-01 00:39:09,458 - PixArt - INFO - Dataset InternalDataMSSigma constructed. time: 0.00 s, length (use/ori): 88/88
    2024-04-01 00:39:09,458 - PixArt - WARNING - Using valid_num=0 in config file. Available 40 aspect_ratios: ['0.25', '0.26', '0.27', '0.28', '0.32', '0.33', '0.35', '0.4', '0.42', '0.48', '0.5', '0.52', '0.57', '0.6', '0.68', '0.72', '0.78', '0.82', '0.88', '0.94', '1.0', '1.07', '1.13', '1.21', '1.29', '1.38', '1.46', '1.67', '1.75', '2.0', '2.09', '2.4', '2.5', '2.89', '3.0', '3.11', '3.62', '3.75', '3.88', '4.0']
    2024-04-01 00:39:09,458 - PixArt - INFO - Automatically adapt lr to 0.00000 (using sqrt scaling rule).
    2024-04-01 00:39:09,536 - PixArt - INFO - CAMEWrapper Optimizer: total 435 param groups, 435 are learnable, 0 are fix. Lr group: 435 params with lr 0.00000; Weight decay group: 435 params with weight decay 0.0.
    2024-04-01 00:39:09,536 - PixArt - INFO - Lr schedule: constant, num_warmup_steps:1000.
    Traceback (most recent call last):
      File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 482, in <module>
        train()
      File "D:\stable\pixartSigma\PixArt-sigma\train_scripts\train.py", line 202, in train
        f"epoch_eta:{eta_epoch}, time_all:{t:.3f}, time_data:{t_d:.3f}, lr:{lr:.3e}, s:({model.module.h}, {model.module.w}), "
      File "D:\stable\pixartSigma\venv\lib\site-packages\torch\nn\modules\module.py", line 1688, in __getattr__
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
    AttributeError: 'PixArtMS' object has no attribute 'module'. Did you mean: 'modules'?

lawrence-cj commented 3 months ago

The module in model.module.h is only there for multi-GPU training, where the model is wrapped in DDP. Replace it with s:({model.h}, {model.w}), if you train on a single GPU. BTW, the SA and the other bugs mentioned in this issue are fixed. Thanks for noticing.
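
A variant that works for both single- and multi-GPU runs (just a sketch; the actual fix in the repo may differ) is to unwrap the model only when it is DDP-wrapped:

    # Sketch: fall back to the bare model when there is no DDP wrapper.
    net = model.module if hasattr(model, "module") else model
    log_msg = f"s:({net.h}, {net.w}),"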

Bigfield77 commented 3 months ago

Thanks! I was able to train by applying the change :)