PixArt-alpha / PixArt-sigma

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
https://pixart-alpha.github.io/PixArt-sigma-project/
GNU Affero General Public License v3.0

Where to get pixart_omega_sdxl_256px_diffusers_from512? #13

Closed zba closed 3 months ago

zba commented 3 months ago

I "fixed" the problem from #11 and downloaded models--madebyollin--sdxl-vae-fp16-fix, but another model seems to be required. How do I obtain it?

Errors when running the training script:

app-1  | 2024-03-30 10:08:37,889 - PixArt - INFO - World_size: 1, seed: 43
app-1  | 2024-03-30 10:08:37,889 - PixArt - INFO - Initializing: DDP for training
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
app-1  |     resolved_file = hf_hub_download(
app-1  |   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 111, in _inner_fn
app-1  |     validate_repo_id(arg_value)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 159, in validate_repo_id
app-1  |     raise HFValidationError(
app-1  | huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Use `repo_type` argument if needed.

Complete log:

```
app-1  |
app-1  | ==========
app-1  | == CUDA ==
app-1  | ==========
app-1  |
app-1  | CUDA Version 12.1.1
app-1  |
app-1  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
app-1  |
app-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
app-1  | By pulling and using the container, you accept the terms and conditions of this license:
app-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
app-1  |
app-1  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
app-1  |
app-1  | /usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
app-1  | and will be removed in future. Use torchrun.
app-1  | Note that --use-env is set by default in torchrun.
app-1  | If your script expects `--local-rank` argument to be set, please
app-1  | change it to read from `os.environ['LOCAL_RANK']` instead. See
app-1  | https://pytorch.org/docs/stable/distributed.html#launch-utility for
app-1  | further instructions
app-1  |
app-1  |   warnings.warn(
app-1  | /usr/local/lib/python3.10/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
app-1  |   warnings.warn(
app-1  | File renamed to: output/your_first_exp/train_log_2024-03-30_10-06-59.log
app-1  | 2024-03-30 10:08:37,869 - PixArt - INFO - Distributed environment: MULTI_GPU Backend: nccl
app-1  | Num processes: 1
app-1  | Process index: 0
app-1  | Local process index: 0
app-1  | Device: cuda:0
app-1  |
app-1  | Mixed precision type: fp16
app-1  |
app-1  | 2024-03-30 10:08:37,889 - PixArt - INFO - Config:
app-1  | data_root = 'pixart-sigma-toy-dataset'
app-1  | data = dict(
app-1  |     type='InternalDataMSSigma',
app-1  |     root='InternData',
app-1  |     image_list_json=['data_info.json'],
app-1  |     transform='default_train',
app-1  |     load_vae_feat=False,
app-1  |     load_t5_feat=False)
app-1  | image_size = 512
app-1  | train_batch_size = 2
app-1  | eval_batch_size = 16
app-1  | use_fsdp = False
app-1  | valid_num = 0
app-1  | fp32_attention = True
app-1  | model = 'PixArtMS_XL_2'
app-1  | aspect_ratio_type = 'ASPECT_RATIO_512'
app-1  | multi_scale = True
app-1  | pe_interpolation = 1.0
app-1  | qk_norm = False
app-1  | kv_compress = False
app-1  | kv_compress_config = dict(sampling=None, scale_factor=1, kv_compress_layer=[])
app-1  | num_workers = 10
app-1  | train_sampling_steps = 1000
app-1  | visualize = False
app-1  | eval_sampling_steps = 500
app-1  | model_max_length = 300
app-1  | lora_rank = 4
app-1  | num_epochs = 10
app-1  | gradient_accumulation_steps = 1
app-1  | grad_checkpointing = True
app-1  | gradient_clip = 0.01
app-1  | gc_step = 1
app-1  | auto_lr = dict(rule='sqrt')
app-1  | optimizer = dict(
app-1  |     type='CAMEWrapper',
app-1  |     lr=2e-05,
app-1  |     weight_decay=0.0,
app-1  |     eps=(1e-30, 1e-16),
app-1  |     betas=(0.9, 0.999, 0.9999))
app-1  | lr_schedule = 'constant'
app-1  | lr_schedule_args = dict(num_warmup_steps=1000)
app-1  | save_image_epochs = 1
app-1  | save_model_epochs = 5
app-1  | save_model_steps = 2500
app-1  | sample_posterior = True
app-1  | mixed_precision = 'fp16'
app-1  | scale_factor = 0.13025
app-1  | ema_rate = 0.9999
app-1  | tensorboard_mox_interval = 50
app-1  | log_interval = 1
app-1  | cfg_scale = 4
app-1  | mask_type = 'null'
app-1  | num_group_tokens = 0
app-1  | mask_loss_coef = 0.0
app-1  | load_mask_index = False
app-1  | vae_pretrained = 'output/pretrained_models/models--madebyollin--sdxl-vae-fp16-fix'
app-1  | load_from = None
app-1  | resume_from = None
app-1  | snr_loss = False
app-1  | real_prompt_ratio = 0.5
app-1  | class_dropout_prob = 0.1
app-1  | work_dir = 'output/your_first_exp'
app-1  | s3_work_dir = None
app-1  | micro_condition = False
app-1  | seed = 43
app-1  | skip_step = 0
app-1  | loss_type = 'huber'
app-1  | huber_c = 0.001
app-1  | num_ddim_timesteps = 50
app-1  | w_max = 15.0
app-1  | w_min = 3.0
app-1  | ema_decay = 0.95
app-1  | image_list_json = ['data_info.json']
app-1  |
app-1  | 2024-03-30 10:08:37,889 - PixArt - INFO - World_size: 1, seed: 43
app-1  | 2024-03-30 10:08:37,889 - PixArt - INFO - Initializing: DDP for training
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
app-1  |     resolved_file = hf_hub_download(
app-1  |   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 111, in _inner_fn
app-1  |     validate_repo_id(arg_value)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 159, in validate_repo_id
app-1  |     raise HFValidationError(
app-1  | huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Use `repo_type` argument if needed.
app-1  |
app-1  | The above exception was the direct cause of the following exception:
app-1  |
app-1  | Traceback (most recent call last):
app-1  |   File "/pixart-sigma/train_scripts/train.py", line 344, in
app-1  |     tokenizer = T5Tokenizer.from_pretrained(args.pipeline_load_from, subfolder="tokenizer")
app-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2007, in from_pretrained
app-1  |     resolved_config_file = cached_file(
app-1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 462, in cached_file
app-1  |     raise EnvironmentError(
app-1  | OSError: Incorrect path_or_model_id: 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
app-1  | [2024-03-30 10:08:41,074] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 62) of binary: /usr/bin/python
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
app-1  |     return _run_code(code, main_globals, None,
app-1  |   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
app-1  |     exec(code, run_globals)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in
app-1  |     main()
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
app-1  |     launch(args)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
app-1  |     run(args)
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
app-1  |     elastic_launch(
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
app-1  |     return launch_agent(self._config, self._entrypoint, list(args))
app-1  |   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
app-1  |     raise ChildFailedError(
app-1  | torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
app-1  | ============================================================
app-1  | train_scripts/train.py FAILED
app-1  | ------------------------------------------------------------
app-1  | Failures:
app-1  |
app-1  | ------------------------------------------------------------
app-1  | Root Cause (first observed failure):
app-1  | [0]:
app-1  |   time      : 2024-03-30_10:08:41
app-1  |   host      : abd0123ba110
app-1  |   rank      : 0 (local_rank: 0)
app-1  |   exitcode  : 1 (pid: 62)
app-1  |   error_file:
app-1  |   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
app-1  | ============================================================
app-1 exited with code 1
```

Update: it seems any PixArt-alpha model will do, e.g. PixArt-alpha/PixArt-XL-2-512x512. I downloaded it and added --pipeline_load_from /pixart-sigma/output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512 to the command (change the path to wherever you downloaded the alpha model), and now it works.
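
For anyone hitting the same traceback, here is a minimal Python sketch of that workaround. It assumes `huggingface_hub` is installed; the helper names `check_pipeline_dir` and `download_alpha_pipeline` are made up for illustration and are not part of the PixArt-sigma repo. The check only looks for the `tokenizer` subfolder because that is what the failing `T5Tokenizer.from_pretrained(..., subfolder="tokenizer")` call in the log asks for; the training script may need other subfolders as well.

```python
from pathlib import Path


def check_pipeline_dir(path):
    """Return a list of problems with a local pipeline folder (empty if it looks usable)."""
    root = Path(path)
    if not root.is_dir():
        # When the path is not an existing directory, transformers tries to use
        # the string as a Hub repo id, which leads to the HFValidationError above.
        return [f"{root} is not an existing directory"]
    problems = []
    # The tokenizer is loaded with subfolder="tokenizer", so that folder must exist.
    if not (root / "tokenizer").is_dir():
        problems.append(f"{root} has no 'tokenizer' subfolder")
    return problems


def download_alpha_pipeline(target_dir, repo_id="PixArt-alpha/PixArt-XL-2-512x512"):
    """Download the PixArt-alpha pipeline into target_dir (needs network access)."""
    # Imported lazily so the check above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=target_dir)
```

A possible way to use it: call `download_alpha_pipeline("output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512")` once, confirm that `check_pipeline_dir` returns an empty list for that folder, and then pass the same folder to the training command via `--pipeline_load_from`.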