I "fixed" the problem from #11 by downloading models--madebyollin--sdxl-vae-fp16-fix,
but another model seems to be required. How do I obtain it?
Errors when running the train script:
app-1 | 2024-03-30 10:08:37,889 - PixArt - INFO - World_size: 1, seed: 43
app-1 | 2024-03-30 10:08:37,889 - PixArt - INFO - Initializing: DDP for training
app-1 | Traceback (most recent call last):
app-1 | File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
app-1 | resolved_file = hf_hub_download(
app-1 | File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 111, in _inner_fn
app-1 | validate_repo_id(arg_value)
app-1 | File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 159, in validate_repo_id
app-1 | raise HFValidationError(
app-1 | huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Use `repo_type` argument if needed.
Complete log:
```
app-1 |
app-1 | ==========
app-1 | == CUDA ==
app-1 | ==========
app-1 |
app-1 | CUDA Version 12.1.1
app-1 |
app-1 | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
app-1 |
app-1 | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
app-1 | By pulling and using the container, you accept the terms and conditions of this license:
app-1 | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
app-1 |
app-1 | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
app-1 |
app-1 | /usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
app-1 | and will be removed in future. Use torchrun.
app-1 | Note that --use-env is set by default in torchrun.
app-1 | If your script expects `--local-rank` argument to be set, please
app-1 | change it to read from `os.environ['LOCAL_RANK']` instead. See
app-1 | https://pytorch.org/docs/stable/distributed.html#launch-utility for
app-1 | further instructions
app-1 |
app-1 | warnings.warn(
app-1 | /usr/local/lib/python3.10/dist-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
app-1 | warnings.warn(
app-1 | File renamed to: output/your_first_exp/train_log_2024-03-30_10-06-59.log
app-1 | 2024-03-30 10:08:37,869 - PixArt - INFO - Distributed environment: MULTI_GPU Backend: nccl
app-1 | Num processes: 1
app-1 | Process index: 0
app-1 | Local process index: 0
app-1 | Device: cuda:0
app-1 |
app-1 | Mixed precision type: fp16
app-1 |
app-1 | 2024-03-30 10:08:37,889 - PixArt - INFO - Config:
app-1 | data_root = 'pixart-sigma-toy-dataset'
app-1 | data = dict(
app-1 | type='InternalDataMSSigma',
app-1 | root='InternData',
app-1 | image_list_json=['data_info.json'],
app-1 | transform='default_train',
app-1 | load_vae_feat=False,
app-1 | load_t5_feat=False)
app-1 | image_size = 512
app-1 | train_batch_size = 2
app-1 | eval_batch_size = 16
app-1 | use_fsdp = False
app-1 | valid_num = 0
app-1 | fp32_attention = True
app-1 | model = 'PixArtMS_XL_2'
app-1 | aspect_ratio_type = 'ASPECT_RATIO_512'
app-1 | multi_scale = True
app-1 | pe_interpolation = 1.0
app-1 | qk_norm = False
app-1 | kv_compress = False
app-1 | kv_compress_config = dict(sampling=None, scale_factor=1, kv_compress_layer=[])
app-1 | num_workers = 10
app-1 | train_sampling_steps = 1000
app-1 | visualize = False
app-1 | eval_sampling_steps = 500
app-1 | model_max_length = 300
app-1 | lora_rank = 4
app-1 | num_epochs = 10
app-1 | gradient_accumulation_steps = 1
app-1 | grad_checkpointing = True
app-1 | gradient_clip = 0.01
app-1 | gc_step = 1
app-1 | auto_lr = dict(rule='sqrt')
app-1 | optimizer = dict(
app-1 | type='CAMEWrapper',
app-1 | lr=2e-05,
app-1 | weight_decay=0.0,
app-1 | eps=(1e-30, 1e-16),
app-1 | betas=(0.9, 0.999, 0.9999))
app-1 | lr_schedule = 'constant'
app-1 | lr_schedule_args = dict(num_warmup_steps=1000)
app-1 | save_image_epochs = 1
app-1 | save_model_epochs = 5
app-1 | save_model_steps = 2500
app-1 | sample_posterior = True
app-1 | mixed_precision = 'fp16'
app-1 | scale_factor = 0.13025
app-1 | ema_rate = 0.9999
app-1 | tensorboard_mox_interval = 50
app-1 | log_interval = 1
app-1 | cfg_scale = 4
app-1 | mask_type = 'null'
app-1 | num_group_tokens = 0
app-1 | mask_loss_coef = 0.0
app-1 | load_mask_index = False
app-1 | vae_pretrained = 'output/pretrained_models/models--madebyollin--sdxl-vae-fp16-fix'
app-1 | load_from = None
app-1 | resume_from = None
app-1 | snr_loss = False
app-1 | real_prompt_ratio = 0.5
app-1 | class_dropout_prob = 0.1
app-1 | work_dir = 'output/your_first_exp'
app-1 | s3_work_dir = None
app-1 | micro_condition = False
app-1 | seed = 43
app-1 | skip_step = 0
app-1 | loss_type = 'huber'
app-1 | huber_c = 0.001
app-1 | num_ddim_timesteps = 50
app-1 | w_max = 15.0
app-1 | w_min = 3.0
app-1 | ema_decay = 0.95
app-1 | image_list_json = ['data_info.json']
app-1 |
app-1 | 2024-03-30 10:08:37,889 - PixArt - INFO - World_size: 1, seed: 43
app-1 | 2024-03-30 10:08:37,889 - PixArt - INFO - Initializing: DDP for training
app-1 | Traceback (most recent call last):
app-1 | File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 398, in cached_file
app-1 | resolved_file = hf_hub_download(
app-1 | File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 111, in _inner_fn
app-1 | validate_repo_id(arg_value)
app-1 | File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 159, in validate_repo_id
app-1 | raise HFValidationError(
app-1 | huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Use `repo_type` argument if needed.
app-1 |
app-1 | The above exception was the direct cause of the following exception:
app-1 |
app-1 | Traceback (most recent call last):
app-1 | File "/pixart-sigma/train_scripts/train.py", line 344, in <module>
app-1 | tokenizer = T5Tokenizer.from_pretrained(args.pipeline_load_from, subfolder="tokenizer")
app-1 | File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2007, in from_pretrained
app-1 | resolved_config_file = cached_file(
app-1 | File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 462, in cached_file
app-1 | raise EnvironmentError(
app-1 | OSError: Incorrect path_or_model_id: 'output/pretrained_models/pixart_omega_sdxl_256px_diffusers_from512'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
app-1 | [2024-03-30 10:08:41,074] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 62) of binary: /usr/bin/python
app-1 | Traceback (most recent call last):
app-1 | File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
app-1 | return _run_code(code, main_globals, None,
app-1 | File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
app-1 | exec(code, run_globals)
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 198, in <module>
app-1 | main()
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 194, in main
app-1 | launch(args)
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launch.py", line 179, in launch
app-1 | run(args)
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
app-1 | elastic_launch(
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
app-1 | return launch_agent(self._config, self._entrypoint, list(args))
app-1 | File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
app-1 | raise ChildFailedError(
app-1 | torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
app-1 | ============================================================
app-1 | train_scripts/train.py FAILED
app-1 | ------------------------------------------------------------
app-1 | Failures:
app-1 |
app-1 | ------------------------------------------------------------
app-1 | Root Cause (first observed failure):
app-1 | [0]:
app-1 | time : 2024-03-30_10:08:41
app-1 | host : abd0123ba110
app-1 | rank : 0 (local_rank: 0)
app-1 | exitcode : 1 (pid: 62)
app-1 | error_file:
app-1 | traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
app-1 | ============================================================
app-1 exited with code 1
```
Upd: it seems any alpha model is meant, e.g. PixArt-alpha/PixArt-XL-2-512x512. I downloaded it and added `--pipeline_load_from /pixart-sigma/output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512` to the command (change the path to wherever you downloaded the alpha model).
And it works now.
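
For anyone hitting the same error: the traceback shows train.py calling `T5Tokenizer.from_pretrained(args.pipeline_load_from, subfolder="tokenizer")`, and the error appeared because the default path did not exist locally. Below is a rough sketch of how one might fetch the alpha model and check the folder before launching training. The helper names `missing_pipeline_parts` and `download_pipeline` are made up for illustration and are not part of the PixArt repo. `snapshot_download` is the `huggingface_hub` function, and the assumption that only the `tokenizer` subfolder needs checking comes from the traceback alone, so other subfolders may be needed as well.

```python
from pathlib import Path


def missing_pipeline_parts(pipeline_dir, required=("tokenizer",)):
    """Return a list of problems with a local pipeline folder (empty if none).

    Hypothetical helper: `required` defaults to the one subfolder that the
    traceback shows train.py loading; extend it if more turn out to be needed.
    """
    root = Path(pipeline_dir)
    if not root.is_dir():
        return [f"{root} is not an existing directory"]
    return [
        f"missing subfolder: {name}"
        for name in required
        if not (root / name).is_dir()
    ]


def download_pipeline(repo_id, local_dir):
    """Download a full model repo from the Hub into local_dir.

    Imported lazily so the check above works without huggingface_hub installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    # Path taken from the update above; change it to your own location.
    target = "output/pretrained_models/PixArt-alpha_PixArt-XL-2-512x512"
    if missing_pipeline_parts(target):
        download_pipeline("PixArt-alpha/PixArt-XL-2-512x512", target)
    print(missing_pipeline_parts(target) or "pipeline folder looks usable")
```

After that, pass the same folder via `--pipeline_load_from` as described in the update.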