Tencent / HunyuanDiT

Hunyuan-DiT : A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
https://dit.hunyuan.tencent.com/

[Feature request] Add a setting to not use t5 encoder for train #94

Open Bocchi-Chan2023 opened 3 months ago

Bocchi-Chan2023 commented 3 months ago

Describe the feature Currently, training Hunyuan-DiT requires a significant amount of VRAM. I noticed that the T5 encoder accounts for a lot of it, so I would like a setting that skips loading and training the T5 encoder.

Motivation I noticed that during LoRA training and fine-tuning, we were using a lot of VRAM relative to the size of the model.

Related resources

Additional context For example, it would be appreciated if you could add an option such as --no-t5 that disables the T5 encoder at train time.
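
For context on how such an option could work: a common pattern is to precompute the text embeddings once and cache them to disk, so the encoder never has to sit in VRAM during training. Below is a minimal sketch using Hugging Face transformers; the mT5 checkpoint name, caption list, max length, and cache path are placeholders, not HunyuanDiT's actual pipeline:

import torch
from transformers import AutoTokenizer, MT5EncoderModel

# Placeholder checkpoint id; HunyuanDiT ships its own mT5 weights.
MODEL_ID = "google/mt5-xl"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = MT5EncoderModel.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
encoder = encoder.to("cuda").eval()

captions = ["placeholder caption"]  # your training captions go here

with torch.no_grad():
    batch = tokenizer(captions, padding="max_length", max_length=256,
                      truncation=True, return_tensors="pt").to("cuda")
    embeddings = encoder(**batch).last_hidden_state  # shape (B, L, D)

# Save embeddings plus attention masks; the train loop would read these
# instead of running the encoder, so the T5 weights never occupy VRAM there.
torch.save({"emb": embeddings.cpu(), "mask": batch.attention_mask.cpu()},
           "t5_cache.pt")
del encoder
torch.cuda.empty_cache()

A --no-t5 flag could then simply switch the dataloader to these cached tensors.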

C0nsumption commented 3 months ago


Can you expand on this? I’m trying to train, wanted some insight on GPU requirements.

Bocchi-Chan2023 commented 3 months ago

> Can you expand on this? I’m trying to train, wanted some insight on GPU requirements.

It uses 20 GB of VRAM for batch size 1 at 768×768 training for now.

Sunburst7 commented 1 month ago

Requesting to claim this issue.

Sunburst7 commented 1 month ago

Bug report

I found a bug when trying to train the model following the README's instructions. Each worker process prints the same traceback (they are interleaved in the original log):

Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 531, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 208, in main
    with open(f"{experiment_dir}/args.json", 'w') as f:
PermissionError: [Errno 13] Permission denied: '/args.json'

After that, I checked the code and found the error in the function create_exp_folder:

def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    existed_experiments = list(Path(args.results_dir).glob("*dit*"))
    if len(existed_experiments) == 0:
        experiment_index = 1
    else:
        existed_experiments.sort()
        print('existed_experiments', existed_experiments)
        experiment_index = max([int(x.stem.split('-')[0]) for x in existed_experiments]) + 1
    dist.barrier()
    model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
    experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"       # Create an experiment folder
    checkpoint_dir = f"{experiment_dir}/checkpoints"                                        # Stores saved model checkpoints
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()
        experiment_dir = ""  # here! non-zero ranks are left with an empty experiment_dir

    return experiment_dir, checkpoint_dir, logger

In the distributed data-parallel training setup, the subprocesses whose rank is not zero get an empty experiment_dir, so they try to open /args.json at the filesystem root, which they don't have permission to write.
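
One possible fix (just a sketch, assuming torch.distributed is already initialized, as it must be for the dist.barrier() call above to work) is to have rank 0 decide the directory and broadcast it to the other ranks instead of leaving experiment_dir empty:

import torch.distributed as dist

# At the end of create_exp_folder, replace `experiment_dir = ""` with a
# broadcast so every rank shares rank 0's path and writes args.json there.
# broadcast_object_list sends rank 0's entries to all other ranks.
holder = [experiment_dir if rank == 0 else None]
dist.broadcast_object_list(holder, src=0)
experiment_dir = holder[0]

return experiment_dir, checkpoint_dir, logger

This also avoids relying on every rank computing the same experiment_index from its own glob of the results directory.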