Open Bocchi-Chan2023 opened 3 months ago
Describe the feature Currently, training HunyuanDiT requires a significant amount of VRAM. I noticed that the T5 encoder accounts for a large share of it, so I would like a setting that skips training this T5 encoder.
Motivation I noticed that during LoRA and fine-tuning, we were using a lot of VRAM relative to the size of the model.
Related resources
Additional context For example, it would be appreciated if you could add an option such as --no-t5 that disables the T5 encoder at train time.
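For illustration, here is a minimal sketch of what the requested command-line switch might look like. The flag name `--no-t5` comes from the request above, but the parser and the `use_t5` destination are hypothetical — this is not an existing option in the HunyuanDiT training scripts.

```python
import argparse

def build_parser():
    """Sketch of a training-argument parser with the requested flag."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    # Hypothetical flag: skip loading/training the T5 encoder to save VRAM.
    # Defaults to True (T5 enabled); passing --no-t5 stores False.
    parser.add_argument("--no-t5", dest="use_t5", action="store_false",
                        help="Do not load or train the T5 text encoder")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args(["--no-t5"])
    print(args.use_t5)  # False
```

The training script could then gate T5 loading on `args.use_t5`, leaving the encoder out of memory entirely rather than merely freezing it.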
Can you expand on this? I'm trying to train and wanted some insight into the GPU requirements.
It uses 20 GB of VRAM for batch-size-1 768x training for now.
Requesting to claim this task.
I found a bug when trying to train the model by following the README's instructions:
```
Traceback (most recent call last):
  File "hydit/train_deepspeed.py", line 531, in <module>
    main(get_args())
  File "hydit/train_deepspeed.py", line 208, in main
    with open(f"{experiment_dir}/args.json", 'w') as f:
PermissionError: [Errno 13] Permission denied: '/args.json'
```
After that, I checked the code and found the error in the function `create_exp_folder`:
```python
def create_exp_folder(args, rank):
    if rank == 0:
        os.makedirs(args.results_dir, exist_ok=True)
    existed_experiments = list(Path(args.results_dir).glob("*dit*"))
    if len(existed_experiments) == 0:
        experiment_index = 1
    else:
        existed_experiments.sort()
        print('existed_experiments', existed_experiments)
        experiment_index = max([int(x.stem.split('-')[0]) for x in existed_experiments]) + 1
    dist.barrier()
    model_string_name = args.task_flag if args.task_flag else args.model.replace("/", "-")
    experiment_dir = f"{args.results_dir}/{experiment_index:03d}-{model_string_name}"  # Create an experiment folder
    checkpoint_dir = f"{experiment_dir}/checkpoints"  # Stores saved model checkpoints
    if rank == 0:
        os.makedirs(checkpoint_dir, exist_ok=True)
        logger = create_logger(experiment_dir)
        logger.info(f"Experiment directory created at {experiment_dir}")
    else:
        logger = create_logger()
        experiment_dir = ""  # here!
    return experiment_dir, checkpoint_dir, logger
```
In the distributed data-parallel training setup, the subprocesses whose rank is not zero get this empty `experiment_dir`, so they try to open `/args.json`, which they don't have permission to write.
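One possible fix (a sketch, not an official patch) is to let every rank keep the `experiment_dir` it already computed before the `else` branch blanks it, and reserve the directory creation and `args.json` write for rank 0 only. The helper below isolates the deterministic path computation so all ranks agree on it; the function name and the `"dit-g2"` task flag in the demo are made up for illustration.

```python
import os
import tempfile
from pathlib import Path

def next_experiment_dir(results_dir, model_string_name):
    """Compute the experiment directory deterministically, mirroring the
    indexing logic in create_exp_folder. If every rank returns this value
    (instead of non-zero ranks returning ""), the later
    open(f"{experiment_dir}/args.json") no longer resolves to '/args.json'.
    Filesystem writes should still be guarded by `if rank == 0`."""
    existed = sorted(Path(results_dir).glob("*dit*"))
    if not existed:
        index = 1
    else:
        index = max(int(x.stem.split('-')[0]) for x in existed) + 1
    return f"{results_dir}/{index:03d}-{model_string_name}"

if __name__ == "__main__":
    # Demo with a temporary results directory; no distributed setup needed.
    with tempfile.TemporaryDirectory() as results_dir:
        os.makedirs(os.path.join(results_dir, "001-dit-g2"))
        print(next_experiment_dir(results_dir, "dit-g2"))  # ends with 002-dit-g2
```

Note that for all ranks to compute the same index, the `dist.barrier()` would have to come before the glob (so rank 0's `os.makedirs(args.results_dir)` is visible to everyone); alternatively, rank 0 could broadcast its `experiment_dir` string to the other ranks.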