TencentARC / T2I-Adapter


train_seg out of memory with a batch size of 8 #38

Open zhiqi-li opened 1 year ago

zhiqi-li commented 1 year ago

Hi, the paper reports that the model is trained with a batch size of 8 on a 32 GB V100, but I run out of memory with the default settings in train_seg.py. When I set the batch size to 4, memory usage is about 27 GB. This confuses me slightly, because the learning rate may need to be adjusted if I use a batch size different from yours.

zhiqi-li commented 1 year ago

I also want to ask if the training log of train_seg.py can be provided as a reference.

MC-E commented 1 year ago

We use distributed training on four V100 GPUs, with a batch size of 2 per GPU.
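To make the arithmetic explicit: under data-parallel training, the effective batch size is the per-GPU batch size multiplied by the number of processes, so four GPUs at 2 each give the total of 8 reported in the paper. The sketch below is only illustrative. The function names are hypothetical, the base learning rate is a placeholder rather than the repository's default, and linear learning-rate scaling is a common heuristic, not something the authors say they use.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Number of samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * grad_accum_steps


def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic (an assumption here): lr grows in proportion to batch size."""
    return base_lr * new_batch / base_batch


# Setup described in the reply above: 4 GPUs, batch size 2 on each.
print(effective_batch_size(per_gpu_batch=2, num_gpus=4))

# Single GPU with batch size 4, as tried in the original question.
print(effective_batch_size(per_gpu_batch=4, num_gpus=1))

# Placeholder base learning rate, NOT taken from the repository.
placeholder_lr = 1e-5
print(linearly_scaled_lr(placeholder_lr, base_batch=8, new_batch=4))
```

On a single GPU, gradient accumulation (for example a per-GPU batch of 2 with 4 accumulation steps) is another way to reach an effective batch size of 8 without changing the learning rate, at the cost of slower steps.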

zhiqi-li commented 1 year ago

In the COCO-Stuff dataset, one image corresponds to several captions. Is the amount of data in one epoch equal to the number of captions or to the number of images? If it is the number of captions, one epoch contains more than 600k samples, and 10 epochs would seem to need much longer than 2 days.

bychen7 commented 1 year ago

I have the same issue. I trained on 8 V100 GPUs with a total batch size of 8x2, and one epoch took about 10 hours.

MC-E commented 1 year ago

In the COCO dataset, each image has 5 captions. In the currently released code, all 5 captions of an image appear in each epoch. In the training procedure we actually use, one caption is randomly selected for each image. The adapter converges quickly during training.
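The "one random caption per image" procedure described above can be sketched as a map-style dataset whose length is the number of images rather than the number of captions. This is a minimal illustration, not the repository's dataset class: the class name `RandomCaptionDataset` and the `(image_path, captions)` record layout are assumptions.

```python
import random


class RandomCaptionDataset:
    """Map-style dataset: one item per image, caption drawn at random on each access."""

    def __init__(self, records, rng=None):
        # records: list of (image_path, [caption, ...]) pairs (hypothetical layout)
        self.records = list(records)
        self.rng = rng if rng is not None else random.Random()

    def __len__(self):
        # Epoch length equals the number of images, not the number of captions.
        return len(self.records)

    def __getitem__(self, idx):
        image_path, captions = self.records[idx]
        # A different caption may be chosen each time the same image is visited.
        return {"image": image_path, "caption": self.rng.choice(captions)}


records = [
    ("img_0.jpg", ["a dog", "a brown dog", "dog on grass", "a pet", "an animal"]),
    ("img_1.jpg", ["a car", "a red car", "car on a road", "a vehicle", "a sedan"]),
]
dataset = RandomCaptionDataset(records, rng=random.Random(0))
print(len(dataset))
print(dataset[0]["image"])
```

With 5 captions per image, an epoch defined this way has one fifth as many samples as an epoch that iterates over every caption, which accounts for much of the gap in training time discussed above.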

wanghao14 commented 1 year ago

@zhiqi-li @blackmagicianZ Hi, have you successfully trained the model to achieve results close to those in the paper?

wanghao14 commented 1 year ago

@MC-E Hi, I am training your code with semantic segmentation maps as the condition. Do you remember how many epochs it takes for the adapter to converge in this setting?

lanzehua commented 1 year ago

@MC-E Hi, I would like to train your code on the CelebA dataset. Should I use the train.py script, or write my own script (for example train_human_face.py)? Thanks a lot for your help!

MERONAL commented 1 year ago

@MC-E @zhiqi-li Hi, I can't get training to run on multiple GPUs; every run uses only one GPU. How do I start multi-GPU training?

enkaranfiles commented 1 year ago

@MERONAL To kick off multi-GPU training properly, ensure you set the RANK and WORLD_SIZE parameters before diving into training (these are torch distributed training parameters). Also, be cautious about the default GPU_IDS specified in the code – they're configured for 4 GPUs (0, 1, 2, 3). You'll need to adjust these according to your own setup. Additionally, remember to run the torchrun command with the --nproc_per_node flag to orchestrate the process effectively.
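As a rough illustration of the variables mentioned above: `torchrun` sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in the environment of every process it spawns, so they usually do not need to be exported by hand when that launcher is used. The helper below is a hypothetical sketch (the name `read_dist_env` is not from the repository) that reads them with single-process fallbacks; whether train_seg.py consumes them in exactly this way is an assumption.

```python
import os


def read_dist_env(env=None):
    """Read the standard torch.distributed variables, defaulting to a single process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # process index on this node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Without a launcher, the fallbacks describe single-GPU training.
print(read_dist_env({}))

# Under a 4-process launch, the launcher would provide values such as these.
print(read_dist_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "4"}))
```

If `world_size` prints as 1 inside the training script, the script was most likely started with plain `python` instead of a distributed launcher, which would match the single-GPU behaviour reported above.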

enkaranfiles commented 1 year ago

@bychen7 Out of curiosity, how much memory did each of your GPUs have? Is there any way to train with this repo (the SD branch, excluding the XL part) on a V100 with 8-16 GB of memory? I am rather frustrated with Google: no A100 resources are available right now, so all my work is stuck.

shoutOutYangJie commented 10 months ago

How many training steps does it take for controllability to emerge? I found that the loss is already very small at the beginning of training.

dmmSJTU commented 10 months ago

@wanghao14 Hi, could I add you on WeChat to ask some questions about training? My email is dmm2020@sjtu.edu.cn

dmmSJTU commented 10 months ago

@zhiqi-li Hi, could I add you on WeChat to ask some questions about training? My email is dmm2020@sjtu.edu.cn.

wanghao14 commented 10 months ago

Post your question here instead of requesting personal contact information.

dmmSJTU commented 10 months ago

Hi Wanghao, regarding https://github.com/TencentARC/T2I-Adapter/blob/16bba674b472121d5a86e3ed6b935f91d516bc74/train_sketch.py#L231: how do you obtain the mask images in train2017_color? Are you using stuff_train2017_pixelmaps? I look forward to your reply.

wanghao14 commented 10 months ago

@dmmSJTU Yes, I have used the segmentation map provided by COCO and converted the IDs (pixel class) to RGB values using this code. I have trained an image inpainting model on segmentation condition based on the idea of T2I-Adapter and it works well.
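The ID-to-RGB conversion mentioned above can be sketched with a palette lookup. This is not the code linked in the comment; the function names are hypothetical, and the palette here is randomly generated for illustration, whereas a real run would need the fixed class-to-colour mapping used when the model was trained.

```python
import numpy as np


def make_palette(num_classes: int = 256, seed: int = 0) -> np.ndarray:
    """Build a (num_classes, 3) uint8 colour table. Random colours are an assumption."""
    rng = np.random.default_rng(seed)
    palette = rng.integers(0, 256, size=(num_classes, 3), dtype=np.uint8)
    palette[0] = 0  # map class id 0 to black (illustrative choice)
    return palette


def ids_to_rgb(id_map: np.ndarray, palette: np.ndarray) -> np.ndarray:
    """Convert an (H, W) map of class ids into an (H, W, 3) RGB image."""
    # Integer-array indexing looks up one palette row per pixel.
    return palette[id_map]


# Tiny 2x2 example map; a real map would be loaded from a COCO pixel-map PNG.
id_map = np.array([[0, 1], [2, 1]], dtype=np.uint8)
palette = make_palette()
rgb = ids_to_rgb(id_map, palette)
print(rgb.shape, rgb.dtype)
```

The same palette must be used at training and inference time, since the adapter learns to associate specific colours with specific classes.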

Hope this could help you.

You can also refer to:

22 #25

dmmSJTU commented 10 months ago

Thank you. When I ran "CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --use-env train_seg.py --ckpt models/sd-v1-4.ckpt --bsize 2", it produced:

Could you help me solve it?

dmmSJTU commented 10 months ago

@wanghao14 @zhiqi-li @MERONAL Hi, when I run train_seg.py on a single GPU, it produces:

[screenshot: Snipaste_2023-12-28_14-58-55]

Besides, when I use "CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --use-env train_seg.py --ckpt models/sd-v1-4.ckpt --bsize 2" to run on multiple GPUs, it produces:

[screenshot: 123]

Have you encountered the same problem? I look forward to your reply!

dmmSJTU commented 10 months ago

Hi wanghao, could I add you as a contact to ask a few questions? Asking here is rather cumbersome, and replies are not very timely.

-----Original message-----

From: Wang. Date: Thursday, December 28, 2023, 15:41 CST. Subject: Re: [TencentARC/T2I-Adapter] train_seg out of memory with a batch size of 8 (Issue #38)

@dmmSJTU There appears to be an issue with the initialization of distributed training. Please verify the number of GPUs available in your environment, and check whether the code contains a statement that restricts it to a particular graphics card, for example "os.environ['CUDA_VISIBLE_DEVICES'] = '0'". This issue might not be related to this code.
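The check suggested above can be sketched as a small diagnostic that compares the GPUs exposed through `CUDA_VISIBLE_DEVICES` with the number of processes requested from the launcher. The helper names are hypothetical and the sketch only inspects the environment variable; it does not query the driver, so it cannot detect GPUs that are listed but unavailable.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if the variable is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unset: every GPU on the machine is visible
    return [tok.strip() for tok in value.split(",") if tok.strip()]


def check_launch(nproc_per_node: int, env=None) -> str:
    """Report a mismatch between visible GPUs and requested processes."""
    ids = visible_gpu_ids(env)
    if ids is not None and len(ids) < nproc_per_node:
        return (f"only {len(ids)} GPU(s) visible via CUDA_VISIBLE_DEVICES, "
                f"but {nproc_per_node} processes requested")
    return "ok"


# A hard-coded os.environ['CUDA_VISIBLE_DEVICES'] = '0' inside the script
# would produce this mismatch for a 2-process launch.
print(check_launch(2, {"CUDA_VISIBLE_DEVICES": "0"}))
print(check_launch(2, {"CUDA_VISIBLE_DEVICES": "0,1"}))
```

Calling `check_launch` at the top of the training script, before any CUDA initialization, shows whether an assignment inside the code has overridden the value passed on the command line.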

wanghao14 commented 10 months ago

I am sorry, but the question you've asked isn't pertinent to this project. It seems to be related to your personal environment configuration, and I'm not interested in it.