Open BugsMaker0513 opened 9 months ago
I didn't modify any config or source code; I only changed the dataset path in the config.
Did the error occur while running evaluation during training?
Yes. Here are some logs from just before the error:
01/18/2024 20:46:06 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
{'force_upcast', 'scaling_factor'} was not found in config. Values will be initialized to default values.
{'attention_type', 'transformer_layers_per_block', 'resnet_out_scale_factor', 'time_cond_proj_dim', 'resnet_time_scale_shift', 'time_embedding_type', 'num_attention_heads', 'conv_out_kernel', 'addition_embed_type', 'addition_embed_type_num_heads', 'dropout', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'cross_attention_norm', 'class_embed_type', 'timestep_post_act', 'encoder_hid_dim', 'resnet_skip_time_act', 'time_embedding_dim', 'mid_block_type', 'class_embeddings_concat', 'time_embedding_act_fn', 'mid_block_only_cross_attention', 'projection_class_embeddings_input_dim', 'upcast_attention', 'encoder_hid_dim_type', 'addition_time_embed_dim'} was not found in config. Values will be initialized to default values.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel: ['conv_norm_out.weight, conv_norm_out.bias, conv_out.weight, conv_out.bias']
01/18/2024 20:46:26 - INFO - src.models.unet_3d - loaded temporal unet's pretrained weights from pretrained_weights/sd-image-variations-diffusers/unet ...
{'motion_module_decoder_only', 'motion_module_mid_block', 'class_embed_type', 'use_inflated_groupnorm', 'unet_use_cross_frame_attention', 'motion_module_type', 'motion_module_resolutions', 'upcast_attention', 'resnet_time_scale_shift', 'motion_module_kwargs'} was not found in config. Values will be initialized to default values.
01/18/2024 20:46:34 - INFO - src.models.unet_3d - Loaded 0.0M-parameter motion module
01/18/2024 20:46:49 - INFO - __main__ - Missing key for pose guider: 2
01/18/2024 20:46:49 - INFO - __main__ - Running training
01/18/2024 20:46:49 - INFO - __main__ - Num examples = 499
01/18/2024 20:46:49 - INFO - __main__ - Num Epochs = 240
01/18/2024 20:46:49 - INFO - __main__ - Instantaneous batch size per device = 4
01/18/2024 20:46:49 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
01/18/2024 20:46:49 - INFO - __main__ - Gradient Accumulation steps = 1
01/18/2024 20:46:49 - INFO - __main__ - Total optimization steps = 30000
Steps: 1%|▋ | 200/30000 [04:36<11:37:43, 1.40s/it, lr=1e-5, step_loss=0.0665]
01/18/2024 20:51:26 - INFO - __main__ - Running validation...
2024-01-18 20:51:28.890883153 [W:onnxruntime:, session_state.cc:1162 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-01-18 20:51:28.890915295 [W:onnxruntime:, session_state.cc:1164 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████| 20/20 [00:07<00:00, 2.78it/s]
100%|██████████| 1/1 [00:00<00:00, 53.12it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████| 20/20 [00:07<00:00, 2.78it/s]
100%|██████████| 1/1 [00:00<00:00, 201.49it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████| 20/20 [00:07<00:00, 2.78it/s]
100%|██████████| 1/1 [00:00<00:00, 206.58it/s]
The passed generator was created on 'cpu' even though a tensor on cuda:0 was expected. Tensors will be created on 'cpu' and then moved to cuda:0. Note that one can probably slighly speed up this function by passing a generator that was created on the cuda:0 device.
100%|██████████| 20/20 [00:07<00:00, 2.78it/s]
100%|██████████| 1/1 [00:00<00:00, 197.38it/s]
Steps: 1%|▊ | 249/30000 [06:25<11:27:41, 1.39s/it, lr=1e-5, step_loss=0.111]
Traceback (most recent call last):
This looks like a single-GPU training issue; I'll try to fix it. For now, you can skip evaluation during training and instead test with a checkpoint saved at a later stage.
Thanks!
OSError: [Errno 16] Device or resource busy: './Moore-AnimateAnyone/mlruns/577895558829709631/215c420ddf57411b86a37f6cd75c5667/meta.yaml'
Steps: 31%|█████████████▊ | 9171/30000 [8:10:47<18:34:40, 3.21s/it, lr=1e-5, step_loss=0.0424]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1054950 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1054949) of binary: ./anaconda3/envs/moore/bin/python
I also hit an error when training with multiple GPUs.
Hit the same problem with 2-GPU training:
Steps: 1%|▋ | 200/30000 [04:18<9:39:59, 1.17s/it, lr=1e-5, step_loss=0.0821]
Traceback (most recent call last):
  File "train_stage_1.py", line 728, in <module>
    main(config)
  File "train_stage_1.py", line 564, in main
    model_pred = net(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 581, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/accelerate/utils/operations.py", line 569, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "train_stage_1.py", line 87, in forward
    model_pred = self.denoising_unet(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "Moore-AnimateAnyone/src/models/unet_3d.py", line 493, in forward
    sample, res_samples = downsample_block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "Moore-AnimateAnyone/src/models/unet_3d_blocks.py", line 442, in forward
    hidden_states = attn(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "Moore-AnimateAnyone/src/models/transformer_3d.py", line 140, in forward
    hidden_states = block(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "Moore-AnimateAnyone/src/models/mutual_self_attention.py", line 180, in hacked_basic_transformer_inner_forward
    norm_hidden_states[_uc_mask],
IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, 4096, 320] at index 0
Steps: 1%|▋ | 200/30000 [04:19<10:43:40, 1.30s/it, lr=1e-5, step_loss=0.
It looks like the error happens right after the first validation.
I didn't run into this problem when training with two GPUs, though I hit another one (posted above).
It's because eval turns on do_classifier_free_guidance in ReferenceAttentionControl; it needs to be reset for training.
We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?
On a 3090 it reproduces consistently with both single-GPU and multi-GPU training; the only workaround is to comment out the evaluation code.
Can a 3090 actually run stage 1 training?
Full fine-tuning won't fit; LoRA training works.
The ReferenceAttentionControl for training and testing is initialized separately; there should be no interaction between them.
The test-time initialization overwrites the training-time one, and since the hook is set up in __init__ rather than in forward, it never gets reset. Print the do_classifier_free_guidance behavior during training and you'll see it is inconsistent before and after validation.
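To make the overwrite concrete, here is a minimal, self-contained sketch (hypothetical class and names, not the repo's actual ReferenceAttentionControl code) of the pattern where a flag captured in __init__ is baked into a patched forward, so whichever control object is constructed last wins:

# Hypothetical illustration: the flag is captured when the hook is installed in
# __init__, so a later construction silently replaces the earlier hook.
class Block:
    def forward(self, x):
        return x

class AttnControl:
    def __init__(self, block, do_classifier_free_guidance):
        def hacked_forward(x):
            # whatever flag was passed at construction time is baked in here
            return f"forward with cfg={do_classifier_free_guidance}"
        block.forward = hacked_forward  # overwrites any previously installed hook

block = Block()
AttnControl(block, do_classifier_free_guidance=False)  # training-time setup
print(block.forward(None))  # forward with cfg=False

AttnControl(block, do_classifier_free_guidance=True)   # validation-time setup
print(block.forward(None))  # forward with cfg=True -- the training hook is gone

# Unless the training-mode control is registered again after validation, the
# next training step still runs with the validation flag, and the _uc_mask no
# longer matches the training batch, producing the IndexError above.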
Yes, you're right; that's what causes it.
Setting it up in forward fixes it. By the way, is your training data all static, solid-color backgrounds? The motion module I'm training is far worse than AnimateDiff-style results, with very strong unnatural jitter. Have you run experiments to check whether this is a data-bias issue or a problem with the method itself?
A simple fix: during eval, pass a deep copy of the net to log_validation so the training net isn't overwritten. Change
reference_unet = ori_net.reference_unet
denoising_unet = ori_net.denoising_unet
to
reference_unet = copy.deepcopy(ori_net.reference_unet)
denoising_unet = copy.deepcopy(ori_net.denoising_unet)
(this needs import copy at the top of the script).
I'm hitting the same problem and haven't looked closely at the code yet. How do you reset it? Could you give a concrete modification?
Same problem here. How do you reset it? Could you give a concrete modification?
> We didn't encounter this issue in our environment, whether with a single GPU or multiple GPUs. Can you consistently reproduce this problem during your training?
Yes, it reproduces consistently, but I don't understand why it crashes only after training several more steps following the first validation, rather than on the very first step after validation.
Exactly the same error, also at step 249. Solved. First, you're using the UBC dataset, right? One UBC sample has a bug during DWPose extraction, so after removing it 499 samples remain. And your batch size is 2, right? That leaves only 1 sample in the last batch, and the author's code doesn't handle an incomplete batch, hence this error.
The fix is simple: pass drop_last=True when creating the dataloader.
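For reference, a minimal sketch of that change (variable names assumed, not copied from train_stage_1.py):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset with 499 samples, matching the trimmed UBC split described above.
train_dataset = TensorDataset(torch.randn(499, 3, 64, 64))

train_dataloader = DataLoader(
    train_dataset,
    batch_size=2,
    shuffle=True,
    drop_last=True,  # discard the final incomplete batch (499 % 2 == 1)
)

# Without drop_last=True the last batch contains a single sample, which breaks
# code paths that assume every batch is full.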
I declare this issue solved. The people above blaming do_classifier_free_guidance are wrong; that's not it. The cause is the batch size: the author didn't set drop_last=True when creating the dataloader, yet the code assumes a full batch. If the last batch isn't full, you get IndexError: The shape of the mask [0] at index 0 does not match the shape of the indexed tensor [1, your_batch*768, 320] at index 0.
PS: how I found it: with 499 UBC samples and batch_size=2, it always finishes 249 steps and crashes on step 250. With my own 1000-sample dataset it doesn't crash. After adding drop_last=True it runs normally.
Another approach is to wrap another layer after inference:
reference_control_writer = ReferenceAttentionControl(
    reference_unet,
    do_classifier_free_guidance=False,
    mode="write",
    fusion_blocks="full",
)
reference_control_reader = ReferenceAttentionControl(
    denoising_unet,
    do_classifier_free_guidance=False,
    mode="read",
    fusion_blocks="full",
)
> The cause is the batch size: drop_last=True wasn't set when creating the dataloader, and the code assumes a full batch.
I set batch_size=1 and still get the same error.
> A simple fix: during eval, pass a deep copy of the net to log_validation (copy.deepcopy(ori_net.reference_unet), copy.deepcopy(ori_net.denoising_unet)) so the training net isn't overwritten.
May encounter OOM if VRAM is not enough.
> Another approach is to wrap another ReferenceAttentionControl writer/reader (with do_classifier_free_guidance=False) after inference.
it works, thanks!
> The cause is the batch size; adding drop_last=True fixed it.
The other two fixes both solve it. Your case may be specific to your setup: I'm not using that dataset and I still hit this bug with a batch size of 1.
Adding the following code after log_validation is the best solution I have tested, as it prevents OOM without requiring deep copies:
reference_control_writer = ReferenceAttentionControl(
    reference_unet,
    do_classifier_free_guidance=False,
    mode="write",
    fusion_blocks="full",
)
reference_control_reader = ReferenceAttentionControl(
    denoising_unet,
    do_classifier_free_guidance=False,
    mode="read",
    fusion_blocks="full",
)
> Is your training data all static, solid-color backgrounds? The motion module I'm training is far worse than AnimateDiff-style results, with very strong unnatural jitter.
Friend, I'm running into this too; could we compare notes?
> Full fine-tuning won't fit on a 3090; LoRA training works.
Friend, do you have a LoRA example? What needs to be modified?
File "Moore-AnimateAnyone/src/models/mutual_self_attention.py", line 180, in hacked_basic_transformer_inner_forward norm_hidden_states[_uc_mask], IndexError: The shape of the mask [2] at index 0 does not match the shape of the indexed tensor [3, 9216, 320] at index 0 Steps: 1%|▎ | 249/30000 [06:30<12:57:21, 1.57s/it, lr=1e-5, step_loss=0.107]