OFA-Sys / DAFlow


The training result is blank #11

Open wang674 opened 1 year ago

wang674 commented 1 year ago

The training result is blank. [screenshot]

ShuaiBai623 commented 1 year ago

Are there screenshots of the training process?

wang674 commented 1 year ago

No error


kanthprashant commented 1 year ago

Hi @wang674 ,

Have you been able to solve this problem? I am encountering a similar issue while fine-tuning model on a custom dataset. The model produces the expected output until epoch 6, but afterwards, it begins to generate blank outputs.

kanthprashant commented 1 year ago

The network is sensitive to weight initialisation and learning rate. If you start with a proper learning rate and keep the default weight initialisation, it works well.
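"Default weight initialisation" can be read as simply not overriding PyTorch's built-in layer init. A minimal sketch of that reading; the toy model and the learning-rate value are placeholders, not the repo's actual SDAFNet or training config:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network; PyTorch layers apply their
# default (Kaiming-uniform) initialisation on construction, so keeping
# the default init just means not adding a custom init pass.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

# A conservative learning rate (assumed value, not the repo's setting).
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

x = torch.randn(1, 3, 32, 32)
loss = model(x).abs().mean()
loss.backward()
optimizer.step()
```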

hyyuan123 commented 1 year ago

The network is sensitive to weight initialisation and learning rate. If you start with a proper learning rate and keep the default weight initialisation, it works well.

May I ask if the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the testing results were not good. [screenshot]

kanthprashant commented 1 year ago

The network is sensitive to weight initialisation and learning rate. If you start with a proper learning rate and keep the default weight initialisation, it works well.

May I ask if the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the testing results were not good. [screenshot]

Hi @hyyuan123, yes, I was able to get considerably good results using the same training code.

hyyuan123 commented 1 year ago

The network is sensitive to weight initialisation and learning rate. If you start with a proper learning rate and keep the default weight initialisation, it works well.

May I ask if the test results of your retrained model are good? I trained the model using the code and data provided by the author, but the testing results were not good. [screenshot]

Hi @hyyuan123, yes, I was able to get considerably good results using the same training code.

@kanthprashant Thank you for your reply. I'll try again.

1BTU commented 1 year ago

The training result is blank. [screenshot]

Hello, I had a similar problem. I trained on my own hardware starting from the weights the author provides on GitHub, but even after many rounds of training, and even with the learning rate set very small, the predicted result was still gray and white. I later found two problems in my code. The first was in how the model was saved:

    # saving a checkpoint dictionary:
    torch.save(
        {
            "state_dict": sdafnet.state_dict(),
        },
        "savemodel.pt",
    )

    # saving the raw state_dict:
    torch.save(sdafnet.state_dict(), "savemodel.pt")

In my code, I saved the model in dictionary form but loaded it directly, i.e. passed the loaded object straight to net.load_state_dict(), which led to inaccurate prediction results. Now I save the weights like this instead:

torch.save(net.state_dict(),save_path)
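The key point is that the save format and the load call must match. A minimal sketch of the two matched pairs, using a toy nn.Linear rather than the actual SDAFNet:

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), "savemodel.pt")

# Style A: checkpoint dictionary -> unpack the dict before load_state_dict.
torch.save({"state_dict": net.state_dict()}, path)
ckpt = torch.load(path)
net.load_state_dict(ckpt["state_dict"])

# Style B: raw state_dict -> pass the loaded object to load_state_dict directly.
torch.save(net.state_dict(), path)
net.load_state_dict(torch.load(path))
```

Mixing the styles (saving A, loading as B) hands load_state_dict a dict whose keys don't match the model's parameter names, which is exactly the mismatch described above.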

The second problem was this: I commented out sdafnet = torch.nn.DataParallel(sdafnet, device_ids=range(torch.cuda.device_count())) before saving the model, and the results became good, for the following reasons:

nn.DataParallel is a module for parallel computing on multiple GPUs. It replicates a model across GPUs and runs the forward and backward passes on the input data in parallel. If you only have one GPU, there is no need to use nn.DataParallel.

If you use nn.DataParallel in your code to load trained model weights and then predict on a single GPU, it can cause inaccurate or unstable predictions, or even strange errors. During training with nn.DataParallel the model is replicated across GPUs, so the saved weights do not line up with a plain single-GPU model at load time. Additionally, since nn.DataParallel splits the input data into multiple smaller batches for processing, this can also affect the predicted results.

The solution is to load the weights directly on a single GPU rather than through the multi-GPU wrapper. If you need to train with multiple GPUs, you can use nn.DataParallel in the training code, but remove it for validation or testing and predict on a single GPU.
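If you already have checkpoints that were saved from the DataParallel wrapper, a common alternative to retraining is to strip the "module." key prefix when loading into the plain model. A sketch with a toy model, not the actual SDAFNet:

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)

# DataParallel registers the real model under the attribute "module",
# so every state_dict key is prefixed with "module.".
ckpt = wrapped.state_dict()  # e.g. "module.weight", "module.bias"

# Strip the prefix so the keys match the plain, unwrapped model.
clean = {
    (k[len("module."):] if k.startswith("module.") else k): v
    for k, v in ckpt.items()
}
net.load_state_dict(clean)
```

Saving wrapped.module.state_dict() in the first place avoids the prefix entirely.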

xxxxl888 commented 1 year ago

Hello, I'd like to ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might be the problem?

1BTU commented 1 year ago

Hello, I'd like to ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might be the problem?

You can read my response above; maybe it can help you.

xxxxl888 commented 1 year ago

Hello, I'd like to ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might be the problem?

You can read my response above; maybe it can help you.

Hello, thanks for the reminder. I'd like to ask another question: how should I compute the FID and SSIM scores? Which two datasets or sets of images should be used?

xxxxl888 commented 10 months ago

Hello, I'd like to ask: the model file saved during my training cannot be loaded, the training result is also blank, and the loss has not converged. What might be the problem?

You can read my response above; maybe it can help you.

Hello, may I ask you a few questions? Is it normal for the first result image saved during training to be blank? Also, I noticed that the images you showed above had keypoint maps and clothing segmentation results, but my saved images don't seem to have those. Is there something I'm missing or doing wrong? I have also looked at your solution, but it still doesn't clearly solve my problem. [image failed to upload]