YuqiYang213 / MLoRE

Project Page for "Multi-Task Dense Prediction via Mixture of Low-Rank Experts"

Training issue on Task-sharing generic path #3

Closed: Yunpeng-Qi closed this issue 2 months ago

Yunpeng-Qi commented 2 months ago

Thanks for your excellent work! After reading the paper, I'm still a bit confused about the Task-sharing generic path.

In Section 3.3 (Mixture of Low-Rank Experts), the paragraph on the task-sharing generic path mentions that "As all task features will go through this generic convolution, it will be optimized by the gradients of different tasks simultaneously, which can help extract common features among all tasks." However, it later states, "During the training process, we stop the gradients of this path for further back-propagation. The gradient is back-propagated through the other two paths."

I'm really curious about how you train the Task-sharing generic path if the gradients are not being back-propagated through it. Could you please provide more details on this process? Thank you!

YuqiYang213 commented 2 months ago

I appreciate your interest in our work. Our paper describes the stop-gradient as "stop the gradients of this path for further back-propagation". The gradients from the different tasks mentioned here will only be used to update the weights of the generic convolution and will not affect the modules before it. In addition, you can refer to this line for a better understanding of how it works. I hope this helps.
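
For illustration, here is a minimal PyTorch sketch of this stop-gradient pattern. The names `backbone` and `share_conv` are placeholders standing in for the modules before the block and the task-sharing generic convolution; they are not the repository's actual identifiers.

```python
import torch
import torch.nn as nn

# Placeholder modules: `backbone` stands in for everything before the
# MLoRE block, `share_conv` for the task-sharing generic convolution.
backbone = nn.Linear(8, 8)
share_conv = nn.Linear(8, 8)

feat = backbone(torch.randn(4, 8))

# detach() cuts the graph at `feat`: the loss still reaches share_conv's
# own parameters, but nothing flows back to `backbone` through this branch.
out = share_conv(feat.detach())
out.sum().backward()

print(share_conv.weight.grad is not None)  # True: the generic conv is updated
print(backbone.weight.grad)                # None: no gradient through this branch
```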

Yunpeng-Qi commented 2 months ago

Ok, thank you!

Yunpeng-Qi commented 2 months ago

Hi Yuqi, I'm still confused about the optimization of the generic path.

Based on PyTorch's default computation graph, the generic path would be updated by the loss from each task. However, when using the operation `out = self.bn_alltask + self.conv2task * prob_mix[:, 0] + self.share_conv(out.detach())`, it seems that the parameters of `self.share_conv` would not be updated: shouldn't the `detach()` operation prevent `self.share_conv` from taking part in the gradient computation?

I still don't fully understand the purpose of using detach() in this case. Could you kindly clarify this?

My understanding is as follows: you first update the parameters of the task-sharing generic path using only the multi-task loss; afterwards, you stop the gradient of the generic path and optimize the low-rank expert modules. Could you confirm whether this understanding is correct?

Thank you very much for your time and assistance!

Best regards, Yunpeng


YuqiYang213 commented 2 months ago

Sorry for the late reply. We train our MLoRE model end-to-end in a single training phase, and all the parameters in the model are updated by gradients. The detach() operation stops the gradients of the `out` tensor from being propagated to the modules before it; however, the gradient from the multi-task loss to `self.share_conv` itself is propagated as usual. In conclusion, the parameters of `self.share_conv` are updated by the back-propagated gradient, but no gradient is propagated through this branch to the modules before it. If you still feel confused, you can print the gradient of `self.share_conv.weight` during training, which should make this clearer. I hope this addresses your confusion.
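
To make the full picture concrete, here is a small follow-up sketch (again with placeholder names rather than the repository's) that adds a second, non-detached branch, mirroring the paper's statement that the gradient is back-propagated through the other paths:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(8, 8)    # modules before the MLoRE block
share_conv = nn.Linear(8, 8)  # task-sharing generic path (detached input)
task_path = nn.Linear(8, 8)   # stands in for a task-specific / low-rank path

feat = backbone(torch.randn(4, 8))

# The generic path sees a detached copy of the features;
# the other path keeps the graph intact.
out = share_conv(feat.detach()) + task_path(feat)
out.sum().backward()

print(share_conv.weight.grad.abs().sum() > 0)  # tensor(True): generic conv still trains
print(backbone.weight.grad.abs().sum() > 0)    # tensor(True): gradient arrives via task_path only
```

Printing `share_conv.weight.grad` during training, as suggested above, should show the same behavior in the actual code.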

Yunpeng-Qi commented 2 months ago

Thank you so much!
