Hi, thanks for your attention. Try the following code:
elif args.role == 'KD':
    # Supervised cross-entropy on the labeled training nodes
    label_loss = loss_fcn(logits[train_mask], labels[train_mask])
    # Weighted combination of the hard-label loss and the distillation loss
    # computed against the teacher's logits (tea_logits)
    alpha = 0.7
    loss = (1 - alpha) * label_loss + alpha * kd_ce_loss(logits, tea_logits, temperature=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
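Here, tea_logits are the teacher model's output logits, which are held fixed while the student is trained. A minimal sketch of how they can be obtained, assuming a trained DGL-style teacher teacher_model whose forward pass takes the graph g and node features (these names are placeholders, not necessarily the ones used in the repository):

import torch

teacher_model.eval()  # the teacher is frozen during distillation
with torch.no_grad():
    # Full-graph forward pass; detach so no gradients flow into the teacher
    tea_logits = teacher_model(g, features).detach()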
where the function kd_ce_loss is defined as:
import torch.nn.functional as F

def kd_ce_loss(logits_S, logits_T, temperature=1.0):
    # Soft cross-entropy between the temperature-scaled teacher and student distributions
    beta_logits_T = logits_T / temperature
    beta_logits_S = logits_S / temperature
    p_T = F.softmax(beta_logits_T, dim=-1)
    loss = -(p_T * F.log_softmax(beta_logits_S, dim=-1)).sum(dim=-1).mean()
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return loss * (temperature ** 2)
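As a side note, up to an additive constant (the teacher's entropy, which does not depend on the student), this soft cross-entropy equals the temperature-scaled KL divergence often used for logit distillation, so the gradients are identical. A sketch of the equivalent KL form, shown only for comparison and not taken from the repository:

import torch.nn.functional as F

def kd_kl_loss(logits_S, logits_T, temperature=1.0):
    # KL(p_T || p_S) averaged over nodes; differs from kd_ce_loss only by the
    # teacher entropy, which is constant w.r.t. the student parameters
    p_T = F.softmax(logits_T / temperature, dim=-1)
    log_p_S = F.log_softmax(logits_S / temperature, dim=-1)
    return F.kl_div(log_p_S, p_T, reduction='batchmean') * (temperature ** 2)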
Then, I get the following results:
Runned 10 times
Val Accs: [0.804, 0.81, 0.814, 0.792, 0.792, 0.81, 0.812, 0.804, 0.802, 0.806]
Test Accs: [0.815, 0.812, 0.819, 0.811, 0.816, 0.822, 0.826, 0.821, 0.815, 0.806]
Average val accuracy: 0.8046 ± 0.00726911273815449
Average test accuracy on cora: 0.8163 ± 0.005586591089385334
The results may be slightly different on different devices.
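As a quick check (not part of the repository code), the reported mean and standard deviation can be recomputed from the per-run test accuracies above, e.g. with NumPy, which uses the population standard deviation by default:

import numpy as np

test_accs = [0.815, 0.812, 0.819, 0.811, 0.816, 0.822, 0.826, 0.821, 0.815, 0.806]
print(np.mean(test_accs), np.std(test_accs))  # ~0.8163, ~0.00559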
Thank you for your reply!
Dear authors,
Thank you for your excellent work. However, I am having trouble reproducing your experimental results for the baseline KD [21]: the result of KD on Cora is 77.63% rather than 83.2%. I suspect some settings are wrong, but I cannot figure out which. I would appreciate any advice on how to reproduce the results of KD.
As far as I know, the traditional logit-based knowledge distillation method has an additional loss term, so I added the corresponding code to node-level/stu-gcn/train.py and made some other basic changes. However, the result on Cora is only 0.7763.