Hi,
I tried using the code to reproduce the results in the paper. I did not change any line of code, but I could not reproduce the results for Joint Training and Learning without Forgetting; their values are extremely low. I ran with these commands:
python /scratch1/mengxiwu/CLGL/GCGL/train.py --dataset Aromaticity-CL --method jointtrain --backbone GCN --gpu 0 --clsIL False
python /scratch1/mengxiwu/CLGL/GCGL/train.py --dataset Aromaticity-CL --method lwf --backbone GCN --gpu 0 --clsIL False
Many thanks!
Hi, thanks for your interest,
I checked the code but didn't find any potential problem. What results did you get? LwF does not obtain high results according to our experiments. If the result is unreasonably low in the task-IL setting, it sounds like the class-IL setting was wrongly adopted (class-IL gets very low results).
For Joint Training, I got SIDER-tIL: AP: 50.11 Tox21-tIL: AP: 49.40 Arom-CL: AP: 49.76
For LwF, I got SIDER-tIL: AP: 55.37 Tox21-tIL: AP: 62.97 Arom-CL: AP: 69.25
They are very different from the values presented in the paper. Maybe you could try running the code again? I think only reading the code would not provide very insightful feedback.
Thanks for the information. Your results are quite strange, especially the Joint Training ones; results around 50% look like random guessing. I did test the code before releasing it and did not encounter this problem.
I'm currently busy with some other things, but I will definitely rerun the code soon, check for the possible problem, and get back to you ASAP. Thanks for your patience!
Hi, I have found the problem in the code.
For the LwF model, two loss functions were implemented in the code (lines 17 to 27 of GCGL/Baselines/lwf_model.py). MultiClassCrossEntropy results in higher performance and has now been set as the default function (line 174 of GCGL/Baselines/lwf_model.py). Besides, for LwF, setting the batch size to 1000, 'lambda_dist' to 0.1, and 'T' to 0.2 seems to work well in recent tests.
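For readers who want to see the shape of such a distillation term, below is a minimal sketch, not the repository's exact code, of a temperature-scaled soft cross-entropy in the spirit of MultiClassCrossEntropy as used by LwF; the function names and the exact way 'lambda_dist' and 'T' enter the total loss here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multiclass_cross_entropy(new_logits, old_logits, T=0.2):
    """Soft cross-entropy between the current model's outputs and the outputs
    recorded from the previous model, both softened by temperature T."""
    log_p_new = F.log_softmax(new_logits / T, dim=-1)
    p_old = F.softmax(old_logits / T, dim=-1)
    return -(p_old * log_p_new).sum(dim=-1).mean()

def lwf_loss(task_loss, new_logits_old_tasks, old_logits_old_tasks,
             lambda_dist=0.1, T=0.2):
    """Total LwF loss: supervised loss on the new task plus a weighted
    distillation term keeping old-task predictions close to the old model's."""
    dist = multiclass_cross_entropy(new_logits_old_tasks, old_logits_old_tasks, T)
    return task_loss + lambda_dist * dist
```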
For the jointtrain model, the reason the results look like random guessing is that the parameter reset part of jointtrain was incorrectly implemented and the model parameters were not included in the optimizer. This is now corrected. In lines 42-44 of GCGL/Baselines/jointtrain_model.py, a new method is added to reset the model, and an accompanying correction is made in lines 352-357 of GCGL/pipeline.py. Besides, users can now choose whether to reset the model when using jointtrain. I think some explanation is also required regarding the parameter reset. In our code, each time a new task t arrives, jointtrain simultaneously learns all tasks from task 1 to task t. Before this simultaneous learning, users can choose whether to reset the model parameters to a random initialization or to keep the parameters from the last step (i.e. the model jointly trained on tasks 1 to t-1). Besides, jointtrain works better with a small batch size and number of epochs, e.g. a batch size of 16 and 20 epochs, with early stopping enabled and patience set to 100.
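To illustrate the kind of fix described here, the following is an illustrative sketch only (hypothetical helper names, not the repository's code): the key point is that the optimizer is (re)built over the model's current parameters after any reset, so every trainable parameter is actually included in the optimization.

```python
import torch

def reset_model(model):
    """Re-initialize every submodule that defines reset_parameters()."""
    for module in model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()

def start_new_joint_stage(model, lr, weight_decay, reset=True):
    """Optionally reset the model, then build the optimizer over
    model.parameters() so all trainable parameters are optimized."""
    if reset:
        reset_model(model)
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
```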
Hi, I do not see a recent commit for this issue, can I find it somewhere? I attempted joint training on Aromaticity-CL and obtained similar results to WMX567:
* tskIL/0.8/val_Aromaticity_CL_2_jointtrain__None__GCN__64__64__False_bs32_20_2 0: AP: 0.5 AF: 0.0016
* ./results/tskIL/0.8/te_Aromaticity_CL_2_jointtrain__None__GCN__64__64__False_bs32_20_2.pkl AP: 0.4746 AF: -0.0284
Also, batch size 16 throws an exception at task 2 where one batch has batch size 1, causing the batch norm to error out, which is why I had to use bs32.
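If the size-1 batch is the only obstacle, one possible workaround, assuming a torch-style DataLoader is used for batching (an assumption about the setup, not a confirmed fix in this repository), is to drop the last incomplete batch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real training set, just to show the flag.
train_set = TensorDataset(torch.randn(33, 8), torch.randint(0, 2, (33,)))

# drop_last=True discards the final incomplete batch (here a single sample),
# so BatchNorm never sees a batch of size 1 in training mode.
loader = DataLoader(train_set, batch_size=16, shuffle=True, drop_last=True)
```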
I also found these two problems and corrected them myself. However, I still could not get 0.69 on SIDER-tIL (I got 0.65), and the results are still low for Aromaticity-CL. Besides, what is your batch size when doing joint training?
I created a pull request with my implementation, which gets SIDER-tIL from 0.5 to 0.65 and Aromaticity-CL from 0.5 to ~0.7 using the settings provided in the paper (batch size 128, 100 epochs, patience 100).
Do you know why we need to train 2 classes at a time for Aromaticity-CL when joint training? I trained with all 31 classes and got 0.85. The training scheme does not seem correct to me. @WeiWeic6222848
I believe it is necessary in order to create the pyramid visualization shown in the README; however, it seems the code doesn't really work for GCGL. I changed this behaviour in my own use case too and learn all the classes in one task. @WMX567 Also, since the parameters are randomized for jointtrain at the start of each task, the performance of training sequentially shouldn't differ from training all at once (the data loader for jointtrain should normally load all previously learnt tasks too, i.e. when learning the final task, jointtrain is actually training on the data of the current and all previous tasks).
The performance difference may be due to the one extra class that is not included in any of the tasks, since each task has 2 classes. This is only my assumption, and we should wait for a concrete answer from the author.
I see. Thank you! @WeiWeic6222848
Hi, can you see the update now? The commit was somehow blocked before, but it should be visible now. I will check the bug on Aromaticity ASAP. With the setting I mentioned in my last reply, my recent attempts on SIDER got results around 0.68.
A batch size of 16 seems to work well on SIDER. I will check the problem on Arom ASAP.
This is actually a problem we carefully considered. Initially, we just intuitively trained all classes simultaneously. However, we later realized this is inconsistent with the other baselines. Specifically, in the task-IL setting, each task for the other baselines is a 2-class classification problem during both training and testing. For jointtrain, the model should be jointly trained on all 'tasks'. Therefore, only when we split the classes into 2-class tasks is each task a 2-class classification problem, the same as the tasks used by the other baselines. If jointtrain is simply trained on all classes (e.g. 20 classes), then jointtrain is solving one task with 20 classes, which is inconsistent with the 2-class tasks of the other baselines. For 2-class classification, the model only has two output heads (the dimension of the output logits), so for a given datum the output is only 2-dimensional, while for 20-class classification the model output is 20-dimensional for each datum. These two cases are essentially different in terms of model optimization.
To summarize, both training on all classes as one task and splitting the classes into small tasks may be called 'joint training', but the splitting ensures a setting that is more consistent between jointtrain and the other baselines.
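As a concrete illustration of the two setups, here is a minimal sketch (hypothetical modules, not the repository's GCN backbone): a separate 2-way output head per task versus a single head over all classes.

```python
import torch
import torch.nn as nn

hidden_dim, n_tasks, n_classes = 64, 10, 20

# Task-IL-style jointtrain: one 2-dimensional output head per 2-class task.
per_task_heads = nn.ModuleList([nn.Linear(hidden_dim, 2) for _ in range(n_tasks)])

# Training on all classes at once: a single 20-dimensional output head.
single_head = nn.Linear(hidden_dim, n_classes)

h = torch.randn(4, hidden_dim)        # embeddings from some backbone
logits_task_3 = per_task_heads[3](h)  # shape (4, 2): logits for task 3 only
logits_all = single_head(h)           # shape (4, 20): logits over all classes
```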
It is correct that with the random initialization at the beginning of each task, the learning stages in the sequence are independent, and learning the final task is the same as training on all tasks together (but please note the difference between training with all classes together and training with all tasks together, as mentioned in another reply). The reason we still designed such a sequential learning procedure is that we want to show the performance of jointly training on different numbers of tasks: at the beginning, the reported performance is that of training on one task, then that of jointly training on two tasks, and so on.
Besides, in the last update, users can now specify whether to reset the parameters of the model during joint training. If we do not reset the parameters, things are different and the performance generally increases, as briefly explained in another reply. Since the other baselines do not start afresh, resetting the parameters for joint training seems inconsistent and increases the optimization difficulty for joint training. Without resetting the parameters, when each new task arrives, jointtrain starts from a model state that is already good for all previous tasks, so it may be easier for the model to find a new state that is also good for the new task.
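The overall procedure can be summarized with the pseudocode-style sketch below (all helper names are hypothetical, not the repository's API): when task t arrives, jointtrain learns tasks 1..t together, optionally after a parameter reset.

```python
def joint_train_sequence(model, tasks, reset_model, make_optimizer,
                         train_epoch, evaluate, n_epochs, reset_params=True):
    """Report joint-training performance on 1, 2, ..., len(tasks) tasks in sequence."""
    results = []
    for t in range(len(tasks)):
        if reset_params:
            reset_model(model)             # start this stage from a random init
        optimizer = make_optimizer(model)  # built over the current parameters
        seen = tasks[: t + 1]              # data of all tasks observed so far
        for _ in range(n_epochs):
            train_epoch(model, optimizer, seen)
        results.append(evaluate(model, seen))
    return results
```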
Thank you so much! What about the batch size for LwF? Also, please let us know the batch size for Arom as well once you find the problem.
Yes, thanks for the update!
Sure. About the batch size for LwF on SIDER: as mentioned in my previous reply, setting the batch size to 1000, 'lambda_dist' to 0.1, and 'T' to 0.2 seems to work well, and the number of epochs should be 100. For Arom with LwF, training for 20 epochs with a batch size of 1000, 'lambda_dist' of 0.1, and 'T' of 0.2, with early stopping turned on, works well in my latest attempt.
As for joint training, I have fixed two bugs. The first is the parameter reset problem again,
while the second is more of an improvement: while learning each task, the data from different classes are now mixed and shuffled, so within each batch the data may come from different tasks. The losses of the different tasks are calculated separately and then summed up as the joint training loss for the current batch. https://github.com/QueuQ/CGLB/blob/3e0debf02e582610d05274b44c4c09fc1c1fe4b2/GCGL/Baselines/jointtrain_model.py#L98-L119 I tested it with a batch size of 1000 and 20 epochs.
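For intuition, here is a rough sketch of that idea (not the exact code at the link above; the function and argument names are illustrative): in a shuffled batch that mixes tasks, the loss is computed per task on that task's output head and examples, then summed.

```python
import torch
import torch.nn.functional as F

def joint_batch_loss(logits_per_task, labels, task_ids):
    """logits_per_task: list where entry t holds head t's logits for the whole batch,
    labels: (batch,) class labels within each 2-class task,
    task_ids: (batch,) index of the task each example belongs to."""
    total = 0.0
    for t, logits in enumerate(logits_per_task):
        mask = task_ids == t
        if mask.any():
            # Cross-entropy restricted to the examples of task t in this batch.
            total = total + F.cross_entropy(logits[mask], labels[mask])
    return total
```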