fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

Hyperparameter Study for neural models #179

Open VaghehDashti opened 1 year ago

VaghehDashti commented 1 year ago

Hello @hosseinfani, I created this issue to put the updates for the hyperparameter study of temporal team formation. I started the run with 2 layers of [64,128] on all models [bnn, bnn_emb, tbnn, tbnn_emb, tbnn_dt2v_emb] and on the three datasets (15 runs in total). I will run the models with 3 layers of [64,128,256] afterwards.

hosseinfani commented 1 year ago

@VaghehDashti the choices of layers should be like [128, 64, 128], that is narrowing down and expanding. Please redo for this setting. Also, do this only for bnn and fnn.

VaghehDashti commented 1 year ago

> @VaghehDashti the choices of layers should be like [128, 64, 128], that is narrowing down and expanding. Please redo for this setting. Also, do this only for bnn and fnn.

@hosseinfani I will do that for 3 layers. So for two layers it should be [128, 64]? Also, could you please explain why fnn? As we discussed earlier, we're starting from the best model from the negative sampling paper, and comparing with new temporal models. I think the chosen models are appropriate for our temporal hyperparameter study. Please let me know what you think.

VaghehDashti commented 1 year ago

Hi @hosseinfani, As we discussed over the phone, I have started the hyperparameter study for bnn and bnn_emb with layers [128,64,128]. I will wait to see how long it will take to run with these hyperparameters then will run [256,128,64,128,256]. Then I will move forward with the best model to the next step (either #negative samples or dim of input embedding). Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti Agree.

hosseinfani commented 1 year ago

@VaghehDashti It might also be overfitting. Have you seen the results on the training set? What is the behaviour on the validation set?

VaghehDashti commented 1 year ago

Hi @hosseinfani, That could be another possibility. Here are the training/validation losses for the datasets.

dblp: this is for bnn with [128,64,128]; the loss for [256,...,256] looks similar, and the other folds are similar as well. f4 train_valid_loss

This is for bnn_emb, same as bnn: f4 train_valid_loss

imdb: this is for bnn with 3 layers, but both bnn and bnn_emb for 3 and 5 layers look similar. f4 train_valid_loss

uspt: this is for bnn with 3 layers, but bnn_emb for 3 and 5 layers looks similar as well. bnn with 5 layers is not done yet, but it will probably look the same. f4 train_valid_loss

Please let me know what you think.

VaghehDashti commented 1 year ago

I should also mention that the range of the loss is higher for all three datasets on both models compared to one layer.

hosseinfani commented 1 year ago

@VaghehDashti That does not make sense. Increasing epochs makes it worse??

VaghehDashti commented 1 year ago

@hosseinfani, yes, that is strange, especially for the training set. The only explanation I could come up with after some thought and a bit of searching is that the learning rate may be larger than it should be (I used the same learning rate as with 1 layer).

VaghehDashti commented 1 year ago

Hi @hosseinfani, I ran bnn and bnn_emb with 3 layers [128,64,128] on imdb with learning rate of 0.01 and 0.001. Now training and validation loss for both models decrease after each epoch. However, the performance of both models has decreased significantly. bnn with 0.01: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0 |
| aucroc | 0.488561 |

bnn_emb with 0.01: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.003404 |
| P_10 | 0.002128 |
| recall_2 | 0.002837 |
| recall_5 | 0.005674 |
| recall_10 | 0.007092 |
| ndcg_cut_2 | 0.005218 |
| ndcg_cut_5 | 0.005852 |
| ndcg_cut_10 | 0.006453 |
| map_cut_2 | 0.002837 |
| map_cut_5 | 0.003664 |
| map_cut_10 | 0.003822 |
| aucroc | 0.508516 |

bnn with 0.001: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0 |
| aucroc | 0.512166 |

bnn_emb with 0.001: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.001277 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.004255 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.001861 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000512 |
| aucroc | 0.508119 |

I have started the same experiments for dblp and uspt. I will update when the results are ready. Should I also run the experiments with these learning rates for 5 layers [256,128,64,128,256]?

hosseinfani commented 1 year ago

@VaghehDashti thanks. now we can say overfitting is happening, right? No need to add more layers.

VaghehDashti commented 1 year ago

@hosseinfani, Yes, it is overfitting the training set for imdb. If we see the same results for dblp and uspt, I think we can move forward to the next hyperparameter. I'll put the results here, when they are ready.

hosseinfani commented 1 year ago

@VaghehDashti I had a second look. I think the train loss and valid loss are very close, and sometimes the valid loss is lower. This also does not make sense; it shouldn't be like that. Please study it more. Are they running with a temporal baseline, i.e., testing on the last years?

VaghehDashti commented 1 year ago

@hosseinfani I checked the training/validation losses of the models from our previous experiments, and in all of them the training/validation losses are close, and sometimes the validation loss even goes lower for some folds. Then I checked the training/validation losses of the other folds for the new experiments, and the training loss is lower. Here are some samples:

bnn with lr 0.01: f1 train_valid_loss

bnn_emb with lr 0.001: f2 train_valid_loss

> Are they running with a temporal baseline for test on last years?

No

hosseinfani commented 1 year ago

@VaghehDashti How do you interpret this? The model can generalize very well to the valid set during training but cannot do so on the test set?

VaghehDashti commented 1 year ago

@hosseinfani, After more thinking: if the model were overfitting the training data, the validation loss should have increased in the later epochs. One reason the model performs well on validation but not on the test set is that we take the last year as the test set but shuffle the remaining data into training/validation, so the distributions of the validation and training sets are closer to each other. Another notable point is that the loss is decreasing but is still higher than the loss of the models from our paper's experiments, e.g. the loss for bnn with 1 layer on imdb: image

The loss for bnn_emb with 1 layer on imdb: image

We can see that for bnn_emb with 3 layers the loss is closer to the loss of bnn_emb with 1 layer, and the IR metrics of bnn_emb with 3 layers are better than those of bnn with 3 layers, where the loss is way higher than bnn with 1 layer.
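The split setup described above can be sketched as follows; `temporal_split` and the record shape are hypothetical names for illustration, not OpeNTF's actual code. It is a minimal sketch assuming each record carries a year.

```python
import random

def temporal_split(records, valid_ratio=0.15, seed=0):
    """Hold out the last year as the test set and shuffle the rest into
    train/valid; `records` is a list of (team, year) pairs (hypothetical shape)."""
    last_year = max(year for _, year in records)
    test = [r for r in records if r[1] == last_year]
    rest = [r for r in records if r[1] != last_year]
    random.Random(seed).shuffle(rest)            # train/valid share one distribution
    n_valid = int(len(rest) * valid_ratio)
    return rest[n_valid:], rest[:n_valid], test  # train, valid, test
```

Because train and valid come from the same shuffled pool, their losses track each other, while the temporally held-out test year can follow a different distribution, which matches the observation above.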

hosseinfani commented 1 year ago

@VaghehDashti that's why I asked you the last question. you replied No though!

VaghehDashti commented 1 year ago

@hosseinfani, Oh my bad. I thought you meant if I'm running with streaming learning.

VaghehDashti commented 1 year ago

@hosseinfani, I ran bnn with [128,64,128] with the normal train/valid/test split as we discussed yesterday. Here are the results: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.002222 |
| P_5 | 0.002857 |
| P_10 | 0.002413 |
| recall_2 | 0.00104 |
| recall_5 | 0.003499 |
| recall_10 | 0.005978 |
| ndcg_cut_2 | 0.002161 |
| ndcg_cut_5 | 0.003079 |
| ndcg_cut_10 | 0.004195 |
| map_cut_2 | 0.000715 |
| map_cut_5 | 0.00139 |
| map_cut_10 | 0.001723 |
| aucroc | 0.606182 |

This time, since the distributions of the test set and the train/valid sets are similar, the model performs properly on the test set. However, compared with the results of bnn with 1 layer of 128 nodes, the performance has decreased, which shows that overfitting is happening. Shall I move on to the next hyperparameter using 1 layer of 128 nodes? What should the next hyperparameter be? I was thinking about experimenting with the dimension of the team2vec embeddings. Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti thanks. Foremost, explain your findings about the number of layers and their sizes here in a formal way: what were the settings, the dataset, etc. Then you can go ahead with the #nns.

VaghehDashti commented 1 year ago

@hosseinfani sure. Here is a brief summary of our experimentation until now:

Please let me know if I can move forward with hyperparameter study of number of negative samples.

hosseinfani commented 1 year ago

since bnn is the baseline, I think it's better to start with the number of samplings in bnn without the nns. Then, for the best number of samplings in bnn, we go for nns.

VaghehDashti commented 1 year ago

@hosseinfani sure, I just fixed bnn.py and now we can experiment with the number of Bayesian samplings. I am going to try 5, 10, and 20 samples. Is that okay or do you have other suggestions?

hosseinfani commented 1 year ago

that's fine

VaghehDashti commented 1 year ago

Hi @hosseinfani, Hope you are doing well. Here are the results for bnn and bnn_emb on imdb with #bs (bayesian samples) 1,5,10,20.

The train/validation loss and the results for imdb from the original bnn with #bs = 1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.007994 |
| P_5 | 0.008164 |
| P_10 | 0.007533 |
| recall_2 | 0.003541 |
| recall_5 | 0.009167 |
| recall_10 | 0.017064 |
| ndcg_cut_2 | 0.008046 |
| ndcg_cut_5 | 0.009022 |
| ndcg_cut_10 | 0.012736 |
| map_cut_2 | 0.002746 |
| map_cut_5 | 0.004386 |
| map_cut_10 | 0.0055 |
| aucroc | 0.642866 |

bnn with #bs = 5: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0.000851 |
| P_10 | 0.000851 |
| recall_2 | 0 |
| recall_5 | 0.001418 |
| recall_10 | 0.002837 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0.000998 |
| ndcg_cut_10 | 0.00171 |
| map_cut_2 | 0 |
| map_cut_5 | 0.000473 |
| map_cut_10 | 0.000709 |
| aucroc | 0.539404 |

bnn with #bs=10: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.000711 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000236 |
| aucroc | 0.527167 |

and finally bnn with #bs = 20: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0.000851 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0.001418 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0.000773 |
| ndcg_cut_10 | 0.000773 |
| map_cut_2 | 0 |
| map_cut_5 | 0.000284 |
| map_cut_10 | 0.000284 |
| aucroc | 0.528675 |

bnn_emb with #bs=1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.005106 |
| P_10 | 0.006383 |
| recall_2 | 0.002837 |
| recall_5 | 0.008511 |
| recall_10 | 0.019574 |
| ndcg_cut_2 | 0.003292 |
| ndcg_cut_5 | 0.005923 |
| ndcg_cut_10 | 0.011358 |
| map_cut_2 | 0.001418 |
| map_cut_5 | 0.002813 |
| map_cut_10 | 0.004389 |
| aucroc | 0.518159 |

bnn_emb with #bs=5: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.002128 |
| P_5 | 0.001702 |
| P_10 | 0.001277 |
| recall_2 | 0.001418 |
| recall_5 | 0.002837 |
| recall_10 | 0.004255 |
| ndcg_cut_2 | 0.001646 |
| ndcg_cut_5 | 0.002032 |
| ndcg_cut_10 | 0.002634 |
| map_cut_2 | 0.000709 |
| map_cut_5 | 0.000993 |
| map_cut_10 | 0.001151 |
| aucroc | 0.530013 |

bnn_emb with #bs=10:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.001702 |
| P_10 | 0.002979 |
| recall_2 | 0.002837 |
| recall_5 | 0.002837 |
| recall_10 | 0.009362 |
| ndcg_cut_2 | 0.004255 |
| ndcg_cut_5 | 0.003257 |
| ndcg_cut_10 | 0.006238 |
| map_cut_2 | 0.002128 |
| map_cut_5 | 0.002128 |
| map_cut_10 | 0.002943 |
| aucroc | 0.531103 |

bnn_emb with #bs=20: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.00063 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000177 |
| aucroc | 0.53638 |

The results show that increasing #bs on imdb decreases bnn's performance significantly; for bnn_emb it slightly increases AUC but decreases the IR metrics.

For dblp, the results for bnn and bnn_emb with #bs=5 are ready, but for uspt I only have bnn_emb with #bs=5, not bnn. Since I run the experiments on Compute Canada, I have to specify the job's time limit, but I couldn't predict exactly how much longer training takes with a higher #bs, so I set it to 4 days and it threw an out-of-time error after completing only 2 folds! I have now set it to 8 days, but the job has not started running after more than a day :( Also, if #bs=5 needs 8 days, I don't know how long it will take for #bs=10 and #bs=20. Even dblp with #bs of 10 and 20 will probably need more than a week after the job starts on Compute Canada. Anyway, I'll post the findings from the completed experiments on dblp and uspt. Here are the results for dblp:

bnn with #bs=1:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00057 |
| P_5 | 0.000663 |
| P_10 | 0.00071 |
| recall_2 | 0.000351 |
| recall_5 | 0.000993 |
| recall_10 | 0.002118 |
| ndcg_cut_2 | 0.000538 |
| ndcg_cut_5 | 0.000806 |
| ndcg_cut_10 | 0.00133 |
| map_cut_2 | 0.000242 |
| map_cut_5 | 0.000411 |
| map_cut_10 | 0.000558 |
| aucroc | 0.63521 |

bnn with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.000537 |
| P_5 | 0.000449 |
| P_10 | 0.000414 |
| recall_2 | 0.000329 |
| recall_5 | 0.000698 |
| recall_10 | 0.001275 |
| ndcg_cut_2 | 0.000585 |
| ndcg_cut_5 | 0.000667 |
| ndcg_cut_10 | 0.000932 |
| map_cut_2 | 0.000279 |
| map_cut_5 | 0.000395 |
| map_cut_10 | 0.000473 |
| aucroc | 0.550672 |

bnn_emb with #bs=1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.001124 |
| P_5 | 0.00129 |
| P_10 | 0.001251 |
| recall_2 | 0.000668 |
| recall_5 | 0.001909 |
| recall_10 | 0.003699 |
| ndcg_cut_2 | 0.001083 |
| ndcg_cut_5 | 0.001555 |
| ndcg_cut_10 | 0.002397 |
| map_cut_2 | 0.000474 |
| map_cut_5 | 0.000792 |
| map_cut_10 | 0.001033 |
| aucroc | 0.668093 |

bnn_emb with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00044 |
| P_5 | 0.000482 |
| P_10 | 0.000404 |
| recall_2 | 0.000269 |
| recall_5 | 0.000742 |
| recall_10 | 0.001215 |
| ndcg_cut_2 | 0.00048 |
| ndcg_cut_5 | 0.000651 |
| ndcg_cut_10 | 0.000873 |
| map_cut_2 | 0.000232 |
| map_cut_5 | 0.000361 |
| map_cut_10 | 0.000422 |
| aucroc | 0.564039 |

Here again we can see that increasing #bs from 1 to 5 decreases the model's predictive power on dblp. The results of bnn_emb on uspt with #bs = 1 and 5 show the same trend.

uspt's results till now:

bnn_emb with #bs=1:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.003663 |
| P_5 | 0.004123 |
| P_10 | 0.003748 |
| recall_2 | 0.001608 |
| recall_5 | 0.004509 |
| recall_10 | 0.008141 |
| ndcg_cut_2 | 0.003652 |
| ndcg_cut_5 | 0.004531 |
| ndcg_cut_10 | 0.006094 |
| map_cut_2 | 0.001212 |
| map_cut_5 | 0.002027 |
| map_cut_10 | 0.002583 |
| aucroc | 0.698485 |

bnn_emb with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00071 |
| P_5 | 0.000782 |
| P_10 | 0.000801 |
| recall_2 | 0.000372 |
| recall_5 | 0.001011 |
| recall_10 | 0.002072 |
| ndcg_cut_2 | 0.000697 |
| ndcg_cut_5 | 0.000913 |
| ndcg_cut_10 | 0.001404 |
| map_cut_2 | 0.000272 |
| map_cut_5 | 0.00044 |
| map_cut_10 | 0.000581 |
| aucroc | 0.589078 |

Here are my final thoughts on these results:

hosseinfani commented 1 year ago

@VaghehDashti Thanks. To have a better visual comparison:

this is in contradiction to my knowledge though. Usually increasing bs should have an improving effect up until some point.

VaghehDashti commented 1 year ago

@hosseinfani, The results for imdb:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn #bs=1 | 0.0080 | 0.0082 | 0.0075 | 0.0035 | 0.0092 | 0.0171 | 0.0080 | 0.0090 | 0.0127 | 0.0027 | 0.0044 | 0.0055 | 0.6429 |
| bnn #bs=5 | 0.0000 | 0.0009 | 0.0009 | 0.0000 | 0.0014 | 0.0028 | 0.0000 | 0.0010 | 0.0017 | 0.0000 | 0.0005 | 0.0007 | 0.5394 |
| bnn #bs=10 | 0.0000 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.0014 | 0.0000 | 0.0000 | 0.0007 | 0.0000 | 0.0000 | 0.0002 | 0.5272 |
| bnn #bs=20 | 0.0000 | 0.0009 | 0.0004 | 0.0000 | 0.0014 | 0.0014 | 0.0000 | 0.0008 | 0.0008 | 0.0000 | 0.0003 | 0.0003 | 0.5287 |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=5 | 0.0021 | 0.0017 | 0.0013 | 0.0014 | 0.0028 | 0.0043 | 0.0016 | 0.0020 | 0.0026 | 0.0007 | 0.0010 | 0.0012 | 0.5300 |
| bnn_emb #bs=10 | 0.0043 | 0.0017 | 0.0030 | 0.0028 | 0.0028 | 0.0094 | 0.0043 | 0.0033 | 0.0062 | 0.0021 | 0.0021 | 0.0029 | 0.5311 |
| bnn_emb #bs=20 | 0.0000 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.0014 | 0.0000 | 0.0000 | 0.0006 | 0.0000 | 0.0000 | 0.0002 | 0.5364 |

the results for dblp:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn #bs=1 | 0.0006 | 0.0007 | 0.0007 | 0.0004 | 0.0010 | 0.0021 | 0.0005 | 0.0008 | 0.0013 | 0.0002 | 0.0004 | 0.0006 | 0.6352 |
| bnn #bs=5 | 0.0005 | 0.0004 | 0.0004 | 0.0003 | 0.0007 | 0.0013 | 0.0006 | 0.0007 | 0.0009 | 0.0003 | 0.0004 | 0.0005 | 0.5507 |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=5 | 0.0004 | 0.0005 | 0.0004 | 0.0003 | 0.0007 | 0.0012 | 0.0005 | 0.0007 | 0.0009 | 0.0002 | 0.0004 | 0.0004 | 0.5640 |

and the results of bnn_emb with #bs=1&5 on uspt:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0037 | 0.0041 | 0.0037 | 0.0016 | 0.0045 | 0.0081 | 0.0037 | 0.0045 | 0.0061 | 0.0012 | 0.0020 | 0.0026 | 0.6985 |
| bnn_emb #bs=5 | 0.0007 | 0.0008 | 0.0008 | 0.0004 | 0.0010 | 0.0021 | 0.0007 | 0.0009 | 0.0014 | 0.0003 | 0.0004 | 0.0006 | 0.5891 |

> draw the train/valid diagram in one figure for each dataset

I will have to write some code to create the figure you asked for. I will work on it, but it will take some time; I will update here later.

> this is in contradiction to my knowledge though. Usually increasing bs should have an improving effect up until some point.

I thought the same way and have not been able to come up with a reason so far. I will think more about why this is happening and let you know if I come up with anything. I would appreciate any ideas and help :)

VaghehDashti commented 1 year ago

@hosseinfani, Here are the train/val figures: dblp: it looks like the model is overfitting with #bs=5 (lower training loss and higher valid loss)

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

imdb: here again it looks like it is overfitting with #bs=5, but for #bs 10 and 20 the model's training loss does not change much during training and the validation loss fluctuates.

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

uspt: not sure what is going on.

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti Thank you. Please

I have to look into the code and see what's going on.

VaghehDashti commented 1 year ago

Hi @hosseinfani, I pushed the updated code to the misc folder, based on what you said, i.e. the curves are now the average over folds and train/val share the same color for each #bs. To run the code, you need to cd to src and then run:

python -u misc/report_loss.py

Here are the results: l 128 lr0 1 b128 e20 nns3 nsunigram_b train_valid_loss

VaghehDashti commented 1 year ago

Hi @hosseinfani, Here is a brief explanation of how Bayesian neural networks work. In forward propagation, for each input instance the model samples #bs sets of weights and hence predicts #bs outputs, i.e. for each expert the model predicts #bs probabilities; it then averages the #bs probabilities of each expert and uses the averaged probabilities to calculate the loss for that instance. After #batch_size instances, the model averages the losses and uses that average for the backpropagation step. This is a bottleneck for large datasets such as uspt, where the shape of the weights for the first layer is 2x67315x128 (one matrix for the means and one for the standard deviations). E.g., if #bs=20, for each training instance (somewhere around 0.85 * 152317 of them), the model has to sample 2x67315x128 numbers 20 times, and this is just for the weight edges between the input layer and the hidden layer. As a result, training can take several weeks. Finally, as we discussed, we're going to use bnn_emb, where the matrices will be of size (100,128) and (128,#experts), on dblp and imdb, where the #instances are fewer than uspt, with #bs up to 20.
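The forward pass described above can be sketched for a single Bayesian layer as follows; `bayesian_forward` and the sigmoid output are illustrative assumptions, a minimal numpy sketch rather than the actual bnn.py implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_forward(x, w_mu, w_sigma, n_bs):
    """Draw n_bs weight samples from N(w_mu, w_sigma), compute n_bs sigmoid
    outputs per instance, and return their average (which goes into the loss)."""
    probs = []
    for _ in range(n_bs):
        w = w_mu + w_sigma * rng.standard_normal(w_mu.shape)  # one weight sample
        probs.append(1.0 / (1.0 + np.exp(-(x @ w))))          # sigmoid outputs
    return np.mean(probs, axis=0)
```

The cost grows linearly with n_bs because every pass re-samples the full weight matrix, which is why a 2x67315x128 first layer makes uspt so slow.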

Here are the train/val loss of bnn_emb on dblp and imdb with #bs up to 10. After we have the results for #bs = 20, I will update the issue.

l 128 lr0 1 b128 e20 nns3 nsunigram_b train_valid_loss

Performance on dblp:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=5 | 0.0004 | 0.0005 | 0.0004 | 0.0003 | 0.0007 | 0.0012 | 0.0005 | 0.0007 | 0.0009 | 0.0002 | 0.0004 | 0.0004 | 0.5640 |
| bnn_emb #bs=10 | 0.0004 | 0.0004 | 0.0003 | 0.0003 | 0.0006 | 0.0009 | 0.0004 | 0.0006 | 0.0007 | 0.0002 | 0.0003 | 0.0004 | 0.5507 |

Performance on imdb:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=5 | 0.0021 | 0.0017 | 0.0013 | 0.0014 | 0.0028 | 0.0043 | 0.0016 | 0.0020 | 0.0026 | 0.0007 | 0.0010 | 0.0012 | 0.5300 |
| bnn_emb #bs=10 | 0.0043 | 0.0017 | 0.0030 | 0.0028 | 0.0028 | 0.0094 | 0.0043 | 0.0033 | 0.0062 | 0.0021 | 0.0021 | 0.0029 | 0.5311 |

Here is an explanation of the results: the model is trying to decrease the loss on the training set, and with a higher #bs the distribution of weights of bnn(_emb) overfits the distribution of the training set and hence cannot generalize well to the validation/test sets.

Please let me know what you think :)

hosseinfani commented 1 year ago

@VaghehDashti I had a look at the code. Why do we still do only 1 sample at validation? https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L139

We can do s samples like in training so the loss comparison becomes fair. Also, I think we need to change the test code to average over s predictions.

I ran on the dblp toy and here are the results before and after the fix. After the fix, up until some s, we see improvement. Also, for all s, although overfitting happens, the valid loss range is lower after the fix.

https://docs.google.com/document/d/1T9uV--4afs3qp0GoSMpOItaqbTsqQ-RjYOiwMKbU9Gk/edit?usp=sharing
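The proposed fix, s samples at validation/test with averaged predictions, can be sketched as follows; `mc_predict` and `sample_forward` are hypothetical names, and the stochastic model is simulated with additive noise rather than real weight sampling.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_predict(sample_forward, x, s):
    """Average s stochastic forward passes; `sample_forward` is assumed to
    draw fresh weights on every call, as a Bayesian layer does."""
    return np.mean([sample_forward(x) for _ in range(s)], axis=0)
```

Averaging over s passes shrinks the prediction variance roughly by a factor of sqrt(s), which is why s-sample validation losses become both lower and directly comparable to the s-sample training loss.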

VaghehDashti commented 1 year ago

@hosseinfani, I think I followed Josh Feldman's blog/code when I hard-coded #bs=1 for validation. Unfortunately, I cannot confirm because, as you know, his website is down. I will look at other implementations of bnn, see if they do the same thing, and let you know.

hosseinfani commented 1 year ago

@VaghehDashti no need to check; as seen, the results become better. Go ahead with s samplings during validation and averaging during the test.

VaghehDashti commented 1 year ago

@hosseinfani, Sure. I was thinking we can use #bs={3,5,10} instead of {5,10,20} to have the results sooner. What do you think?

hosseinfani commented 1 year ago

@VaghehDashti agree. But first see the result of toy datasets after averaging the predictions on test set.

VaghehDashti commented 1 year ago

@hosseinfani here are the results of bnn with #bs=3 and 20 epochs on toy-dblp without averaging the predictions:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.055556 |
| P_5 | 0.188889 |
| P_10 | 0.166667 |
| recall_2 | 0.055556 |
| recall_5 | 0.444444 |
| recall_10 | 0.768519 |
| ndcg_cut_2 | 0.055556 |
| ndcg_cut_5 | 0.267414 |
| ndcg_cut_10 | 0.398129 |
| map_cut_2 | 0.041667 |
| map_cut_5 | 0.156173 |
| map_cut_10 | 0.227513 |
| aucroc | 0.466864 |

with averaging the predictions:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.055556 |
| P_5 | 0.088889 |
| P_10 | 0.15 |
| recall_2 | 0.046296 |
| recall_5 | 0.212963 |
| recall_10 | 0.694444 |
| ndcg_cut_2 | 0.055556 |
| ndcg_cut_5 | 0.144127 |
| ndcg_cut_10 | 0.341101 |
| map_cut_2 | 0.037037 |
| map_cut_5 | 0.091204 |
| map_cut_10 | 0.181338 |
| aucroc | 0.391124 |

The training and validation losses are the same in both cases as expected (both use #bs for validation), but strangely the results are worse when we predict #bs times and average! I have pushed the code; please review and let me know if it looks okay so I can start re-running the experiments on the real datasets.

hosseinfani commented 1 year ago

@VaghehDashti Please debug the code, line by line, see the predictions at each iteration, and find where there is a problem.

VaghehDashti commented 1 year ago

@hosseinfani I just finished debugging the code line by line and couldn't find any problems. The most important thing, which I tested several times, is that when predicting #bs outputs in training/validation/test, the weights change every time, the outputs differ accordingly, and all of the shapes are correct.

VaghehDashti commented 1 year ago

@hosseinfani Here are the loss and results of bnn_emb on dblp and imdb with averaging predictions with #bs=3: dblp: f2 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=3 | 0.0020 | 0.0019 | 0.0018 | 0.0012 | 0.0027 | 0.0054 | 0.0021 | 0.0025 | 0.0037 | 0.0009 | 0.0013 | 0.0017 | 0.6656 |

imdb:

f2 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=3 | 0.0021 | 0.0026 | 0.0038 | 0.0014 | 0.0043 | 0.0105 | 0.0026 | 0.0038 | 0.0069 | 0.0014 | 0.0022 | 0.0031 | 0.5264 |

uspt's results are not ready yet. For dblp the IR metrics have increased, but for imdb they have decreased; AUC moved in the opposite direction in each case.

The significant drop in performance on toy-dblp didn't happen on either dblp or imdb.

Update: uspt's results are here as well:

f3 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=3 | 0.0054 | 0.0045 | 0.0039 | 0.0028 | 0.0056 | 0.0094 | 0.0054 | 0.0056 | 0.0074 | 0.0021 | 0.0029 | 0.0034 | 0.6844 |

uspt shows similar behavior to dblp, as expected.

hosseinfani commented 1 year ago

> I just finished debugging the code line by line and I couldn't find any problems. The most important thing that I tested several times is that when predicting #bs outputs in either training/validation/test the weights change every time and have different outputs and all of the shapes are correct.

I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.
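The suggested diagnostic can be sketched as follows; `prediction_spread` is a hypothetical helper, and the stochastic model is simulated here with a noisy sigmoid rather than real Bayesian weight sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

def prediction_spread(sample_forward, x, n_bs):
    """Stack n_bs stochastic predictions for the instances in x and return
    the per-expert mean and standard deviation across the samples."""
    preds = np.stack([sample_forward(x) for _ in range(n_bs)])  # (n_bs, n_experts)
    return preds.mean(axis=0), preds.std(axis=0)
```

A narrow std means the sampled networks agree on an expert's probability; a wide std signals unstable predictions under weight sampling.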

hosseinfani commented 1 year ago

Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.

VaghehDashti commented 1 year ago

> I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.

@hosseinfani, The standard deviations are between 0.005 and 0.46 for each expert after 10 epochs on toy-dblp.

VaghehDashti commented 1 year ago

> Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.

@hosseinfani, I believe the drop in loss is due to this line, where we decrease the learning rate when the validation loss does not change significantly for 10 epochs.

https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L111
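The behaviour behind that line can be sketched with a minimal reduce-on-plateau scheduler; this is an illustrative reimplementation of the idea with made-up defaults, not the code behind the link.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` once the monitored loss has not
    improved by more than `eps` for `patience` consecutive epochs."""
    def __init__(self, lr, patience=10, factor=0.5, eps=1e-4):
        self.lr, self.patience, self.factor, self.eps = lr, patience, factor, eps
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, valid_loss):
        if valid_loss < self.best - self.eps:     # significant improvement
            self.best, self.bad_epochs = valid_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:  # plateau reached: cut the lr
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

PyTorch ships this behaviour as `torch.optim.lr_scheduler.ReduceLROnPlateau`; a sudden drop in loss right after the patience window is the signature of the learning-rate cut.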

hosseinfani commented 1 year ago

> I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.
>
> @hosseinfani, The standard deviations are between 0.005 and 0.46 for each expert after 10 epochs on toy-dblp.

Can you draw a min-max-avg chart (x: experts, y: probability, sorted by decreasing average probability), something like this: image
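The data behind such a chart can be computed as follows; `min_max_avg` is a hypothetical helper, and the plotting call in the comment is only indicative.

```python
import numpy as np

def min_max_avg(preds):
    """preds: (n_bs, n_experts) sampled probabilities for one instance.
    Return per-expert (min, avg, max), sorted by decreasing average,
    ready for a band chart."""
    mn, avg, mx = preds.min(axis=0), preds.mean(axis=0), preds.max(axis=0)
    order = np.argsort(-avg)                     # decreasing average probability
    return mn[order], avg[order], mx[order]

# e.g. with matplotlib: plt.fill_between(range(len(avg)), mn, mx, alpha=.3); plt.plot(avg)
```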

hosseinfani commented 1 year ago

> Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.
>
> @hosseinfani, I believe the drop in loss is due to this line where we decrease the learning rate when the validation loss does not change significantly after 10 epochs.
>
> https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L111

Can you try patience=2 but the same 20 epochs on imdb, or any dataset that gives you results faster?

VaghehDashti commented 1 year ago

> Can you try patience=2 but the same 20 epochs on imdb, or any dataset that gives you results faster?

@hosseinfani, Sure, I put it on imdb. It will take ~1.5 days. I'll update here.

> can you draw min-max-avg chart, x: experts, y: prob, sorted on decreasing avg prob: sth like this: image

Sure, I will work on it.

VaghehDashti commented 1 year ago

@hosseinfani Here is the plot; although it is not sorted by decreasing avg probability, I think it will still give you the information you need.

This is based on the predictions on 1 instance with #bs=5: f2 test min-max-avg-plot

Here is the average on 5 instances of test set for toy-dblp with #bs=5: f2 test min-max-avg-plot

hosseinfani commented 1 year ago

@VaghehDashti Awesome. Can we have the second one for #bs=3 and 10 also? If we can show that the larger the #bs, the wider the min-max band, then we can explain why a larger #bs leads to the poor test results.

VaghehDashti commented 1 year ago

@hosseinfani Sorry, that was for #bs = 3. Here is #bs = 5:

f2 test min-max-avg-plot

bs=10:

f2 test min-max-avg-plot

bs=50:

f2 test min-max-avg-plot

As you said, we can see why a high #bs can lead to lower performance.