fani-lab / OpeNTF

Neural machine learning methods for the Team Formation problem.

Hyperparameter Study for neural models #179

Open VaghehDashti opened 1 year ago

VaghehDashti commented 1 year ago

Hello @hosseinfani, I created this issue to put the updates for the hyperparameter study of temporal team formation. I started the run with 2 layers of [64,128] on all models [bnn, bnn_emb, tbnn, tbnn_emb, tbnn_dt2v_emb] and on the three datasets (15 runs in total). I will run the models with 3 layers of [64,128,256] afterwards.

hosseinfani commented 1 year ago

@VaghehDashti the choices of layers should be like [128, 64, 128], that is narrowing down and expanding. Please redo for this setting. Also, do this only for bnn and fnn.

VaghehDashti commented 1 year ago

> @VaghehDashti the choices of layers should be like [128, 64, 128], that is narrowing down and expanding. Please redo for this setting. Also, do this only for bnn and fnn.

@hosseinfani I will do that for 3 layers. So for two layers it should be [128, 64]? Also, could you please explain why fnn? As we discussed earlier, we're starting from the best model from the negative sampling paper, and comparing with new temporal models. I think the chosen models are appropriate for our temporal hyperparameter study. Please let me know what you think.

VaghehDashti commented 1 year ago

Hi @hosseinfani, As we discussed over the phone, I have started the hyperparameter study for bnn and bnn_emb with layers [128,64,128]. I will wait to see how long it will take to run with these hyperparameters then will run [256,128,64,128,256]. Then I will move forward with the best model to the next step (either #negative samples or dim of input embedding). Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti Agree.

hosseinfani commented 1 year ago

@VaghehDashti It might also be overfitting. Have you seen the results on the training set? What is the behaviour on the validation set?

VaghehDashti commented 1 year ago

Hi @hosseinfani, That could be another possibility. Here are the training/validation losses for the datasets.

dblp: this is for bnn with [128,64,128]; the loss for [256,...,256] looks similar, and the other folds are similar as well. f4 train_valid_loss

This is for bnn_emb, same as bnn: f4 train_valid_loss

imdb: this is for bnn with 3 layers, but both bnn and bnn_emb for 3 and 5 layers look similar. f4 train_valid_loss

uspt: this is for bnn with 3 layers, but bnn_emb for 3 and 5 layers looks similar as well. bnn with 5 layers is not done yet, but it will probably look the same. f4 train_valid_loss

Please let me know what you think.

VaghehDashti commented 1 year ago

I should also mention that the range of the loss is higher for all three datasets on both models compared to one layer.

hosseinfani commented 1 year ago

@VaghehDashti That does not make sense. Increasing epochs makes it worse??

VaghehDashti commented 1 year ago

@hosseinfani, yes, that is strange, especially for the training set. The only explanation I could come up with after some thought and a bit of searching is that the learning rate may be larger than it should be (I used the same learning rate as with 1 layer).

VaghehDashti commented 1 year ago

Hi @hosseinfani, I ran bnn and bnn_emb with 3 layers [128,64,128] on imdb with learning rate of 0.01 and 0.001. Now training and validation loss for both models decrease after each epoch. However, the performance of both models has decreased significantly. bnn with 0.01: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0 |
| aucroc | 0.488561 |

bnn_emb with 0.01: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.003404 |
| P_10 | 0.002128 |
| recall_2 | 0.002837 |
| recall_5 | 0.005674 |
| recall_10 | 0.007092 |
| ndcg_cut_2 | 0.005218 |
| ndcg_cut_5 | 0.005852 |
| ndcg_cut_10 | 0.006453 |
| map_cut_2 | 0.002837 |
| map_cut_5 | 0.003664 |
| map_cut_10 | 0.003822 |
| aucroc | 0.508516 |

bnn with 0.001: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0 |
| aucroc | 0.512166 |

bnn_emb with 0.001: f4 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.001277 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.004255 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.001861 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000512 |
| aucroc | 0.508119 |

I have started the same experiments for dblp and uspt. I will update when the results are ready. Should I also run the experiments with these learning rates for 5 layers [256,128,64,128,256]?

hosseinfani commented 1 year ago

@VaghehDashti thanks. now we can say overfitting is happening, right? No need to add more layers.

VaghehDashti commented 1 year ago

@hosseinfani, Yes, it is overfitting the training set for imdb. If we see the same results for dblp and uspt, I think we can move forward to the next hyperparameter. I'll put the results here, when they are ready.

hosseinfani commented 1 year ago

@VaghehDashti I had a second look. I think the train loss and valid loss are very close, and sometimes the valid loss is lower. This also does not make sense; it shouldn't be like that. Please study it more. Are they running with a temporal baseline, i.e., testing on the last years?

VaghehDashti commented 1 year ago

@hosseinfani I checked the training/validation losses of the models from our previous experiments, and in all of them the training/validation losses are close, and sometimes the validation loss even goes lower for some folds. Then I checked the training/validation losses of the other folds for the new experiments, and the training loss is lower. Here are some samples:

bnn with lr 0.01: f1 train_valid_loss

bnn_emb with lr 0.001: f2 train_valid_loss

> Are they running with a temporal baseline for test on last years?

No

hosseinfani commented 1 year ago

@VaghehDashti How do you interpret this? The model can generalize very well to the valid set during training but cannot do so on the test set?

VaghehDashti commented 1 year ago

@hosseinfani, After more thinking: if the model were overfitting the training data, the validation loss should have increased in the later epochs. One reason the model performs well on validation but not on the test set is that we take the last year as the test set but shuffle the remaining data into training/validation, so the distributions of the validation and training sets are closer to each other. Another notable point is that the loss is decreasing but is still higher than the loss of the models from our paper's experiments, e.g. the loss for bnn with 1 layer on imdb: image

The loss for bnn_emb with 1 layer on imdb: image

We can see that for bnn_emb with 3 layers the loss is closer to the loss of bnn_emb with 1 layer, and the IR metrics of bnn_emb with 3 layers are better than those of bnn with 3 layers, where the loss is way higher than bnn with 1 layer.
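The split setup described above can be sketched as follows; `temporal_split` and the record shape are hypothetical names for illustration, not OpeNTF's actual code. It is a minimal sketch assuming each record carries a year.

```python
import random

def temporal_split(records, valid_ratio=0.15, seed=0):
    """Hold out the last year as the test set and shuffle the rest into
    train/valid; `records` is a list of (team, year) pairs (hypothetical shape)."""
    last_year = max(year for _, year in records)
    test = [r for r in records if r[1] == last_year]
    rest = [r for r in records if r[1] != last_year]
    random.Random(seed).shuffle(rest)            # train/valid share one distribution
    n_valid = int(len(rest) * valid_ratio)
    return rest[n_valid:], rest[:n_valid], test  # train, valid, test
```

Because train and valid come from the same shuffled pool, their losses track each other, while the temporally held-out test year can follow a different distribution, which matches the observation above.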

hosseinfani commented 1 year ago

@VaghehDashti that's why I asked you the last question. you replied No though!

VaghehDashti commented 1 year ago

@hosseinfani, Oh my bad. I thought you meant if I'm running with streaming learning.

VaghehDashti commented 1 year ago

@hosseinfani, I ran bnn with [128,64,128] with the normal train/valid/test split as we discussed yesterday. Here are the results: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.002222 |
| P_5 | 0.002857 |
| P_10 | 0.002413 |
| recall_2 | 0.00104 |
| recall_5 | 0.003499 |
| recall_10 | 0.005978 |
| ndcg_cut_2 | 0.002161 |
| ndcg_cut_5 | 0.003079 |
| ndcg_cut_10 | 0.004195 |
| map_cut_2 | 0.000715 |
| map_cut_5 | 0.00139 |
| map_cut_10 | 0.001723 |
| aucroc | 0.606182 |

This time, since the distributions of the test set and the train/valid sets are similar, the model performs properly on the test set. However, compared with the results of bnn with 1 layer of 128 nodes, the performance has decreased, which shows that overfitting is happening. Shall I move on to the next hyperparameter using 1 layer of 128 nodes? What should the next hyperparameter be? I was thinking about experimenting with the dimension of the team2vec embeddings. Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti thanks. Foremost, explain your findings about the number of layers and their sizes here in a formal way: what were the settings, the dataset, etc. Then you can go ahead with the #nns.

VaghehDashti commented 1 year ago

@hosseinfani sure. Here is a brief summary of our experimentation until now:

Please let me know if I can move forward with hyperparameter study of number of negative samples.

hosseinfani commented 1 year ago

since bnn is the baseline, I think it's better to start with the number of samplings in bnn without the nns. Then, for the best number of samplings in bnn, we go for nns.

VaghehDashti commented 1 year ago

@hosseinfani sure, I just fixed bnn.py and now we can experiment with the number of Bayesian samplings. I am going to try 5, 10, and 20 samples. Is that okay or do you have other suggestions?

hosseinfani commented 1 year ago

that's fine

VaghehDashti commented 1 year ago

Hi @hosseinfani, Hope you are doing well. Here are the results for bnn and bnn_emb on imdb with #bs (bayesian samples) 1,5,10,20.

The train/validation loss and the results for imdb from the original bnn with #bs = 1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.007994 |
| P_5 | 0.008164 |
| P_10 | 0.007533 |
| recall_2 | 0.003541 |
| recall_5 | 0.009167 |
| recall_10 | 0.017064 |
| ndcg_cut_2 | 0.008046 |
| ndcg_cut_5 | 0.009022 |
| ndcg_cut_10 | 0.012736 |
| map_cut_2 | 0.002746 |
| map_cut_5 | 0.004386 |
| map_cut_10 | 0.0055 |
| aucroc | 0.642866 |

bnn with #bs = 5: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0.000851 |
| P_10 | 0.000851 |
| recall_2 | 0 |
| recall_5 | 0.001418 |
| recall_10 | 0.002837 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0.000998 |
| ndcg_cut_10 | 0.00171 |
| map_cut_2 | 0 |
| map_cut_5 | 0.000473 |
| map_cut_10 | 0.000709 |
| aucroc | 0.539404 |

bnn with #bs=10: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.000711 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000236 |
| aucroc | 0.527167 |

and finally bnn with #bs = 20: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0.000851 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0.001418 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0.000773 |
| ndcg_cut_10 | 0.000773 |
| map_cut_2 | 0 |
| map_cut_5 | 0.000284 |
| map_cut_10 | 0.000284 |
| aucroc | 0.528675 |

bnn_emb with #bs=1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.005106 |
| P_10 | 0.006383 |
| recall_2 | 0.002837 |
| recall_5 | 0.008511 |
| recall_10 | 0.019574 |
| ndcg_cut_2 | 0.003292 |
| ndcg_cut_5 | 0.005923 |
| ndcg_cut_10 | 0.011358 |
| map_cut_2 | 0.001418 |
| map_cut_5 | 0.002813 |
| map_cut_10 | 0.004389 |
| aucroc | 0.518159 |

bnn_emb with #bs=5: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.002128 |
| P_5 | 0.001702 |
| P_10 | 0.001277 |
| recall_2 | 0.001418 |
| recall_5 | 0.002837 |
| recall_10 | 0.004255 |
| ndcg_cut_2 | 0.001646 |
| ndcg_cut_5 | 0.002032 |
| ndcg_cut_10 | 0.002634 |
| map_cut_2 | 0.000709 |
| map_cut_5 | 0.000993 |
| map_cut_10 | 0.001151 |
| aucroc | 0.530013 |

bnn_emb with #bs=10:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.004255 |
| P_5 | 0.001702 |
| P_10 | 0.002979 |
| recall_2 | 0.002837 |
| recall_5 | 0.002837 |
| recall_10 | 0.009362 |
| ndcg_cut_2 | 0.004255 |
| ndcg_cut_5 | 0.003257 |
| ndcg_cut_10 | 0.006238 |
| map_cut_2 | 0.002128 |
| map_cut_5 | 0.002128 |
| map_cut_10 | 0.002943 |
| aucroc | 0.531103 |

bnn_emb with #bs=20: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0 |
| P_5 | 0 |
| P_10 | 0.000426 |
| recall_2 | 0 |
| recall_5 | 0 |
| recall_10 | 0.001418 |
| ndcg_cut_2 | 0 |
| ndcg_cut_5 | 0 |
| ndcg_cut_10 | 0.00063 |
| map_cut_2 | 0 |
| map_cut_5 | 0 |
| map_cut_10 | 0.000177 |
| aucroc | 0.53638 |

The results show that increasing #bs on imdb decreases bnn's performance significantly; for bnn_emb it slightly increases AUC but decreases the IR metrics.

For dblp, the results for bnn and bnn_emb with #bs=5 are ready, but for uspt I only have bnn_emb with #bs=5, not bnn. Since I run the experiments on Compute Canada, I have to specify the job's time limit, but I couldn't predict exactly how much longer training takes with a higher #bs, so I set it to 4 days and it threw an out-of-time error after completing only 2 folds! I have now set it to 8 days, but the job has not started running after more than a day :( Also, if #bs=5 needs 8 days, I don't know how long it will take for #bs=10 and #bs=20. Even dblp with #bs of 10 and 20 will probably need more than a week after the job starts on Compute Canada. Anyway, I'll post the findings from the completed experiments on dblp and uspt. Here are the results for dblp:

bnn with #bs=1:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00057 |
| P_5 | 0.000663 |
| P_10 | 0.00071 |
| recall_2 | 0.000351 |
| recall_5 | 0.000993 |
| recall_10 | 0.002118 |
| ndcg_cut_2 | 0.000538 |
| ndcg_cut_5 | 0.000806 |
| ndcg_cut_10 | 0.00133 |
| map_cut_2 | 0.000242 |
| map_cut_5 | 0.000411 |
| map_cut_10 | 0.000558 |
| aucroc | 0.63521 |

bnn with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.000537 |
| P_5 | 0.000449 |
| P_10 | 0.000414 |
| recall_2 | 0.000329 |
| recall_5 | 0.000698 |
| recall_10 | 0.001275 |
| ndcg_cut_2 | 0.000585 |
| ndcg_cut_5 | 0.000667 |
| ndcg_cut_10 | 0.000932 |
| map_cut_2 | 0.000279 |
| map_cut_5 | 0.000395 |
| map_cut_10 | 0.000473 |
| aucroc | 0.550672 |

bnn_emb with #bs=1: f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.001124 |
| P_5 | 0.00129 |
| P_10 | 0.001251 |
| recall_2 | 0.000668 |
| recall_5 | 0.001909 |
| recall_10 | 0.003699 |
| ndcg_cut_2 | 0.001083 |
| ndcg_cut_5 | 0.001555 |
| ndcg_cut_10 | 0.002397 |
| map_cut_2 | 0.000474 |
| map_cut_5 | 0.000792 |
| map_cut_10 | 0.001033 |
| aucroc | 0.668093 |

bnn_emb with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00044 |
| P_5 | 0.000482 |
| P_10 | 0.000404 |
| recall_2 | 0.000269 |
| recall_5 | 0.000742 |
| recall_10 | 0.001215 |
| ndcg_cut_2 | 0.00048 |
| ndcg_cut_5 | 0.000651 |
| ndcg_cut_10 | 0.000873 |
| map_cut_2 | 0.000232 |
| map_cut_5 | 0.000361 |
| map_cut_10 | 0.000422 |
| aucroc | 0.564039 |

Here again we can see that increasing #bs from 1 to 5 decreases the model's predictive power on dblp. The results of bnn_emb on uspt with #bs = 1 and 5 show the same trend.

uspt's results till now:

bnn_emb with #bs=1:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.003663 |
| P_5 | 0.004123 |
| P_10 | 0.003748 |
| recall_2 | 0.001608 |
| recall_5 | 0.004509 |
| recall_10 | 0.008141 |
| ndcg_cut_2 | 0.003652 |
| ndcg_cut_5 | 0.004531 |
| ndcg_cut_10 | 0.006094 |
| map_cut_2 | 0.001212 |
| map_cut_5 | 0.002027 |
| map_cut_10 | 0.002583 |
| aucroc | 0.698485 |

bnn_emb with #bs=5:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.00071 |
| P_5 | 0.000782 |
| P_10 | 0.000801 |
| recall_2 | 0.000372 |
| recall_5 | 0.001011 |
| recall_10 | 0.002072 |
| ndcg_cut_2 | 0.000697 |
| ndcg_cut_5 | 0.000913 |
| ndcg_cut_10 | 0.001404 |
| map_cut_2 | 0.000272 |
| map_cut_5 | 0.00044 |
| map_cut_10 | 0.000581 |
| aucroc | 0.589078 |

Here are my final thoughts on these results:

hosseinfani commented 1 year ago

@VaghehDashti Thanks. To have a better visual comparison:

this is in contradiction to my knowledge though. Usually increasing bs should have an improving effect up until some point.

VaghehDashti commented 1 year ago

@hosseinfani, The results for imdb:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn #bs=1 | 0.0080 | 0.0082 | 0.0075 | 0.0035 | 0.0092 | 0.0171 | 0.0080 | 0.0090 | 0.0127 | 0.0027 | 0.0044 | 0.0055 | 0.6429 |
| bnn #bs=5 | 0.0000 | 0.0009 | 0.0009 | 0.0000 | 0.0014 | 0.0028 | 0.0000 | 0.0010 | 0.0017 | 0.0000 | 0.0005 | 0.0007 | 0.5394 |
| bnn #bs=10 | 0.0000 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.0014 | 0.0000 | 0.0000 | 0.0007 | 0.0000 | 0.0000 | 0.0002 | 0.5272 |
| bnn #bs=20 | 0.0000 | 0.0009 | 0.0004 | 0.0000 | 0.0014 | 0.0014 | 0.0000 | 0.0008 | 0.0008 | 0.0000 | 0.0003 | 0.0003 | 0.5287 |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=5 | 0.0021 | 0.0017 | 0.0013 | 0.0014 | 0.0028 | 0.0043 | 0.0016 | 0.0020 | 0.0026 | 0.0007 | 0.0010 | 0.0012 | 0.5300 |
| bnn_emb #bs=10 | 0.0043 | 0.0017 | 0.0030 | 0.0028 | 0.0028 | 0.0094 | 0.0043 | 0.0033 | 0.0062 | 0.0021 | 0.0021 | 0.0029 | 0.5311 |
| bnn_emb #bs=20 | 0.0000 | 0.0000 | 0.0004 | 0.0000 | 0.0000 | 0.0014 | 0.0000 | 0.0000 | 0.0006 | 0.0000 | 0.0000 | 0.0002 | 0.5364 |

the results for dblp:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn #bs=1 | 0.0006 | 0.0007 | 0.0007 | 0.0004 | 0.0010 | 0.0021 | 0.0005 | 0.0008 | 0.0013 | 0.0002 | 0.0004 | 0.0006 | 0.6352 |
| bnn #bs=5 | 0.0005 | 0.0004 | 0.0004 | 0.0003 | 0.0007 | 0.0013 | 0.0006 | 0.0007 | 0.0009 | 0.0003 | 0.0004 | 0.0005 | 0.5507 |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=5 | 0.0004 | 0.0005 | 0.0004 | 0.0003 | 0.0007 | 0.0012 | 0.0005 | 0.0007 | 0.0009 | 0.0002 | 0.0004 | 0.0004 | 0.5640 |

and the results of bnn_emb with #bs=1&5 on uspt:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0037 | 0.0041 | 0.0037 | 0.0016 | 0.0045 | 0.0081 | 0.0037 | 0.0045 | 0.0061 | 0.0012 | 0.0020 | 0.0026 | 0.6985 |
| bnn_emb #bs=5 | 0.0007 | 0.0008 | 0.0008 | 0.0004 | 0.0010 | 0.0021 | 0.0007 | 0.0009 | 0.0014 | 0.0003 | 0.0004 | 0.0006 | 0.5891 |

> draw the train/valid diagram in one figure for each dataset

I will have to write some code to create the figure you asked for. I will work on it, but it will take some time; I will update here later.

> this is in contradiction to my knowledge though. Usually increasing bs should have an improving effect up until some point.

I thought the same way and have not been able to come up with a reason so far. I will think more about why this is happening and let you know if I come up with anything. I would appreciate any ideas and help :)

VaghehDashti commented 1 year ago

@hosseinfani, Here are the train/val figures: dblp: it looks like the model is overfitting with #bs=5 (lower training loss and higher valid loss)

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

imdb: here again it looks like it is overfitting with #bs=5, but for #bs 10 and 20 the model's training loss does not change much during training and the validation loss fluctuates.

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

uspt: not sure what is going on.

l 128 lr0 1 b128 e20 nns3 nsunigram_b f2 train_valid_loss

Please let me know what you think.

hosseinfani commented 1 year ago

@VaghehDashti Thank you. Please

I have to look into the code and see what's going on.

VaghehDashti commented 1 year ago

Hi @hosseinfani, I pushed the updated code to the misc folder, based on what you said, i.e. the curves are now the average over folds and train/val share the same color for each #bs. To run the code, you need to cd to src and then run:

python -u misc/report_loss.py

Here are the results: l 128 lr0 1 b128 e20 nns3 nsunigram_b train_valid_loss

VaghehDashti commented 1 year ago

Hi @hosseinfani, Here is a brief explanation of how Bayesian neural networks work. In forward propagation, for each input instance the model samples #bs sets of weights and hence predicts #bs outputs, i.e. for each expert the model predicts #bs probabilities; it then averages the #bs probabilities of each expert and uses the averaged probabilities to calculate the loss for that instance. After #batch_size instances, the model averages the losses and uses that average for the backpropagation step. This is a bottleneck for large datasets such as uspt, where the shape of the weights for the first layer is 2x67315x128 (one matrix for the means and one for the standard deviations). E.g., if #bs=20, for each training instance (somewhere around 0.85 * 152317 of them), the model has to sample 2x67315x128 numbers 20 times, and this is just for the weight edges between the input layer and the hidden layer. As a result, training can take several weeks. Finally, as we discussed, we're going to use bnn_emb, where the matrices will be of size (100,128) and (128,#experts), on dblp and imdb, where the #instances are fewer than uspt, with #bs up to 20.
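The forward pass described above can be sketched for a single Bayesian layer as follows; `bayesian_forward` and the sigmoid output are illustrative assumptions, a minimal numpy sketch rather than the actual bnn.py implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_forward(x, w_mu, w_sigma, n_bs):
    """Draw n_bs weight samples from N(w_mu, w_sigma), compute n_bs sigmoid
    outputs per instance, and return their average (which goes into the loss)."""
    probs = []
    for _ in range(n_bs):
        w = w_mu + w_sigma * rng.standard_normal(w_mu.shape)  # one weight sample
        probs.append(1.0 / (1.0 + np.exp(-(x @ w))))          # sigmoid outputs
    return np.mean(probs, axis=0)
```

The cost grows linearly with n_bs because every pass re-samples the full weight matrix, which is why a 2x67315x128 first layer makes uspt so slow.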

Here are the train/val loss of bnn_emb on dblp and imdb with #bs up to 10. After we have the results for #bs = 20, I will update the issue.

l 128 lr0 1 b128 e20 nns3 nsunigram_b train_valid_loss

Performance on dblp:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=5 | 0.0004 | 0.0005 | 0.0004 | 0.0003 | 0.0007 | 0.0012 | 0.0005 | 0.0007 | 0.0009 | 0.0002 | 0.0004 | 0.0004 | 0.5640 |
| bnn_emb #bs=10 | 0.0004 | 0.0004 | 0.0003 | 0.0003 | 0.0006 | 0.0009 | 0.0004 | 0.0006 | 0.0007 | 0.0002 | 0.0003 | 0.0004 | 0.5507 |

Performance on imdb:

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=5 | 0.0021 | 0.0017 | 0.0013 | 0.0014 | 0.0028 | 0.0043 | 0.0016 | 0.0020 | 0.0026 | 0.0007 | 0.0010 | 0.0012 | 0.5300 |
| bnn_emb #bs=10 | 0.0043 | 0.0017 | 0.0030 | 0.0028 | 0.0028 | 0.0094 | 0.0043 | 0.0033 | 0.0062 | 0.0021 | 0.0021 | 0.0029 | 0.5311 |

Here is an explanation of the results: the model is trying to decrease the loss on the training set, and with a higher #bs the distribution of weights of bnn(_emb) overfits the distribution of the training set and hence cannot generalize well to the validation/test sets.

Please let me know what you think :)

hosseinfani commented 1 year ago

@VaghehDashti I had a look at the code. Why do we still do only 1 sample at validation? https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L139

We can do s samples like in training so the loss comparison becomes fair. Also, I think we need to change the test code to average over s predictions.

I ran on the dblp toy and here are the results before and after the fix. After the fix, up until some s, we see improvement. Also, for all s, although overfitting happens, the valid loss range is lower after the fix.

https://docs.google.com/document/d/1T9uV--4afs3qp0GoSMpOItaqbTsqQ-RjYOiwMKbU9Gk/edit?usp=sharing
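The proposed fix, s samples at validation/test with averaged predictions, can be sketched as follows; `mc_predict` and `sample_forward` are hypothetical names, and the stochastic model is simulated with additive noise rather than real weight sampling.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_predict(sample_forward, x, s):
    """Average s stochastic forward passes; `sample_forward` is assumed to
    draw fresh weights on every call, as a Bayesian layer does."""
    return np.mean([sample_forward(x) for _ in range(s)], axis=0)
```

Averaging over s passes shrinks the prediction variance roughly by a factor of sqrt(s), which is why s-sample validation losses become both lower and directly comparable to the s-sample training loss.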

VaghehDashti commented 1 year ago

@hosseinfani, I think I followed Josh Feldman's blog/code when I hard-coded #bs=1 for validation. Unfortunately, I cannot confirm because, as you know, his website is down. I will look at other implementations of bnn, see if they do the same thing, and let you know.

hosseinfani commented 1 year ago

@VaghehDashti no need to check; as seen, the results become better. Go ahead with s samplings during validation and averaging during the test.

VaghehDashti commented 1 year ago

@hosseinfani, Sure. I was thinking we can use #bs={3,5,10} instead of {5,10,20} to have the results sooner. What do you think?

hosseinfani commented 1 year ago

@VaghehDashti agree. But first see the result of toy datasets after averaging the predictions on test set.

VaghehDashti commented 1 year ago

@hosseinfani here are the results of bnn with #bs=3 and 20 epochs on toy-dblp without averaging the predictions:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.055556 |
| P_5 | 0.188889 |
| P_10 | 0.166667 |
| recall_2 | 0.055556 |
| recall_5 | 0.444444 |
| recall_10 | 0.768519 |
| ndcg_cut_2 | 0.055556 |
| ndcg_cut_5 | 0.267414 |
| ndcg_cut_10 | 0.398129 |
| map_cut_2 | 0.041667 |
| map_cut_5 | 0.156173 |
| map_cut_10 | 0.227513 |
| aucroc | 0.466864 |

with averaging the predictions:

f2 train_valid_loss

| metric | mean |
| -- | -- |
| P_2 | 0.055556 |
| P_5 | 0.088889 |
| P_10 | 0.15 |
| recall_2 | 0.046296 |
| recall_5 | 0.212963 |
| recall_10 | 0.694444 |
| ndcg_cut_2 | 0.055556 |
| ndcg_cut_5 | 0.144127 |
| ndcg_cut_10 | 0.341101 |
| map_cut_2 | 0.037037 |
| map_cut_5 | 0.091204 |
| map_cut_10 | 0.181338 |
| aucroc | 0.391124 |

The training and validation losses are the same in both cases as expected (both use #bs for validation), but strangely the results are worse when we predict #bs times and average! I have pushed the code; please review and let me know if it looks okay so I can start re-running the experiments on the real datasets.

hosseinfani commented 1 year ago

@VaghehDashti Please debug the code, line by line, see the predictions at each iteration, and find where there is a problem.

VaghehDashti commented 1 year ago

@hosseinfani I just finished debugging the code line by line and couldn't find any problems. The most important thing, which I tested several times, is that when predicting #bs outputs in training/validation/test, the weights change every time, the outputs differ accordingly, and all of the shapes are correct.

VaghehDashti commented 1 year ago

@hosseinfani Here are the loss and results of bnn_emb on dblp and imdb with averaging predictions with #bs=3: dblp: f2 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=3 | 0.0020 | 0.0019 | 0.0018 | 0.0012 | 0.0027 | 0.0054 | 0.0021 | 0.0025 | 0.0037 | 0.0009 | 0.0013 | 0.0017 | 0.6656 |

imdb:

f2 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0043 | 0.0051 | 0.0064 | 0.0028 | 0.0085 | 0.0196 | 0.0033 | 0.0059 | 0.0114 | 0.0014 | 0.0028 | 0.0044 | 0.5182 |
| bnn_emb #bs=3 | 0.0021 | 0.0026 | 0.0038 | 0.0014 | 0.0043 | 0.0105 | 0.0026 | 0.0038 | 0.0069 | 0.0014 | 0.0022 | 0.0031 | 0.5264 |

uspt's results are not ready yet. For dblp the IR metrics have increased, but for imdb they have decreased; AUC moved in the opposite direction in each case.

The significant drop in performance on toy-dblp didn't happen on either dblp or imdb.

Update: uspt's results are here as well:

f3 train_valid_loss

|  | P_2 | P_5 | P_10 | rec_2 | rec_5 | rec_10 | ndcg_2 | ndcg_5 | ndcg_10 | map_2 | map_5 | map_10 | aucroc |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| bnn_emb #bs=1 | 0.0011 | 0.0013 | 0.0013 | 0.0007 | 0.0019 | 0.0037 | 0.0011 | 0.0016 | 0.0024 | 0.0005 | 0.0008 | 0.0010 | 0.6681 |
| bnn_emb #bs=3 | 0.0054 | 0.0045 | 0.0039 | 0.0028 | 0.0056 | 0.0094 | 0.0054 | 0.0056 | 0.0074 | 0.0021 | 0.0029 | 0.0034 | 0.6844 |

uspt shows similar behavior to dblp, as expected.

hosseinfani commented 1 year ago

> I just finished debugging the code line by line and I couldn't find any problems. The most important thing that I tested several times is that when predicting #bs outputs in either training/validation/test the weights change every time and have different outputs and all of the shapes are correct.

I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.
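The suggested diagnostic can be sketched as follows; `prediction_spread` is a hypothetical helper, and the stochastic model is simulated here with a noisy sigmoid rather than real Bayesian weight sampling.

```python
import numpy as np

rng = np.random.default_rng(2)

def prediction_spread(sample_forward, x, n_bs):
    """Stack n_bs stochastic predictions for the instances in x and return
    the per-expert mean and standard deviation across the samples."""
    preds = np.stack([sample_forward(x) for _ in range(n_bs)])  # (n_bs, n_experts)
    return preds.mean(axis=0), preds.std(axis=0)
```

A narrow std means the sampled networks agree on an expert's probability; a wide std signals unstable predictions under weight sampling.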

hosseinfani commented 1 year ago

Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.

VaghehDashti commented 1 year ago

> I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.

@hosseinfani, The standard deviations are between 0.005 and 0.46 for each expert after 10 epochs on toy-dblp.

VaghehDashti commented 1 year ago

> Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.

@hosseinfani, I believe the drop in loss is due to this line, where we decrease the learning rate when the validation loss does not change significantly for 10 epochs.

https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L111
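The behaviour behind that line can be sketched with a minimal reduce-on-plateau scheduler; this is an illustrative reimplementation of the idea with made-up defaults, not the code behind the link.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` once the monitored loss has not
    improved by more than `eps` for `patience` consecutive epochs."""
    def __init__(self, lr, patience=10, factor=0.5, eps=1e-4):
        self.lr, self.patience, self.factor, self.eps = lr, patience, factor, eps
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, valid_loss):
        if valid_loss < self.best - self.eps:     # significant improvement
            self.best, self.bad_epochs = valid_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:  # plateau reached: cut the lr
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

PyTorch ships this behaviour as `torch.optim.lr_scheduler.ReduceLROnPlateau`; a sudden drop in loss right after the patience window is the signature of the learning-rate cut.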

hosseinfani commented 1 year ago

> I know. But you can pick two random test instances and print the predictions for each #bs iteration. Then calculate the average and std to see whether the std is wide or narrow. It should be narrow, I think.
>
> @hosseinfani, The standard deviations are between 0.005 and 0.46 for each expert after 10 epochs on toy-dblp.

Can you draw a min-max-avg chart (x: experts, y: probability, sorted by decreasing average probability), something like this: image
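The data behind such a chart can be computed as follows; `min_max_avg` is a hypothetical helper, and the plotting call in the comment is only indicative.

```python
import numpy as np

def min_max_avg(preds):
    """preds: (n_bs, n_experts) sampled probabilities for one instance.
    Return per-expert (min, avg, max), sorted by decreasing average,
    ready for a band chart."""
    mn, avg, mx = preds.min(axis=0), preds.mean(axis=0), preds.max(axis=0)
    order = np.argsort(-avg)                     # decreasing average probability
    return mn[order], avg[order], mx[order]

# e.g. with matplotlib: plt.fill_between(range(len(avg)), mn, mx, alpha=.3); plt.plot(avg)
```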

hosseinfani commented 1 year ago

> Please explain the reason behind the sudden drop in loss after the epoch here by mentioning the code line.
>
> @hosseinfani, I believe the drop in loss is due to this line where we decrease the learning rate when the validation loss does not change significantly after 10 epochs.
>
> https://github.com/fani-lab/OpeNTF/blob/148c1c2defe1176563f162ad159b2ffe0af15ecc/src/mdl/bnn.py#L111

Can you try patience=2 but the same 20 epochs on imdb, or any dataset that gives you results faster?

VaghehDashti commented 1 year ago

> Can you try patience=2 but the same 20 epochs on imdb, or any dataset that gives you results faster?

@hosseinfani, Sure, I put it on imdb. It will take ~1.5 days. I'll update here.

> can you draw min-max-avg chart, x: experts, y: prob, sorted on decreasing avg prob: sth like this: image

Sure, I will work on it.

VaghehDashti commented 1 year ago

@hosseinfani Here is the plot; although it is not sorted by decreasing avg probability, I think it will still give you the information you need.

This is based on the predictions on 1 instance with #bs=5: f2 test min-max-avg-plot

Here is the average on 5 instances of test set for toy-dblp with #bs=5: f2 test min-max-avg-plot

hosseinfani commented 1 year ago

@VaghehDashti Awesome. Can we have the second one for #bs=3 and 10 also? If we can show that the larger the #bs, the wider the min-max band, then we can explain why a larger #bs leads to the poor test results.

VaghehDashti commented 1 year ago

@hosseinfani Sorry, that was for #bs = 3. Here is #bs = 5:

f2 test min-max-avg-plot

bs=10:

f2 test min-max-avg-plot

bs=50:

f2 test min-max-avg-plot

As you said, we can see why a high #bs can lead to lower performance.