adbailey1 / DepAudioNet_reproduction

Reproduction of DepAudioNet by Ma et al. (DepAudioNet: An Efficient Deep Model for Audio based Depression Classification, https://dl.acm.org/doi/10.1145/2988257.2988267, AVEC 2016)

Inference of Development Set #4

Closed kingformatty closed 3 years ago

kingformatty commented 3 years ago

Hi Andrew, thanks for sharing the code; it has really helped me a lot in my research.

I am trying to reproduce the DepAudioNet results shown in Table 2. Here I have a question about inference on the development set.

As you stated in the README, the results on the dataset's validation set are

F1: 0.725(ND) 0.52(D) 0.62(Avg),

This result is the same as the second row of Table 2, and I reproduced it with similar (though not exactly matching) results:

F1: 0.74(ND) 0.59(D) 0.67(Avg)

My questions are the following:

  1. Are the results in the README obtained by running the test script? Or did you just take the best performance among the 5 models (5 being exp_runthrough) directly from the training log, without running the validation set again?

  2. If inference is done by running the test script, then there is a problem with the generator declaration at row #1180 in main1.py. If data_type is set to "test" during testing and SPLIT_BY_GENDER is set to True, then "generators" at row #1120 contains both a male and a female generator, which causes the following error (a minimal sketch of what I suspect is happening follows the traceback):

Current directory exists but experiment not finished
Loading from checkpoint: 28
The dimensions of the test features are: (7993, 40, 120)
The number of class zero and one files in the test split after segmentation are 5710, 2283
Traceback (most recent call last):
  File "main1.py", line 1359, in <module>
    test()
  File "main1.py", line 1185, in test
    generator,
UnboundLocalError: local variable 'generator' referenced before assignment
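To illustrate, here is a minimal sketch of the control flow I suspect is happening. This is not the actual main1.py code; every name in it is made up for illustration.

```python
def run_inference(gen):
    # Stand-in for the real evaluation loop.
    return list(gen)

def test(data_type, split_by_gender):
    # Made-up generators standing in for the male/female and combined data generators.
    male_gen, female_gen, single_gen = iter([0, 1]), iter([1, 1]), iter([0, 1, 1])

    if split_by_gender:
        generators = (male_gen, female_gen)  # two generators; 'generator' is never bound
    else:
        generator = single_gen               # 'generator' only bound in this branch

    if data_type == 'test':
        # With split_by_gender=True this line raises:
        # UnboundLocalError: local variable 'generator' referenced before assignment
        return run_inference(generator)
    return [run_inference(g) for g in generators]

try:
    test('test', split_by_gender=True)
except UnboundLocalError as err:
    print(err)
```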

  3. Is there any specific reason that the model is not evaluated on the test set? If my implementation above is correct, the performance on the test set using majority voting (sketched below) is not very good:

F1: 0.68(ND) 0.38(D) 0.53(Avg)
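For reference, this is how I compute the file-level decision by majority voting; it is a minimal sketch rather than the repository's exact code, and the segment predictions are made up.

```python
import numpy as np

def majority_vote(segment_preds):
    """File-level label = most frequent 0/1 label over that file's segments."""
    segment_preds = np.asarray(segment_preds)
    # Ties are broken towards class 1 here; the choice is arbitrary.
    return int(2 * segment_preds.sum() >= len(segment_preds))

# Made-up example: 7 segments of one file, 4 predicted non-depressed -> label 0.
print(majority_vote([0, 0, 1, 0, 1, 0, 1]))  # 0
```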

Please clarify the above questions when you are free. Our group may use this code for a long time for further study, and I would appreciate it if these problems could be resolved.

Thank you very much.

Best,
Jinhan

adbailey1 commented 3 years ago

Hi Jinhan,

My pleasure!

1) I obtained the results by training my model from scratch (running it a total of 5 times to average out variance in the starting weights). Then, in the final log file, the last list shows the "best" outputs from all of the trained models, where the very last line is the averaged score.

I then tested this by running in test mode: python main1.py test --validate --cuda --vis.

2) I am not getting this error. I just tested my best model with 'SPLIT_BY_GENDER': False and 'SPLIT_BY_GENDER': True, again using the same code as above: python main1.py test --validate --cuda --vis.

NOTE: I think I have figured out why this isn't working: you were running the models on the test set rather than the validation set. I did not have the labels for the test set when I created this tool, as the AVEC competition organisers were still running the competition, so the code was supposed to work in the case of no labels being present. However, I have now pushed updates for this tool, and you should be able to run either setting of 'SPLIT_BY_GENDER' in test mode with labels.

3) I never tested on the actual test set as labels were not available while the AVEC competition was still running. However, if you notice a drop in performance on the actual test set, this is to be expected given the limited size of the dataset. Also, we are in fact overfitting our model to the validation set when we build and adapt our models around the performance of a static validation set, which is why k-fold cross-validation is preferred. I haven't implemented cross-validation here as the DepAudioNet paper does not use it. However, I do have some old code that did this for this repo, which I can share if that's helpful.
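If you do want to go down that route, the idea is just the standard k-fold loop. Here is a rough sketch (not the old code I mentioned), using sklearn's KFold with placeholder speaker IDs and a placeholder train/score function:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
speaker_ids = np.arange(20)  # stand-in for the real speaker IDs

def train_and_score(train_ids, val_ids):
    # Placeholder for training the model on train_ids and computing F1 on val_ids.
    return rng.uniform(0.5, 0.7)

# Each fold takes a turn as the validation set; the final score is the average.
scores = [train_and_score(speaker_ids[tr], speaker_ids[va])
          for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(speaker_ids)]
print(f"mean validation F1 across folds: {np.mean(scores):.3f}")
```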

Hope this helps, let me know if there is anything else I can help with.

Andrew

kingformatty commented 3 years ago

Thank you so much for your prompt reply. I appreciate it.

Here I have another question after your commit. Previously I was using majority-voting inference, but it seems that majority voting across models was removed in your last commit; instead, the current performance evaluation during test averages the scores from the 5 models. So my question is: which method did you use to get the results shown in the paper and the README?

Looking forward to your reply. Thank you very much.

adbailey1 commented 3 years ago

Great question.

OK, for my paper I was testing the variation of the model, which is why I repeated my experiment 5 times. In each iteration of training, I performed a majority vote on every validation file to obtain my final results, and these results were used to pick the best epoch. The experiment was run another 4 times in the same way. This meant that I had a best epoch from each of the 5 iterations; each best epoch has an accuracy, F-score, etc., and I averaged these to get the model's performance for the paper. Again, this was all done on the validation set, not the test set.
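Put another way, with made-up numbers just to show the arithmetic (these are not real results):

```python
import numpy as np

# Hypothetical per-run best-epoch F1 scores: each run's best epoch is chosen by
# majority-vote performance on the validation set, and the reported number is
# the mean over the 5 runs.
best_epoch_f1 = {"run_1": 0.61, "run_2": 0.64, "run_3": 0.59, "run_4": 0.66, "run_5": 0.60}
print(f"reported F1 = {np.mean(list(best_epoch_f1.values())):.3f}")
```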

This isn't practical for real-world applications though (for example, using the test set or a real-time application), so here we want to either select the single best model and run the test set through that, or push the test set through all 5 iterations of our model and somehow obtain a single result for each file (as you mentioned, I was previously using majority vote, but you could use averaging instead). I have updated the code to add this for test mode when running only on the test data.
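Roughly, the two options look like this for a single test file (made-up scores, not the exact updated code):

```python
import numpy as np

# Hypothetical depression scores for one test file from the 5 trained models.
model_scores = np.array([0.62, 0.41, 0.55, 0.48, 0.71])

# Option 1: majority vote over each model's hard decision.
majority = int((model_scores > 0.5).sum() > len(model_scores) / 2)
# Option 2: average the scores across models, then threshold once.
averaged = int(model_scores.mean() > 0.5)

print(f"majority vote: {majority}, score averaging: {averaged}")
```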

Obviously, this is just a tool for anyone to use, so please feel free to download it and mess around with it to fit your experiments.

Hope this is helpful and again, let me know if there is anything else I can help with.

Andrew

kingformatty commented 3 years ago

Thank you very much for your kind clarification and feedback. I will close the issue.