hamedwaezi01 opened 1 year ago
For the last few days, I have been getting some suspiciously wrong metric values and have tried to figure out the reason. I thought it was the model configuration itself but later realized the library that was used for the calculation was the problem. I fixed this issue by calculating the metrics, including accuracy, recall, precision, and area under the ROC curve manually.
Additionally, I updated the K-fold cross-validation to Stratified K-fold cross-validation; the former splits the data without keeping the distribution of labels, while the latter considers the label distribution.
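Roughly, the change behaves like this (toy labels for illustration, not the project's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20% positive, illustrative only.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)

# Plain KFold splits by position and ignores labels, so a fold's positive
# ratio can drift. StratifiedKFold keeps ~20% positives in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    assert y[val_idx].sum() == 4  # 20 positives spread evenly over 5 folds
```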
You can find some code that balances the datasets according to the specified ratio of positives to negatives. I uploaded a balanced dataset with a ratio of 0.4 to the Teams channel of the Osprey Project.
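The balancing idea can be sketched as undersampling negatives until positives make up the requested share; the function name and record shape below are illustrative, not the project's actual API:

```python
import random

def balance_by_ratio(records, ratio=0.4, seed=0):
    """Undersample negatives so positives make up `ratio` of the result.

    `records` are (features, label) pairs; this shape is assumed for
    illustration only.
    """
    pos = [r for r in records if r[1] == 1]
    neg = [r for r in records if r[1] == 0]
    # Solve positives / (positives + kept_negatives) == ratio for kept_negatives.
    n_neg = round(len(pos) * (1 - ratio) / ratio)
    rng = random.Random(seed)
    kept = pos + rng.sample(neg, min(n_neg, len(neg)))
    rng.shuffle(kept)
    return kept
```

For example, with 40 positives and ratio 0.4, the function keeps 60 negatives, so the result has 100 records.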
Also, according to the stats that were retrieved earlier, there was a demand to filter undesired records. Right now, one of the datasets passes only conversations with exactly 2 chatters and more than 3 messages. Any new dataset can easily apply this feature by overriding the BaseDataset.filter_records method.
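A subclass hook along these lines is what the override describes; the record fields (`chatters`, `messages`) are assumed for illustration:

```python
class BaseDataset:
    def filter_records(self, records):
        # Default: keep everything; subclasses narrow this down.
        return records

class ConversationDataset(BaseDataset):
    def filter_records(self, records):
        # Keep only 2-party conversations with more than 3 messages.
        return [
            r for r in records
            if len(r["chatters"]) == 2 and len(r["messages"]) > 3
        ]

records = [
    {"chatters": ["a", "b"], "messages": ["hi", "hello", "asl?", "no"]},
    {"chatters": ["a", "b", "c"], "messages": ["hi"] * 5},
    {"chatters": ["a", "b"], "messages": ["hi", "hello"]},
]
kept = ConversationDataset().filter_records(records)
assert len(kept) == 1  # only the 2-party, >3-message conversation survives
```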
@hamedwaezi01 as we talked today, please use a well-known library for metric calculation like scikit-learn or pytrec, ... nobody relies on our code for calculating metrics, which is a very critical point of evaluation and research.
As we talked, I ran some mock prediction-target pairs and updated the code. It is now working.
@hamedwaezi01 what does "working" mean :D Please put some results/figures to show that.
I used the torchmetrics lib again, but this time I ran some small mock prediction-target pairs, like extreme cases where everything is wrong or right or a single entry is wrong, and got the expected results.
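The same kind of extreme-case sanity check can be reproduced with scikit-learn (mock values only, not the project's data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

target = [1, 1, 0, 0]

# Everything right.
assert accuracy_score(target, [1, 1, 0, 0]) == 1.0
# Everything wrong.
assert accuracy_score(target, [0, 0, 1, 1]) == 0.0
# One false negative: recall drops to 0.5, precision stays 1.0.
assert recall_score(target, [1, 0, 0, 0]) == 0.5
assert precision_score(target, [1, 0, 0, 0]) == 1.0
```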
I currently experimented with 3 models (Feedforward, RNN, CNN) on a balanced (predatory/all == 0.4) toy test. I preserved the cross-validation splits and ran the same records on each model. We expect that the models have similar relative performance if we run them on the real dataset. The resulting metrics for each model on the test are as follows:
Feedforward: AUCROC: 0.7933513 | AUCPR: 0.6991223 | accuracy: 0.4496788 | precision: 0.9438503 | recall: 0.4172577
Simple RNN: AUCROC: 0.5137964 | AUCPR: 0.6405249 | accuracy: 0.6070664 | precision: 0.0213904 | recall: 0.8888889
Simple CNN: AUCROC: 0.5048176 | AUCPR: 0.4587629 | accuracy: 0.4004283 | precision: 1.0000000 | recall: 0.4004283
As we expected, Feedforward outperformed the other two models. I believe it is because the hyperparameters for RNN and CNN still need tuning. Beyond the simple RNN, we can also experiment with its more complex siblings, like LSTM and GRU.
NOTES: You can find the debug-level logs of this experiment under this file: logs/persist/05-09-2023-13-26-41.log
There is a problem with the RNN giving NaN values (this issue), and I am now experimenting with the model on the balanced NON-toy dataset to see if it helps or not. I also played with how the loss value is applied (using BCEWithLogitsLoss, for example, to be numerically more stable); I still got a NaN in one run. I will let you know about the RNN problem. If you have any ideas, please drop a comment here or on the other issue for RNN.
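For context, here is a minimal pure-Python sketch of the trick PyTorch's BCEWithLogitsLoss is based on: fusing the sigmoid into the loss via a log-sum-exp form so the log never sees an underflowed probability. The function names are mine, for illustration:

```python
import math

def bce_naive(x, y):
    # Sigmoid first, then log: for large |x|, 1 - p underflows to 0.0
    # and log(0) raises / produces -inf.
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(x, y):
    # Stable fused form: max(x, 0) - x*y + log(1 + exp(-|x|)).
    # Never takes the log of an underflowed value.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Large logit with the "wrong" label: the naive form blows up,
# the fused form returns the correct loss (~40).
try:
    bce_naive(40.0, 0)        # log(1 - sigmoid(40)) -> log(0.0)
    naive_ok = True
except ValueError:
    naive_ok = False

assert not naive_ok
assert math.isclose(bce_with_logits(40.0, 0), 40.0, rel_tol=1e-6)
```

This stabilizes the loss itself, though it would not by itself rule out NaNs coming from exploding recurrent gradients.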
@hamedwaezi01 thanks for the update.
Thanks. I have been trying some configurations, and before implementing big changes, I thought of trying different optimizers. Since the RNN does not work well with the toy set, I am running it on a larger balanced dataset; so far, the model is learning according to the recall and precision metrics, but they could be more stable.
About the curves, I am saving them, and I will post them in future updates.
Another thing about the built-in PyTorch RNN is that it's very slow. I plan to implement a basic but faster RNN model soon.
@hamedwaezi01 thanks for the update. we need to talk. don't go with the rnn implementation please. find why it's slow. does it engage the gpu? we don't reinvent the wheel unless it's absolutely necessary 😀
Actually, it does use the dedicated GPU almost to the fullest extent. The recurrent models do not support sparse matrices, and I think that is affecting the speed.
BTW, my first run of RNN with Adam optimizer is finished. The evaluation metrics for the test set are:
test set -> AUCROC: 0.8691356 | AUCPR: 0.8675280 | accuracy: 0.8809677 | precision: 0.7524753 | recall: 0.9376459
NOTE: Let's not forget that the datasets are balanced with a ratio of 0.4 (predatory/all)
You can find the logs here. The figures for precision-recall and TPR-FPR curves are as follows:
I ran the LSTM module yesterday. I used an Adam optimizer and a ReduceLROnPlateau scheduler. The log file for this session can be found here, where you can also find the parameters for the optimizer and scheduler. The test results for this session and the previous RNN session are as follows, respectively:
LSTM test set -> AUCROC: 0.9208137 | AUCPR: 0.9174601 | accuracy: 0.8924213 | precision: 0.9135670 | recall: 0.8334961
RNN test set -> AUCROC: 0.8691356 | AUCPR: 0.8675280 | accuracy: 0.8809677 | precision: 0.7524753 | recall: 0.9376459
The main differences between the LSTM and the RNN runs are in the loss curves and the higher AUCROC value. I am unsure how meaningful it is, but the LSTM's loss did not fluctuate as much as the RNN's.
Simultaneously, I read and learned more about transformers and attention mechanisms while the model ran.
Right now, I am running the GRU. A couple of forums noted that LSTM usually outperforms GRU, so it is more of an experiment.
Following is the loss value figure per epoch for the best fold of this session:
@hamedwaezi01 thanks for the update. looks like we're on the right track. just a quick issue: for the loss values of fold2 (last figure), shouldn't the legends be switched?
Regarding the legends, I think you were concerned about their slope or their value; am I correct? If that is the case, the legends are correct.
Also, I want to report the GRU, but before that, I should rerun the LSTM again so I can compare their results in a better way.
@hamedwaezi01 how do the training samples (red), which are seen by the model, have a larger loss than the validation (unseen) data?
The loss function uses sum as the reduction method, and because of the cross-validation, the validation set is much smaller than the training set. As a result, the summed validation loss comes out smaller than that of the training set.
@hamedwaezi01 so, they are not comparable this way. can you make them comparable by averaging or sth?
Yes, we can average them. I usually compare the slopes of the figures, but it makes sense to normalize them; it will give more insight. I will do it for future sessions.
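The normalization is just dividing each epoch's summed loss by the number of samples in that split; the numbers below are made up to show how the sum-reduced view can be misleading:

```python
def mean_loss(summed_loss, n_samples):
    """Convert a sum-reduced epoch loss into a per-sample mean."""
    return summed_loss / n_samples

# Illustrative numbers: with reduction='sum', the larger training set
# reports a bigger total even when its per-sample loss is lower.
train_sum, n_train = 450.0, 900   # e.g. 9 folds' worth of training data
val_sum, n_val = 60.0, 100        # 1 fold of validation data

assert train_sum > val_sum                   # misleading raw comparison
assert mean_loss(train_sum, n_train) == 0.5  # per-sample view
assert mean_loss(val_sum, n_val) == 0.6      # validation is actually worse
```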
Results of all the recurrent models:
GRU: test set -> AUCROC: 0.9441823 | AUCPR: 0.8956320 | accuracy: 0.8976665 | precision: 0.9545090 | recall: 0.8194349
LSTM: test set -> AUCROC: 0.9675696 | AUCPR: 0.9487866 | accuracy: 0.8999144 | precision: 0.9753813 | recall: 0.8121658
RNN: test set -> AUCROC: 0.8499330 | AUCPR: 0.8026983 | accuracy: 0.8600942 | precision: 0.8142895 | recall: 0.8323304
We can easily see that LSTM and GRU have better results in almost every metric. Although the validation-training losses of these two models show some overfitting, they outperformed the RNN model. The overfitting might be handled by regularization and similar approaches. I am trying to run other variants of the dataset (the ones with a ratio of 0.3, for example), but I am facing some resource-related problems. I might have to run it on a server. I will make sure to keep you posted.
Meanwhile, I would look into outlier detection literature as we have spoken before.
Figures of loss per epoch in the training phase of each chosen model
GRU
LSTM
simple RNN
@hamedwaezi01 now, the results and figures make sense :) the last figure (f3) is weird tho.
Yeah, those used the old loss reduction (sum, not mean). I am facing a big problem with the recurrent networks now: because the training set gets bigger as the ratio of predatory to non-predatory conversations gets smaller, the time for running each epoch increases significantly. Right now, an LSTM model takes 10 minutes to complete one epoch. I assume running the code on one of the university servers could help. Can you please let me know how I can get access?
I noticed I had not updated the LSTM with the feature vector that includes message time as context.
I used the balanced dataset with a ratio of 0.4 here. For the hidden layer size of the LSTM, I used two values, 1024 and 2056; you can see the result of each as follows, respectively:
test set -> AUCROC: 0.9776518 | AUCPR: 0.9626508 | accuracy: 0.9324556 | recall: 0.8782270 | precision: 0.9649451 | f2score: 0.8943008
test set -> AUCROC: 0.9690026 | AUCPR: 0.9462874 | accuracy: 0.9052665 | recall: 0.8279669 | precision: 0.9633396 | f2score: 0.9328358
It seems that increasing the hidden layer size helps the model learn. Note that the feature vector size was 13000 for each of these sessions, where one feature stored time and one feature was reserved for unknown (non-defined) tokens.
Also, comparing the results of the temporal feature vectors and the regular token-only features, we can see the F2 score improved (the F2 score of the latter is 0.8402877). I think the token-only model was a bit overfitted; maybe a session with fewer epochs would yield better results.
The logs for the sessions mentioned above can be found here and here, respectively.
NOTE: I also updated the KFold best-model criterion to the F2 score since it makes more sense for us.
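For reference, the F-beta score weights recall beta times as heavily as precision, so F2 fits a setting where missing a predator is worse than a false alarm. A minimal sketch; as a sanity check, the first session's reported F2 (0.8943008) reproduces from its precision and recall:

```python
def f_beta(precision, recall, beta=2.0):
    # F_beta = (1 + b^2) * P * R / (b^2 * P + R); recall-weighted for beta > 1.
    b2 = beta * beta
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta=2, a recall gain moves the score more than an equal precision gain.
assert f_beta(0.5, 0.5) == 0.5
assert f_beta(0.9, 0.5) < f_beta(0.5, 0.9)

# Reproduces the hidden-size-1024 session's reported F2 score.
assert abs(f_beta(0.9649451, 0.8782270) - 0.8943008) < 1e-4
```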
The loss-epoch figure of LSTM with a hidden layer of size 1024
The loss-epoch figure of LSTM with a hidden layer of size 2056
@hamedwaezi01 good job!
The session for running LSTM against the real dataset is finished, and you can find its log here. I applied a filter on the training set before cross-validation: I dropped the conversations that had fewer or more than 2 participants.
test set -> AUCROC: 0.5824065 | AUCPR: 0.0380256 | accuracy: 0.9312181 | recall: 0.0757557 | precision: 0.1656409 | f2score: 0.1338726
Although the metrics in validation were much better, they are clearly not as expected on the test set. High accuracy combined with low recall means the model predicts most of the records as negative.
Judging by the loss-epoch figure and the predominantly negative predictions, the model is overfitted to the negative class. Applying regularization or dropout alongside more epochs could probably help the model learn better.
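The accuracy-vs-recall point can be checked with a toy confusion matrix; the counts below are hypothetical, chosen only to mimic a heavily imbalanced test set:

```python
def accuracy_recall(tp, fp, tn, fn):
    """Accuracy and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (tp + tn) / total, recall

# Hypothetical imbalanced test set: 50 positives out of 1000 records.
# A model that labels almost everything negative still looks "accurate".
acc, rec = accuracy_recall(tp=4, fp=20, tn=930, fn=46)
assert acc == 0.934  # high accuracy...
assert rec == 0.08   # ...but most predatory conversations are missed
```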
The figures for the Precision-Recall curve and the ROC curve, and the areas under them, suggest some insights on choosing an appropriate metric when handling imbalanced datasets. I will write about it here soon.
In this thread, you can find the progress related to the "Conversation Classification" problem. You may find different approaches here, like classic ML, Feedforward, CNN, RNN, LSTM, and GRU. I will try to cover other aspects of the project, like preprocessing and previous works, in different issues.