hamedwaezi01 opened 1 year ago
For the last few days, I have been getting some suspiciously wrong metric values and have tried to figure out the reason. I thought it was the model configuration itself but later realized the library that was used for the calculation was the problem. I fixed this issue by calculating the metrics, including accuracy, recall, precision, and area under the ROC curve manually.
Additionally, I updated the K-fold cross-validation to Stratified K-fold cross-validation; the former splits the data without keeping the distribution of labels, while the latter considers the label distribution.
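Roughly, the change behaves like this (toy labels for illustration, not the project's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20% positive, illustrative only.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)

# Plain KFold splits by position and ignores labels, so a fold's positive
# ratio can drift. StratifiedKFold keeps ~20% positives in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    assert y[val_idx].sum() == 4  # 20 positives spread evenly over 5 folds
```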
You can find some code that balances the datasets according to the specified ratio of positives to negatives. I uploaded a balanced dataset with a ratio of 0.4 to the Teams channel of the Osprey Project.
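The balancing idea can be sketched as undersampling negatives until positives make up the requested share; the function name and record shape below are illustrative, not the project's actual API:

```python
import random

def balance_by_ratio(records, ratio=0.4, seed=0):
    """Undersample negatives so positives make up `ratio` of the result.

    `records` are (features, label) pairs; this shape is assumed for
    illustration only.
    """
    pos = [r for r in records if r[1] == 1]
    neg = [r for r in records if r[1] == 0]
    # Solve positives / (positives + kept_negatives) == ratio for kept_negatives.
    n_neg = round(len(pos) * (1 - ratio) / ratio)
    rng = random.Random(seed)
    kept = pos + rng.sample(neg, min(n_neg, len(neg)))
    rng.shuffle(kept)
    return kept
```

For example, with 40 positives and ratio 0.4, the function keeps 60 negatives, so the result has 100 records.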
Also, according to the stats that were retrieved earlier, there was a demand to filter undesired records. Right now, one of the datasets passes only conversations with exactly 2 chatters and more than 3 messages. Any new dataset can easily apply this feature by overriding the BaseDataset.filter_records method.
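A subclass hook along these lines is what the override describes; the record fields (`chatters`, `messages`) are assumed for illustration:

```python
class BaseDataset:
    def filter_records(self, records):
        # Default: keep everything; subclasses narrow this down.
        return records

class ConversationDataset(BaseDataset):
    def filter_records(self, records):
        # Keep only 2-party conversations with more than 3 messages.
        return [
            r for r in records
            if len(r["chatters"]) == 2 and len(r["messages"]) > 3
        ]

records = [
    {"chatters": ["a", "b"], "messages": ["hi", "hello", "asl?", "no"]},
    {"chatters": ["a", "b", "c"], "messages": ["hi"] * 5},
    {"chatters": ["a", "b"], "messages": ["hi", "hello"]},
]
kept = ConversationDataset().filter_records(records)
assert len(kept) == 1  # only the 2-party, >3-message conversation survives
```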
@hamedwaezi01 as we talked today, please use a well-known library for metric calculation like scikit-learn or pytrec, ... nobody relies on our code for calculating metrics, which is a very critical point of evaluation and research.
As we talked, I ran some mock prediction-target pairs and updated the code. It is now working.
@hamedwaezi01 what does "working" mean :D Please put some results/figures to show that.
I used the torchmetrics lib again, but this time I ran some small mock prediction-target pairs, like extreme cases where everything is wrong or right or a single entry is wrong, and got the expected results.
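The same kind of extreme-case sanity check can be reproduced with scikit-learn (mock values only, not the project's data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

target = [1, 1, 0, 0]

# Everything right.
assert accuracy_score(target, [1, 1, 0, 0]) == 1.0
# Everything wrong.
assert accuracy_score(target, [0, 0, 1, 1]) == 0.0
# One false negative: recall drops to 0.5, precision stays 1.0.
assert recall_score(target, [1, 0, 0, 0]) == 0.5
assert precision_score(target, [1, 0, 0, 0]) == 1.0
```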
I currently experimented with 3 models (Feedforward, RNN, CNN) on a balanced (predatory/all == 0.4) toy test. I preserved the cross-validation splits and ran the same records on each model. We expect that the models have similar relative performance if we run them on the real dataset. The resulting metrics for each model on the test are as follows:
Feedforward: AUCROC: 0.7933513 | AUCPR: 0.6991223 | accuracy: 0.4496788 | precision: 0.9438503 | recall: 0.4172577
Simple RNN: AUCROC: 0.5137964 | AUCPR: 0.6405249 | accuracy: 0.6070664 | precision: 0.0213904 | recall: 0.8888889
Simple CNN: AUCROC: 0.5048176 | AUCPR: 0.4587629 | accuracy: 0.4004283 | precision: 1.0000000 | recall: 0.4004283
As we expected, Feedforward outperformed the other two models. I believe it is because the hyperparameters for RNN and CNN still need tuning. Beyond the simple RNN, we can also experiment with its more complex siblings, like LSTM and GRU.
NOTES: You can find the debug-level logs of this experiment under this file: logs/persist/05-09-2023-13-26-41.log
There is a problem with the RNN giving NaN values (this issue), and I am now experimenting with the model on the balanced NON-toy dataset to see if it helps or not. I also played with how the loss value is applied (using BCEWithLogitsLoss, for example, to be numerically more stable); I still got a NaN in one run. I will let you know about the RNN problem. If you have any ideas, please drop a comment here or on the other issue for RNN.
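For context, here is a minimal pure-Python sketch of the trick PyTorch's BCEWithLogitsLoss is based on: fusing the sigmoid into the loss via a log-sum-exp form so the log never sees an underflowed probability. The function names are mine, for illustration:

```python
import math

def bce_naive(x, y):
    # Sigmoid first, then log: for large |x|, 1 - p underflows to 0.0
    # and log(0) raises / produces -inf.
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(x, y):
    # Stable fused form: max(x, 0) - x*y + log(1 + exp(-|x|)).
    # Never takes the log of an underflowed value.
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# Large logit with the "wrong" label: the naive form blows up,
# the fused form returns the correct loss (~40).
try:
    bce_naive(40.0, 0)        # log(1 - sigmoid(40)) -> log(0.0)
    naive_ok = True
except ValueError:
    naive_ok = False

assert not naive_ok
assert math.isclose(bce_with_logits(40.0, 0), 40.0, rel_tol=1e-6)
```

This stabilizes the loss itself, though it would not by itself rule out NaNs coming from exploding recurrent gradients.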
@hamedwaezi01 thanks for the update.
Thanks. I have been trying some configurations, and before implementing big changes, I thought of trying different optimizers. Since the RNN does not work well with the toy set, I am running it on a larger balanced dataset; so far, the model is learning according to the recall and precision metrics, but they could be more stable.
About the curves, I am saving them, and I will post them in future updates.
Another thing about the built-in PyTorch RNN is that it's very slow. I plan to implement a basic but faster RNN model soon.
@hamedwaezi01 thanks for the update. we need to talk. don't go with the rnn implementation please. find why it's slow. does it engage the gpu? we don't reinvent the wheel unless it's absolutely necessary 😀
Actually, it does use the dedicated GPU almost to the fullest extent. The recurrent models do not support sparse matrices, and I think that is affecting the speed.
BTW, my first run of RNN with Adam optimizer is finished. The evaluation metrics for the test set are:
test set -> AUCROC: 0.8691356 | AUCPR: 0.8675280 | accuracy: 0.8809677 | precision: 0.7524753 | recall: 0.9376459
NOTE: Let's not forget that the datasets are balanced with a ratio of 0.4 (predatory/all)
You can find the logs here. The figures for precision-recall and TPR-FPR curves are as follows:
I ran the LSTM module yesterday. I used an Adam optimizer and a ReduceLROnPlateau scheduler. The log file for this session can be found here, where you can also find the parameters for the optimizer and scheduler. The test results for this session and the previous RNN session are as follows, respectively:
LSTM test set -> AUCROC: 0.9208137 | AUCPR: 0.9174601 | accuracy: 0.8924213 | precision: 0.9135670 | recall: 0.8334961
RNN test set -> AUCROC: 0.8691356 | AUCPR: 0.8675280 | accuracy: 0.8809677 | precision: 0.7524753 | recall: 0.9376459
The main differences between the LSTM and the RNN runs are in the loss curves and the higher AUCROC value. I am unsure how meaningful it is, but the LSTM's loss did not fluctuate as much as the RNN's.
Simultaneously, I read and learned more about transformers and attention mechanisms while the model ran.
Right now, I am running the GRU. A couple of forums noted that LSTM usually outperforms GRU, so it is more of an experiment.
Following is the loss value figure per epoch for the best fold of this session:
@hamedwaezi01 thanks for the update. looks like we're on the right track. just a quick issue: for the loss values of fold2 (last figure), shouldn't the legends be switched?
Regarding the legends, I think you were concerned about their slope or their value; am I correct? If that is the case, the legends are correct.
Also, I want to report the GRU, but before that, I should rerun the LSTM again so I can compare their results in a better way.
@hamedwaezi01 how do the training samples (red), which are seen by the model, have a larger loss than the validation (unseen) data?
The loss function uses sum as the reduction method, and because of the cross-validation, the validation set is much smaller than the training set. As a result, the summed validation loss comes out smaller than that of the training set.
@hamedwaezi01 so, they are not comparable this way. can you make them comparable by averaging or sth?
Yes, we can average them. I usually compare the slopes of the figures, but it makes sense to normalize them; it will give more insight. I will do it for future sessions.
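The normalization is just dividing each epoch's summed loss by the number of samples in that split; the numbers below are made up to show how the sum-reduced view can be misleading:

```python
def mean_loss(summed_loss, n_samples):
    """Convert a sum-reduced epoch loss into a per-sample mean."""
    return summed_loss / n_samples

# Illustrative numbers: with reduction='sum', the larger training set
# reports a bigger total even when its per-sample loss is lower.
train_sum, n_train = 450.0, 900   # e.g. 9 folds' worth of training data
val_sum, n_val = 60.0, 100        # 1 fold of validation data

assert train_sum > val_sum                   # misleading raw comparison
assert mean_loss(train_sum, n_train) == 0.5  # per-sample view
assert mean_loss(val_sum, n_val) == 0.6      # validation is actually worse
```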
Results of all the recurrent models:
GRU: test set -> AUCROC: 0.9441823 | AUCPR: 0.8956320 | accuracy: 0.8976665 | precision: 0.9545090 | recall: 0.8194349
LSTM: test set -> AUCROC: 0.9675696 | AUCPR: 0.9487866 | accuracy: 0.8999144 | precision: 0.9753813 | recall: 0.8121658
RNN: test set -> AUCROC: 0.8499330 | AUCPR: 0.8026983 | accuracy: 0.8600942 | precision: 0.8142895 | recall: 0.8323304
We can easily see that LSTM and GRU have better results in almost every metric. Although the validation-training losses of these two models show some overfitting, they outperformed the RNN model. The overfitting might be handled by regularization and similar approaches. I am trying to run other variants of the dataset (the ones with a ratio of 0.3, for example), but I am facing some resource-related problems. I might have to run it on a server. I will make sure to keep you posted.
Meanwhile, I would look into outlier detection literature as we have spoken before.
Figures of loss per epoch in the training phase of each chosen model
GRU
LSTM
simple RNN
@hamedwaezi01 now, the results and figures make sense :) the last figure (f3) is weird tho.
Yeah, those used the old loss reduction (sum, not mean). I am facing a big problem with the recurrent networks now: because the training set gets bigger as the ratio of predatory to non-predatory conversations gets smaller, the time for running each epoch increases significantly. Right now, an LSTM model takes 10 minutes to complete one epoch. I assume running the code on one of the university servers could help. Can you please let me know how I can get access?
I noticed I had not updated the LSTM with the feature vector that includes message time as context.
I used the balanced dataset with a ratio of 0.4 here. For the hidden layer size of the LSTM, I used two values, 1024 and 2056; you can see the result of each as follows, respectively:
test set -> AUCROC: 0.9776518 | AUCPR: 0.9626508 | accuracy: 0.9324556 | recall: 0.8782270 | precision: 0.9649451 | f2score: 0.8943008
test set -> AUCROC: 0.9690026 | AUCPR: 0.9462874 | accuracy: 0.9052665 | recall: 0.8279669 | precision: 0.9633396 | f2score: 0.9328358
It seems that increasing the hidden layer size helps the model learn. Note that the feature vector size was 13000 for each of these sessions, where one feature stored time and one feature was reserved for unknown (non-defined) tokens.
Also, comparing the results of the temporal feature vectors and the regular token-only features, we can see the F2 score improved (the F2 score of the latter is 0.8402877). I think the token-only model was a bit overfitted; maybe a session with fewer epochs would yield better results.
The logs for the sessions mentioned above can be found here and here, respectively.
NOTE: I also updated the KFold best-model criterion to the F2 score since it makes more sense for us.
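For reference, the F-beta score weights recall beta times as heavily as precision, so F2 fits a setting where missing a predator is worse than a false alarm. A minimal sketch; as a sanity check, the first session's reported F2 (0.8943008) reproduces from its precision and recall:

```python
def f_beta(precision, recall, beta=2.0):
    # F_beta = (1 + b^2) * P * R / (b^2 * P + R); recall-weighted for beta > 1.
    b2 = beta * beta
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta=2, a recall gain moves the score more than an equal precision gain.
assert f_beta(0.5, 0.5) == 0.5
assert f_beta(0.9, 0.5) < f_beta(0.5, 0.9)

# Reproduces the hidden-size-1024 session's reported F2 score.
assert abs(f_beta(0.9649451, 0.8782270) - 0.8943008) < 1e-4
```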
The loss-epoch figure of LSTM with a hidden layer of size 1024
The loss-epoch figure of LSTM with a hidden layer of size 2056
@hamedwaezi01 good job!
The session for running LSTM against the real dataset is finished, and you can find its log here. I applied a filter on the training set before cross-validation: I dropped the conversations that had fewer or more than 2 participants.
test set -> AUCROC: 0.5824065 | AUCPR: 0.0380256 | accuracy: 0.9312181 | recall: 0.0757557 | precision: 0.1656409 | f2score: 0.1338726
Although the metrics in validation were much better, they are clearly not as expected on the test set. High accuracy combined with low recall means the model predicts most of the records as negative.
Judging by the loss-epoch figure and the predominantly negative predictions, the model is overfitted to the negative class. Applying regularization or dropout alongside more epochs could probably help the model learn better.
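The accuracy-vs-recall point can be checked with a toy confusion matrix; the counts below are hypothetical, chosen only to mimic a heavily imbalanced test set:

```python
def accuracy_recall(tp, fp, tn, fn):
    """Accuracy and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (tp + tn) / total, recall

# Hypothetical imbalanced test set: 50 positives out of 1000 records.
# A model that labels almost everything negative still looks "accurate".
acc, rec = accuracy_recall(tp=4, fp=20, tn=930, fn=46)
assert acc == 0.934  # high accuracy...
assert rec == 0.08   # ...but most predatory conversations are missed
```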
The figures for the Precision-Recall curve and the ROC curve, and the areas under them, suggest some insights on choosing an appropriate metric when handling imbalanced datasets. I will write about it here soon.
In this thread, you can find the progress related to the "Conversation Classification" problem. You may find different approaches here, like classic ML, Feedforward, CNN, RNN, LSTM, and GRU. I will try to cover other aspects of the project, like preprocessing and previous works, in different issues.