IndoNLP / indonlu

The first-ever vast natural language processing benchmark for the Indonesian language. We provide multiple downstream tasks, pre-trained IndoBERT models, and starter code! (AACL-IJCNLP 2020)
https://indobenchmark.com
Apache License 2.0

Training and validation accuracy for multi-label classification #36

Closed itscrimsonaut closed 1 year ago

itscrimsonaut commented 2 years ago

Hi, thank you IndoNLU team for making this IndoBERT model. I'm currently working on my thesis using IndoBERT for the BertForMultiLabelClassification task.

I have successfully run the "finetune_casa.ipynb" provided in the examples folder.

I used a private dataset and adapted it to the format of AspectBasedSentimentAnalysisAiryDataset in utils/data_utils.py. The dataset I use is imbalanced.

However, when I visualize the results with matplotlib, the train and eval accuracies differ considerably, and the eval accuracy tends to stay static. I have also tried the same experiment with the dataset provided in dataset/casa_absa-prosa, but the results are not much different from those on my own dataset. This is the code I used for fine-tuning:

# Assumes the setup from finetune_casa.ipynb: model, optimizer, train_loader,
# valid_loader, and i2w are already defined, and
# forward_sequence_multi_classification, absa_metrics_fn, metrics_to_string,
# and get_lr come from the indonlu utils.
import torch
from tqdm import tqdm

train_loss_lists = []
train_acc_lists = []
eval_loss_lists = []
eval_acc_lists = []

# Train
n_epochs = 8
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        # Forward pass (the last element of batch_data is the raw sequence)
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        # Collect predictions and gold labels for the epoch-level metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))

    # Calculate train metrics over the whole epoch
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))
    train_acc_lists.append(metrics['ACC'])
    current_train_loss = round(total_train_loss/(i+1), 4)
    train_loss_lists.append(current_train_loss)

    # Evaluate on the validation set
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0, 0, 0
    list_hyp, list_label = [], []

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]  # raw sequence (unused here)
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        # Accumulate validation loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        # Collect predictions and gold labels; metrics are recomputed every
        # batch here only to refresh the progress-bar description
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = absa_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    # Final validation metrics for the epoch
    metrics = absa_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))
    eval_acc_lists.append(metrics['ACC'])
    current_eval_loss = round(total_loss/(i+1), 4)
    eval_loss_lists.append(current_eval_loss)
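
For reference, a minimal sketch of how these lists can be plotted (the original post did not include the plotting code itself; this assumes only the four lists collected by the loop above):

# Minimal sketch: plot the loss and accuracy curves collected above.
# Not the poster's original plotting code.
import matplotlib.pyplot as plt

epochs = range(1, len(train_loss_lists) + 1)

fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, train_loss_lists, label='train')
ax_loss.plot(epochs, eval_loss_lists, label='valid')
ax_loss.set_xlabel('epoch')
ax_loss.set_ylabel('loss')
ax_loss.legend()

ax_acc.plot(epochs, train_acc_lists, label='train')
ax_acc.plot(epochs, eval_acc_lists, label='valid')
ax_acc.set_xlabel('epoch')
ax_acc.set_ylabel('accuracy')
ax_acc.legend()

plt.tight_layout()
plt.show()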

The matplotlib result looks like this: [attached plot: the training accuracy keeps rising while the validation accuracy stays nearly flat]

Obviously there's some issue with how this is checked, but I can't put my finger on it. Is there anything I can check?

SamuelCahyawijaya commented 2 years ago

Hi @exzt, first of all, if the labels are imbalanced, it is better to use F1 or another evaluation metric that is less affected by class skew.
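
As a minimal sketch of what this could look like with scikit-learn (this assumes list_label and list_hyp are the per-epoch lists of per-aspect label strings collected in the loop above; note that indonlu's own absa_metrics_fn already reports F1 alongside ACC, as the logs later in this thread show):

from sklearn.metrics import f1_score

# Assumption: list_label / list_hyp are lists of per-sample lists of
# per-aspect gold and predicted label strings, as collected above.
flat_true = [label for row in list_label for label in row]
flat_pred = [label for row in list_hyp for label in row]

# Macro-averaging weights every class equally, so minority classes are not
# drowned out by the majority class the way plain accuracy is.
macro_f1 = f1_score(flat_true, flat_pred, average='macro')
print('macro F1: {:.4f}'.format(macro_f1))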

Regarding the issue, before getting into the accuracy, I feel quite unsure about the loss curve. Does the loss drop significantly over the first few steps? One would expect a steep downward trend in the training loss, especially in the early steps/epochs of fine-tuning. Do you observe that behavior early in your fine-tuning?

Additionally, based on the training loss, it seems like something is a bit off (either the hyperparameter settings or some other problem). You can first try to overfit the model: for example, turn off all the dropout by calling model.eval() before each training step, then tune the model and see whether you can reach 0 training loss.

If you can achieve 0 training loss, you can then reapply the dropout and other regularizers to get better generalization, so that the model performs well on both the validation and the test sets. If you cannot achieve 0 training loss, I would suggest checking your data (or code); there is probably some problem preventing the model from learning well on the training set.
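
A minimal sketch of this sanity check, reusing the training loop from the first post (in PyTorch, model.eval() only switches module modes such as dropout; gradients still flow, because gradient tracking is controlled separately by torch.set_grad_enabled):

# Sanity check: train with dropout disabled and try to drive the training
# loss toward 0. model.eval() turns dropout off; gradients are still computed
# because torch.set_grad_enabled(True) is on.
model.eval()
torch.set_grad_enabled(True)

for epoch in range(n_epochs):
    for i, batch_data in enumerate(train_loader):
        loss, batch_hyp, batch_label = forward_sequence_multi_classification(
            model, batch_data[:-1], i2w=i2w, device='cuda')
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # If the loss plateaus well above 0 even on a small training subset,
    # inspect the data pipeline and labels rather than the regularizers.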

Hope it helps!

itscrimsonaut commented 2 years ago

Hi @SamuelCahyawijaya, thank you for your response.

As you suggested, I changed the metric from accuracy to F1. I do see the loss decrease at the beginning of each training run, but later on there are occasional spikes in the loss curve, and in the end the loss never reaches 0.

I apologize, as I am a newbie in Python especially. I am still confused about what you mean by overfitting the model and turning off all the dropout using model.eval() before each training step in my case.

Could you help me do this? Thank you in advance!

cynthiasbahri commented 2 years ago

Hi, IndoNLU team. Thank you for making this IndoBERT model. I am working on coursework that also involves a BERT model on multi-label data, and I found this model very helpful. However, when I execute the code in "finetune_casa.ipynb", I see something odd in the accuracy on the valid data: the value stays on the same flat line. [attached screenshots: training vs. validation accuracy curves]

SamuelCahyawijaya commented 2 years ago

@exzt @cynthiasbahri: Sorry for the late reply, and thank you for your interest in using IndoNLU. I just updated requirements.txt to ensure multi_label_classification.py works with the newer transformers and pytorch versions. I tried running it on our dataset and it seems to work fine. You can see the log as follows:

== Training ==

(Epoch 1) TRAIN LOSS:4.0579 ACC:0.77 F1:0.30 REC:0.34 PRE:0.33 LR:0.00001000
(Epoch 2) TRAIN LOSS:3.0701 ACC:0.81 F1:0.42 REC:0.41 PRE:0.77 LR:0.00001000
(Epoch 3) TRAIN LOSS:2.0327 ACC:0.90 F1:0.69 REC:0.64 PRE:0.87 LR:0.00001000
(Epoch 4) TRAIN LOSS:1.4478 ACC:0.94 F1:0.82 REC:0.78 PRE:0.89 LR:0.00001000
(Epoch 5) TRAIN LOSS:1.1096 ACC:0.97 F1:0.90 REC:0.88 PRE:0.93 LR:0.00001000
(Epoch 6) TRAIN LOSS:0.8669 ACC:0.98 F1:0.94 REC:0.92 PRE:0.96 LR:0.00001000
(Epoch 7) TRAIN LOSS:0.7121 ACC:0.98 F1:0.96 REC:0.95 PRE:0.97 LR:0.00001000
(Epoch 8) TRAIN LOSS:0.5732 ACC:0.99 F1:0.97 REC:0.97 PRE:0.98 LR:0.00001000
(Epoch 9) TRAIN LOSS:0.4943 ACC:0.99 F1:0.98 REC:0.97 PRE:0.98 LR:0.00001000
(Epoch 10) TRAIN LOSS:0.4220 ACC:0.99 F1:0.98 REC:0.98 PRE:0.99 LR:0.00001000
(Epoch 11) TRAIN LOSS:0.3636 ACC:1.00 F1:0.99 REC:0.98 PRE:0.99 LR:0.00001000
(Epoch 12) TRAIN LOSS:0.3158 ACC:1.00 F1:0.99 REC:0.99 PRE:0.99 LR:0.00001000
(Epoch 13) TRAIN LOSS:0.2817 ACC:1.00 F1:0.99 REC:0.99 PRE:0.99 LR:0.00001000
(Epoch 14) TRAIN LOSS:0.2512 ACC:1.00 F1:0.99 REC:0.99 PRE:1.00 LR:0.00001000
(Epoch 15) TRAIN LOSS:0.2293 ACC:1.00 F1:1.00 REC:0.99 PRE:1.00 LR:0.00001000

== Validation ==

(Epoch 1) VALID LOSS:3.4720 ACC:0.79 F1:0.29 REC:0.33 PRE:0.26
(Epoch 2) VALID LOSS:2.3207 ACC:0.87 F1:0.63 REC:0.57 PRE:0.88
(Epoch 3) VALID LOSS:1.7763 ACC:0.92 F1:0.79 REC:0.73 PRE:0.87
(Epoch 4) VALID LOSS:1.5538 ACC:0.93 F1:0.84 REC:0.80 PRE:0.89
(Epoch 5) VALID LOSS:1.4540 ACC:0.93 F1:0.84 REC:0.80 PRE:0.90
(Epoch 6) VALID LOSS:1.3484 ACC:0.93 F1:0.84 REC:0.80 PRE:0.89
(Epoch 7) VALID LOSS:1.3041 ACC:0.94 F1:0.86 REC:0.82 PRE:0.90
(Epoch 8) VALID LOSS:1.2512 ACC:0.93 F1:0.84 REC:0.80 PRE:0.89
(Epoch 9) VALID LOSS:1.1802 ACC:0.93 F1:0.82 REC:0.79 PRE:0.87
(Epoch 10) VALID LOSS:1.2163 ACC:0.93 F1:0.84 REC:0.82 PRE:0.87
(Epoch 11) VALID LOSS:1.1423 ACC:0.94 F1:0.87 REC:0.84 PRE:0.91
(Epoch 12) VALID LOSS:1.1928 ACC:0.95 F1:0.89 REC:0.86 PRE:0.91
(Epoch 13) VALID LOSS:1.1717 ACC:0.94 F1:0.88 REC:0.85 PRE:0.90
(Epoch 14) VALID LOSS:1.1248 ACC:0.94 F1:0.87 REC:0.83 PRE:0.91
(Epoch 15) VALID LOSS:1.1668 ACC:0.94 F1:0.87 REC:0.85 PRE:0.90

If you are using new datasets, we cannot guarantee you will get similar results. Additionally, if you have an imbalanced dataset, I think that won't be a problem as long as the train, valid, and test distributions are similar. In case your dataset is highly imbalanced (e.g., a 99% vs. 1% class ratio), I would suggest using metrics that don't rely on a single threshold, such as AUROC and AUPRC, to get a more representative evaluation score.
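
A minimal sketch of these threshold-free metrics with scikit-learn (y_true and y_score are hypothetical placeholders here: binary gold labels and predicted positive-class probabilities for one aspect, which the decoded label strings in the loop above do not expose directly):

# Threshold-free metrics for highly imbalanced labels.
# Assumption: y_true holds binary gold labels and y_score the model's
# predicted probabilities for the positive class; both are placeholders.
from sklearn.metrics import roc_auc_score, average_precision_score

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the precision-recall curve
print('AUROC: {:.4f}  AUPRC: {:.4f}'.format(auroc, auprc))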

cynthiasbahri commented 2 years ago

Hi @SamuelCahyawijaya, thank you very much for the response and the solution you gave. I had really been looking forward to this, and now the graph looks better.