Hi @None-Such,

roberta-base is clearly fitting to only one class and not really learning, so that makes total sense. In general, the no-free-lunch theorem still holds for pretrained models.

Regarding the multi-label datasets referred to in that issue: those were a set of internal problems, meaning that the data is not publicly available.

You can set an alternate optimizer via `trainer.train(..., optimizer=<optimizer_cls>)` or `trainer.fine_tune(..., optimizer=<optimizer_cls>)`, respectively. For MADGRAD you have to find an implementation or code it yourself.

@helpmefindaname - Most grateful =)
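For later readers, a minimal sketch of what passing an alternate optimizer looks like. The third-party `madgrad` package (`pip install madgrad`) is one available implementation, and `classifier`/`corpus` are assumed to be set up as in the training script further down:

```python
from madgrad import MADGRAD  # third-party package, not part of Flair

from flair.trainers import ModelTrainer

# `classifier` and `corpus` as constructed in the training script below
trainer = ModelTrainer(classifier, corpus)

# pass the optimizer *class*; Flair instantiates it with the model parameters
trainer.fine_tune(
    "resources/taggers/toxic-comments",
    optimizer=MADGRAD,
    learning_rate=1.0e-5,  # illustrative value, not a recommendation
)
```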
Objective
I am trying to use Flair to replicate the Kaggle Toxic Comment Classification Challenge, which seeks to identify and classify toxic online comments:
https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
Approach
To do this I started with the Flair Tutorial: Train a text classifier
https://flairnlp.github.io/docs/tutorial-training/how-to-train-text-classifier
I made 2 minor changes to accommodate the Kaggle Challenge training data:
> see code below
Clarification
To avoid any possible confusion, let me clarify one subtle aspect of multi-label classification: each comment can carry zero, one, or several of the six toxicity labels at once. The Kaggle data therefore requires `multi_label=True`, unlike the TREC_6 data in the Flair tutorial, where every sentence has exactly one label.
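Concretely, in the FastText-style format that `ClassificationCorpus` reads, a single training line can carry several `__label__` prefixes (label names taken from the Kaggle data; the comment text is a placeholder):

```
__label__toxic __label__obscene __label__insult <a comment with multiple labels>
__label__benign <a comment with no toxicity labels, after my workaround below>
```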
Performance
However, as far as I can tell, `allow_examples_without_labels=True` does not work, as it causes inference to fail entirely for me =(
To work around this, I tagged all unlabelled records as 'benign'.
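The workaround is a small preprocessing step; a sketch assuming pandas and the column names from the Kaggle `train.csv`:

```python
import pandas as pd

df = pd.read_csv("train.csv")
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# rows where no toxicity label is set get the placeholder label 'benign'
df["benign"] = (df[label_cols].sum(axis=1) == 0).astype(int)
```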
I proceeded to do multiple runs with different models (using Flair 0.12.2), but I got strange behavior at both training and inference time, which differed depending on the model used:
> see F1 Scores below
The Kaggle contest winner had a score of 0.98856. Interestingly, the winner's approach seems to align with Flair stacked embeddings: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/discussion/52557
Questions
Based on reviewing related GitHub Issues I have the following questions:
Question # 1 - Is my code below correct?
1a. Is there anything wrong with my simple adaptation of the text classifier tutorial (code below), given that I am targeting the Kaggle Toxic Comment multi-label data?
Question # 2 - Does the performance make sense?
2a. Is it possible for distilbert-base-uncased to provide higher accuracy than roberta-base? 2b. Why do the epoch-level F1-scores of the roberta-base and xlm-roberta-base models seem to get stuck after the 1st epoch? 2c. Why do the roberta-base and xlm-roberta-base models score zero on some classes in the test set (see Model Specific Results at the bottom)?
Question # 3 - @alanakbik made the comment 'everything seems to be working . . . with our multi-label datasets', in Issue: https://github.com/flairNLP/flair/issues/678#issuecomment-485863025
3a. Which multi-label dataset was @alanakbik referring to in that issue? 3b. And what settings are used to run it?
Question # 4 - @alanrios2001 made the comment 'training with torch's Adam optimizer, using MADGRAD the f1-score just work's fine..' in Issue: https://github.com/flairNLP/flair/issues/678#issuecomment-1526665624.
4a. How does one set an alternate optimizer when using Flair?
Question # 5 - Guidance for setting the learning_rate
5a. Is there any model specific guidance on setting the learning_rate when fine-tuning a transformer model? 5b. Any general guidance?
Question # 6 - @helpmefindaname mentioned adjusting the loss_weights in Issue: https://github.com/flairNLP/flair/issues/2869#issuecomment-1191657808
6a. Is this an option worth pursuing? 6b. If so, what are reasonable weights?
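For context, `loss_weights` is a constructor argument of `TextClassifier`; a sketch with placeholder values (not a tuned recommendation), reusing the `document_embeddings` and `label_dict` from the training script below:

```python
from flair.models import TextClassifier

# weights > 1.0 make errors on the named classes cost more in the loss;
# the rare classes and the 10.0 values here are placeholders, not recommendations
classifier = TextClassifier(
    document_embeddings,
    label_dictionary=label_dict,
    label_type="toxicity",
    multi_label=True,
    loss_weights={"threat": 10.0, "identity_hate": 10.0},
)
```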
Code
Adapted from Flair Tutorial "Train a text classifier"
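A minimal sketch of the full training script. The data paths, the `label_type` name, and the FastText-style preprocessing are my own choices; the 2 changes versus the tutorial are the Kaggle corpus and `multi_label=True`:

```python
from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. load the corpus; train.txt/dev.txt/test.txt in data/ are the Kaggle data
#    converted to FastText format (__label__xxx prefixes, see the example above)
corpus: Corpus = ClassificationCorpus(
    "data/",
    label_type="toxicity",
    allow_examples_without_labels=False,  # unlabelled rows were tagged 'benign' instead
)

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary(label_type="toxicity")

# 3. transformer document embeddings, fine-tuned end to end
document_embeddings = TransformerDocumentEmbeddings("distilbert-base-uncased", fine_tune=True)

# 4. change vs. the tutorial: multi_label=True for the Kaggle data
classifier = TextClassifier(
    document_embeddings,
    label_dictionary=label_dict,
    label_type="toxicity",
    multi_label=True,
)

# 5. fine-tune
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune(
    "resources/taggers/toxic-comments",
    learning_rate=5.0e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```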
Corpus Statistics
```python
print(corpus.obtain_statistics())
```
Model Specific Results
distilbert-base-uncased
roberta-base
xlm-roberta-base
`DocumentRNNEmbeddings([WordEmbeddings('glove'), FlairEmbeddings('news-forward'), FlairEmbeddings('news-backward')], hidden_size=512, reproject_words=True, reproject_words_dimension=256)`