Closed yiwang-verisk closed 4 years ago
Of course! Define your own processor class that inherits from `DataProcessor`. Make sure your processor can read the data properly and that it returns the correct labels. I think that's all you need to change. Please let me know how it goes.
Edit: Turns out this isn't that straightforward.
Do I need to change my labels like "A", "B", "C" into "0", "1", "2", or can I keep the original string labels for the different categories?
Any label is fine as long as it's a string. Just make sure that you include all the labels that are present in your data. For example, if you have the labels "A", "B", and "C", your `DataProcessor` subclass should look something like this:
```python
import os

class NewDataProcessor(DataProcessor):
    """Processor for the multiclass data sets."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        """See base class."""
        return ["A", "B", "C"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            text_a = line[3]
            label = line[1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
```
This assumes that your dataset is in a tsv file with the columns id, label, alpha, and text, in that order. The labels you use are mapped before the examples are converted to features (the line of code below, from the `convert_examples_to_features()` function), so it shouldn't matter what strings you use as labels, as long as all of them appear in `get_labels()`.
label_map = {label : i for i, label in enumerate(label_list)}
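To illustrate, with the labels from the example above, that comprehension produces the following mapping (a quick standalone check, independent of the repo code):

```python
label_list = ["A", "B", "C"]  # as returned by get_labels()

# The same comprehension used in convert_examples_to_features()
label_map = {label: i for i, label in enumerate(label_list)}

print(label_map)  # {'A': 0, 'B': 1, 'C': 2}
```

So each string label is simply replaced by its index in the list, which is why the actual strings don't matter, only that every label in your data is present in `get_labels()`.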
Thanks so much! I will try your code.
No problem! Let me know if anything goes wrong.
Hi there,
I tried to run a multi-class classification using your code, but I ran into an error during the training step:
```
INFO:main:Creating features from dataset file at data/
100%|██████████| 45/45 [00:00<00:00, 197.45it/s]
INFO:main:Saving features into cached file data/cached_train_bert-base-german-cased_128_multi
INFO:main: Running training
INFO:main:   Num examples = 45
INFO:main:   Num Epochs = 1
INFO:main:   Total train batch size = 8
INFO:main:   Gradient Accumulation steps = 1
INFO:main:   Total optimization steps = 6
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
HBox(children=(IntProgress(value=0, description='Iteration', max=6, style=ProgressStyle(description_width='ini…

RuntimeError                              Traceback (most recent call last)
```
This is not as straightforward as I thought (I really should have realized this, sorry). The issue is that the pre-trained models are designed for binary classification, and it doesn't seem trivial to adapt them for multiclass classification. Changing the config files and such breaks the loading of weights.
I think one way to do it would be to use the base model class (e.g. `RobertaModel` instead of `RobertaForSequenceClassification`) and add your own classification head.
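A minimal sketch of what such a head could look like, in plain PyTorch. This is not the repo's actual code; the hidden size of 768 (typical for base-sized models), the dropout rate, and the pooling strategy are all assumptions. In practice you would feed the base model's output hidden states into a module like this:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A simple multiclass head to put on top of a base transformer
    (e.g. RobertaModel). hidden_size=768 assumes a base-sized model."""

    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        # Use the hidden state of the first token (<s>/[CLS]) as the
        # sequence representation, then project to num_labels logits.
        pooled = hidden_states[:, 0, :]
        return self.classifier(self.dropout(pooled))

# Shape check with dummy hidden states: batch=2, seq_len=8, hidden=768
head = ClassificationHead(num_labels=3)
logits = head(torch.randn(2, 8, 768))
print(logits.shape)  # torch.Size([2, 3])
```

The logits can then go into `nn.CrossEntropyLoss`, which handles any number of classes, so the multiclass part itself is just a matter of setting the output dimension.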
Multiclass classification is now supported in the Simple Transformers library.
Can your code be used in situations where the number of labels is > 2?