deepset-ai / FARM

Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

How to use calculate_class_weights with NERProcessor #813

Closed markusgl closed 2 years ago

markusgl commented 3 years ago

Question

Hi, I am currently fine-tuning a BERT model on a custom NER dataset. The classes are very imbalanced, so I tried to use the calculate_class_weights method on my data silo. Unfortunately, I ran into errors, and I am not sure which "task_type" to set. My script is based on this example, with a few parameter adjustments for my dataset.

When I run the following code on the above example, I get a TypeError, as "task_type" is not set:

class_weights = data_silo.calculate_class_weights(task_name="ner")
/usr/local/lib/python3.7/dist-packages/farm/data_handler/data_silo.py in calculate_class_weights(self, task_name, source)
    545             raise Exception("source argument expects one of [\"train\", \"all\"]")
    546         for dataset in datasets:
--> 547             if "multilabel" in self.processor.tasks[task_name]["task_type"]:
    548                 for x in dataset:
    549                     observed_labels += [label_list[label_id] for label_id in (x[tensor_idx] == 1).nonzero()]

TypeError: argument of type 'NoneType' is not iterable

So I tried to add the "task_type" to the NERProcessor by doing this (not exactly sure which "task_type" to set):

processor.add_task(name='ner', metric="seq_f1", task_type='ner', label_list=ner_labels)
/usr/local/lib/python3.7/dist-packages/farm/data_handler/data_silo.py in <listcomp>(.0)
    549                     observed_labels += [label_list[label_id] for label_id in (x[tensor_idx] == 1).nonzero()]
    550             else:
--> 551                 observed_labels += [label_list[x[tensor_idx].item()] for x in dataset]
    552 
    553         #TODO scale e.g. via logarithm to avoid crazy spikes for rare classes

ValueError: only one element tensors can be converted to Python scalars

I also tried to set the "task_type" to "multilabel", which results in an IndexError:

processor.add_task(name='ner', metric="seq_f1", task_type='multilabel', label_list=ner_labels)
1 frames
/usr/local/lib/python3.7/dist-packages/farm/data_handler/data_silo.py in <listcomp>(.0)
    547             if "multilabel" in self.processor.tasks[task_name]["task_type"]:
    548                 for x in dataset:
--> 549                     observed_labels += [label_list[label_id] for label_id in (x[tensor_idx] == 1).nonzero()]
    550             else:
    551                 observed_labels += [label_list[x[tensor_idx].item()] for x in dataset]

IndexError: list index out of range

Which "task_type" do I have to set, or am I doing something wrong in the code? Thank you for your help!

Additional context

This is how I create the processor:

processor = NERProcessor(tokenizer=tokenizer, 
                          max_seq_len=128, 
                          data_dir=Path(DATA_DIR),
                          delimiter=" ",
                          metric="seq_f1", 
                          label_list=ner_labels)
EikeKohl commented 3 years ago

I have the same issue. After investigating the calculate_class_weights() implementation, it seems that the method is not implemented for the NER task yet. Are you guys working on it already?
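
Until that is supported, one possible workaround is to compute the weights directly from the token-level labels instead of going through the data silo. The sketch below is not FARM code: the function name balanced_class_weights, its ignore_labels parameter, and the toy label sequences are all hypothetical. It assumes the common "balanced" heuristic, weight = n_tokens / (n_observed_classes * class_count), and that your labels are available as one list of tags per sentence.

```python
from collections import Counter


def balanced_class_weights(label_sequences, label_list, ignore_labels=()):
    """Return one weight per entry of label_list, in the same order.

    Hypothetical helper, not part of FARM. Uses the "balanced" heuristic:
    weight = n_tokens / (n_observed_classes * class_count).
    """
    # Count every token-level label across all sentences,
    # skipping labels that should not influence the weights.
    counts = Counter(
        label
        for sequence in label_sequences
        for label in sequence
        if label not in ignore_labels
    )

    # Only labels that actually occur contribute to the normalisation.
    observed = [label for label in label_list if counts[label] > 0]
    total = sum(counts[label] for label in observed)

    weights = []
    for label in label_list:
        if counts[label] == 0:
            # Assumption: unseen (or ignored) labels get a neutral weight.
            weights.append(1.0)
        else:
            weights.append(total / (len(observed) * counts[label]))
    return weights


# Toy data for illustration only.
sequences = [["O", "O", "B-PER", "O"], ["O", "B-LOC", "O", "O"]]
labels = ["O", "B-PER", "B-LOC"]
weights = balanced_class_weights(sequences, labels)
```

Rare tags end up with larger weights than the dominant "O" tag. The list could then be wrapped in a tensor and used as the weight argument of a weighted cross-entropy loss such as torch.nn.CrossEntropyLoss. Whether FARM's token classification head accepts class weights is something I have not verified, so please check the source before relying on this.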

rdemorais commented 3 years ago

+1 here

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.