Closed rohitjajee closed 2 years ago
Hello @rohitjajee this is a known issue that is fixed in master branch and will be part of a bugfix release that's coming soon.
In the meantime, you can install the master branch through pip to get the fix now:
pip install --upgrade git+https://github.com/flairNLP/flair.git
Hello @alanakbik,
Thanks for your quick response. The code runs without any error now, but it is not calculating the metrics correctly. Please see the output below. All the values are zero.
By class:
              precision    recall  f1-score   support

      uberon     0.0000    0.0000    0.0000      6540
          pr     0.0000    0.0000    0.0000      6294
       go_bp     0.0000    0.0000    0.0000      3527
          so     0.0000    0.0000    0.0000      3342
   ncbitaxon     0.0000    0.0000    0.0000      3098
       chebi     0.0000    0.0000    0.0000      2189
          cl     0.0000    0.0000    0.0000      1735
     Disease     0.0000    0.0000    0.0000         0
       go_cc     0.0000    0.0000    0.0000      1153
         mop     0.0000    0.0000    0.0000        95
       go_mf     0.0000    0.0000    0.0000        91

   micro avg     0.0000    0.0000    0.0000     28064
   macro avg     0.0000    0.0000    0.0000     28064
weighted avg     0.0000    0.0000    0.0000     28064
 samples avg     0.0000    0.0000    0.0000     28064
Loss: 2.3716938495635986
Yes, this dataset has different labels like ncbitaxon or go_cc, while the tagger predicts entities with the label disease, which is not annotated in this dataset. So the score of 0 is correct because there is a label mismatch.
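To illustrate why a label mismatch gives all-zero scores, here is a minimal sketch of exact-match span scoring per label. It is not Flair's actual evaluation code; the function name per_label_scores and the toy spans are made up for illustration. Since no predicted label ever equals a gold label, every class ends up with zero true positives, and the predicted class has a support of 0, just like Disease in the report above.

```python
def per_label_scores(gold, predicted):
    """Exact-match precision/recall/F1 per label for (start, end, label) spans.

    Hypothetical helper for illustration only, not part of Flair.
    """
    gold_set, pred_set = set(gold), set(predicted)
    labels = sorted({label for _, _, label in gold_set | pred_set})
    scores = {}
    for label in labels:
        g = {span for span in gold_set if span[2] == label}
        p = {span for span in pred_set if span[2] == label}
        tp = len(g & p)  # a span counts only if offsets AND label match
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1, len(g))  # last item = support
    return scores


# Toy data: gold spans use corpus-specific labels, the tagger predicts "Disease".
gold = [(0, 5, "ncbitaxon"), (10, 14, "go_cc")]
predicted = [(20, 28, "Disease")]

scores = per_label_scores(gold, predicted)
for label, (precision, recall, f1, support) in scores.items():
    print(f"{label:>10} {precision:.4f} {recall:.4f} {f1:.4f} {support}")
```

Evaluating on a corpus whose labels match the tagger's output (or mapping the labels first) is what makes the scores meaningful.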
@mariosaenger is it correct that CRAFT_V4 has no disease labels?
@alanakbik you are right, my bad! CRAFT_V4 has no disease labels. Thank you.
@alanakbik CRAFT_V4 is the original version of the corpus, without any label mapping. For training HunFlair we created distinct corpora for each entity type (e.g. HUNER_GENE_CRAFT_V4 or HUNER_SPECIES_CRAFT_V4) which map the corpus-specific tags to more general ones (e.g. ncbitaxon to species). However, each of these corpora covers only one entity type. This is necessary due to the multi gold standard training procedure of HunFlair.
Unfortunately, we don't provide a corpus version containing all HunFlair-supported entity types. This could be easily implemented, e.g.:
from pathlib import Path

from flair.datasets import CRAFT_V4
from flair.datasets.biomedical import (
    HunerDataset,
    InternalBioNerDataset,
    SPECIES_TAG,
    GENE_TAG,
    CHEMICAL_TAG,
    filter_and_map_entities,
)


class HUNER_CRAFT_V4(HunerDataset):
    """HUNER version of the CRAFT corpus."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @staticmethod
    def split_url() -> str:
        return "https://raw.githubusercontent.com/hu-ner/huner/master/ner_scripts/splits/craft_v4"

    def to_internal(self, data_dir: Path) -> InternalBioNerDataset:
        corpus_dir = CRAFT_V4.download_corpus(data_dir)
        corpus = CRAFT_V4.parse_corpus(corpus_dir)

        # Map corpus-specific tags to general ones
        entity_type_mapping = {
            "ncbitaxon": SPECIES_TAG,
            "pr": GENE_TAG,
            "chebi": CHEMICAL_TAG,
        }

        return filter_and_map_entities(corpus, entity_type_mapping)
Please note that the CRAFT corpus contains further entity annotations we don't support at all (e.g. cells or anatomical entities).
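As a rough picture of what such a mapping does to the annotations, here is a standalone sketch on plain tuples. It does not use Flair's filter_and_map_entities or its data structures; the function filter_and_map, the tag strings and the toy entities are all assumptions made for illustration. Entities whose type is in the mapping get renamed, everything else (e.g. uberon or cl) is dropped.

```python
def filter_and_map(entities, mapping):
    """Keep only entities whose type appears in `mapping`, renaming the type.

    Hypothetical stand-in for illustration; entities are (start, end, type) tuples.
    """
    return [
        (start, end, mapping[entity_type])
        for start, end, entity_type in entities
        if entity_type in mapping
    ]


# Placeholder tag strings, assumed here rather than taken from Flair's constants.
mapping = {"ncbitaxon": "Species", "pr": "Gene", "chebi": "Chemical"}

entities = [
    (0, 5, "ncbitaxon"),   # mapped to "Species"
    (8, 12, "uberon"),     # anatomical entity, not in the mapping: dropped
    (15, 20, "pr"),        # mapped to "Gene"
    (22, 25, "cl"),        # cell type, not in the mapping: dropped
]

mapped = filter_and_map(entities, mapping)
print(mapped)
```

Dropping unmapped types instead of keeping their original labels avoids evaluating or training against tags the taggers never predict.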
@mariosaenger thanks for the info and the code example!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
My code:
import flair
from flair.datasets import biomedical
from flair.models import MultiTagger

craft = biomedical.CRAFT_V4()
hunflair_tagger = MultiTagger.load("hunflair")
disease_tagger = hunflair_tagger.name_to_tagger["hunflair-disease"]
print(disease_tagger.evaluate(craft.test, 'ner'))
Error: IndexError Traceback (most recent call last)