asahi417 / tner

Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition, EACL 2021"
https://aclanthology.org/2021.eacl-demos.7/
MIT License

program crashes upon reaching validation step #28

Closed JanFreise closed 1 year ago

JanFreise commented 2 years ago

Hi,

Since the IOB format doesn't work yet (at least for me), I tried the standard way using the datasets you provide on HuggingFace.

Training also aborts with those. I guess it's due to a split-name mismatch while importing (the split is called "validation" on HuggingFace, but you refer to it as "valid"):

```
/content/drive/MyDrive/Colab Notebooks/tner/ner_model.py in evaluate(self, dataset, dataset_name, local_dataset, batch_size, dataset_split, cache_dir, cache_file_feature, cache_file_prediction, span_detection_mode, return_ci, unification_by_shared_label, separator)
    328             concat_label2id=self.label2id,
    329             cache_dir=cache_dir)
--> 330         assert dataset_split in data, f'{dataset_split} is not in {data.keys()}'
    331         output = self.predict(
    332             inputs=data[dataset_split]['tokens'],

AssertionError: valid is not in dict_keys(['train', 'validation', 'test'])
```
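The failing check can be reproduced in isolation. Here is a minimal sketch (a plain dict stands in for the loaded dataset; the split names are taken from the traceback above, and no tner install is needed):

```python
# The HuggingFace dataset exposes a "validation" split, while tner's
# default validation split name is "valid" -- so the membership check fails.
data = {"train": [], "validation": [], "test": []}  # split names as on the Hub
dataset_split = "valid"  # tner's default at the time of this issue

try:
    assert dataset_split in data, f"{dataset_split} is not in {data.keys()}"
except AssertionError as error:
    print(error)  # valid is not in dict_keys(['train', 'validation', 'test'])
```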

Started via:

```python
from tner import GridSearcher

searcher = GridSearcher(
    checkpoint_dir='./logs/ckpt_test',
    dataset="tner/wnut2017",  # either dataset (huggingface dataset) or local_dataset (custom dataset) should be given
    local_dataset=local_dataset,
    model="dbmdz/bert-base-german-cased",
    epoch=1,
    epoch_partial=1,
    n_max_config=1,
    batch_size=32,
    gradient_accumulation_steps=[4],
    crf=[True],
    lr=[1e-4, 1e-5],
    weight_decay=[None],
    random_seed=[42],
    lr_warmup_step_ratio=[None],
    max_grad_norm=[None],
    use_auth_token=True
)
searcher.train()
```

Best, Jan

asahi417 commented 1 year ago

Thanks for catching the error! The GridSearcher (https://github.com/asahi417/tner/blob/master/tner/ner_trainer.py#L293) has an argument dataset_split_valid, which specifies the split used for validation. Its default is valid, but as you mentioned, the HuggingFace dataset format uses validation rather than valid, which causes the error. I will fix it by making validation the default; in the meantime, please set it directly (dataset_split_valid='validation').
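Until the default changes, the workaround can be sketched roughly like this (based on the snippet in the issue; the other search hyperparameters are omitted here for brevity, so fill them in as before):

```python
from tner import GridSearcher

searcher = GridSearcher(
    checkpoint_dir='./logs/ckpt_test',
    dataset="tner/wnut2017",
    dataset_split_valid='validation',  # match the split name used on the HuggingFace Hub
    model="dbmdz/bert-base-german-cased",
    epoch=1,
    batch_size=32,
)
searcher.train()
```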

JanFreise commented 1 year ago

Ah, I see. Thanks!!