asahi417 / tner

Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition, EACL 2021"
https://aclanthology.org/2021.eacl-demos.7/
MIT License
373 stars 41 forks

Strange prediction behavior #22

Closed JaouadMousser closed 2 years ago

JaouadMousser commented 2 years ago

Hi Asahi,

I am having an issue with a model I trained using tner. I used a custom dataset with labels like "INCEPTION_DATE", "PARNTER_COUNTRY", etc. The training itself seems to go well, but when I call the predict function, I get different labels like "Date", "City", and other entities that were not in my data.

Is there anything I am missing here?

I would appreciate any advice.

johann-petrak commented 2 years ago

I see the same behaviour and I also do not understand what is going on: I am training on CoNLL-2003, which has types like PER, LOC, and ORG, but the trained model returns "location", "organization", etc. Why and how is this done? I would have expected that loading the custom dataset would expose the labels that actually occur in the dataset.

This is extremely confusing; what is going on?

johann-petrak commented 2 years ago

Oh, sorry, I think my issue is actually different: what I see seems to happen in https://github.com/asahi417/tner/blob/83eb39f4afb8ef0d229f10e7546c63760d8d872d/tner/get_dataset.py#L429 where certain known types are mapped to pre-defined types.

This is not useful behavior in situations where we need exactly the types from the dataset. Could this mapping please be made optional?

Update: yes, deactivating the processing on that line makes the model use the original types.
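For illustration, the kind of label unification being discussed could look like the sketch below. The function name `normalize_type`, the `unify_labels` flag, and the mapping entries are all hypothetical; they are not tner's actual code, just a minimal model of mapping known types to shared ones and of how making that step optional might work:

```python
# Hypothetical sketch of dataset-type unification (illustrative only,
# not tner's actual implementation).
SHARED_LABEL_MAP = {
    "PER": "person",
    "LOC": "location",
    "ORG": "organization",
}

def normalize_type(entity_type: str, unify_labels: bool = True) -> str:
    """Map a dataset-specific type to a shared type; pass-through when disabled."""
    if not unify_labels:
        return entity_type  # keep the dataset's original type
    # Unknown types fall through unchanged.
    return SHARED_LABEL_MAP.get(entity_type, entity_type)

print(normalize_type("LOC"))                      # "location"
print(normalize_type("LOC", unify_labels=False))  # "LOC"
```

With a flag like this defaulting to the current behavior, existing users would be unaffected while custom-dataset users could opt out.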

JaouadMousser commented 2 years ago

Thanks Johann. I will try to deactivate the mapping part, but I am not sure it will solve the problem, given that my labels are not part of the map provided in the code.

johann-petrak commented 2 years ago

@JaouadMousser which model do you start with? Is it one from the Hugging Face Hub?

JaouadMousser commented 2 years ago

Yes, it is bert-base-multilingual-cased.

johann-petrak commented 2 years ago

OK, this looks very weird to me; there should be no way for other chunk labels to get used with that base model. Is this reproducible?

asahi417 commented 2 years ago

Hi @JaouadMousser, is there any chance I can have a look at a few examples of your dataset? It doesn't need to be a subset of your original data, but ideally the file should be in the same format as yours and contain all the entity types your dataset has. With that dataset, I could run model training and inference on my end to see what's going on.

JaouadMousser commented 2 years ago

Hi,

I found where the problem is coming from. The code expects two-part labels like "B-ORG". In my case I have three-part labels like "B-INCEPTION-DATE", "B-PERIOD-DATE", etc. The decode_ner_tags function splits these labels and takes the last part. Since I have many labels ending in "-DATE", "-CITY", etc., the predict function returns "DATE" for every label ending in "-DATE", and so on. @johann-petrak, @asahi417 thank you for your support.
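The failure mode described above can be reproduced in a few lines. This is only a minimal illustration of the splitting logic (the real decode_ner_tags lives in tner and does more than this); the helper name is made up:

```python
# Minimal reproduction of the issue: splitting a tag on "-" and keeping
# only the last piece loses information for three-part labels.
def entity_type_last_part(tag: str) -> str:
    return tag.split("-")[-1]

print(entity_type_last_part("B-ORG"))             # "ORG"  -- fine
print(entity_type_last_part("B-INCEPTION-DATE"))  # "DATE" -- wrong, should be "INCEPTION-DATE"
```

Any two labels that share a final segment ("B-INCEPTION-DATE" and "B-PERIOD-DATE" both end in "DATE") therefore collapse into the same predicted type.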

asahi417 commented 2 years ago

Hi @JaouadMousser, thank you for figuring out the issue. This should indeed be handled in a wiser way (e.g. strip only the leading "{B,I}-" prefix and keep the rest). I'll add this to my todo list for the next version. Really appreciate it!