deepset-ai / FARM

Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Different prediction after Huggingface conversion #821

Closed · markusgl closed this issue 2 years ago

markusgl commented 3 years ago

Question

Hi,

I fine-tuned a German uncased BERT model ("dbmdz/bert-base-german-uncased") for NER on GermEval 2014 plus some custom examples and converted it to Hugging Face format using this example. After the conversion, the tokenization and predictions differ from what the FARM Inferencer produces.
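
For reference, the conversion step looks roughly like the following (a minimal sketch, not the exact script: the paths are placeholders, and depending on the FARM version convert_to_transformers() may return either a single model or a list with one entry per prediction head):

from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.tokenization import Tokenizer

farm_dir = "saved_models/german-ner"    # placeholder: fine-tuned FARM model
hf_dir = "saved_models/german-ner-hf"   # placeholder: output directory

# Load the fine-tuned FARM model and convert it to a transformers model.
model = AdaptiveModel.load(farm_dir, device="cpu")
converted = model.convert_to_transformers()
if isinstance(converted, list):  # newer FARM versions return one model per prediction head
    converted = converted[0]
converted.save_pretrained(hf_dir)

# The tokenizer is saved separately from the model weights.
tokenizer = Tokenizer.load(farm_dir)
tokenizer.save_pretrained(hf_dir)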

Example sentence: "Ich heiße Peter und wohne in Wilhelmshaven." (the trailing period is part of the input, as the offsets below show).

Prediction with the FARM Inferencer:

[{'start': 10, 'end': 15, 'context': 'Peter', 'label': 'PER', 'probability': 0.99981683}, 
{'start': 29, 'end': 43, 'context': 'Wilhelmshaven.', 'label': 'LOC', 'probability': 0.9986634}]
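
For context, the FARM prediction above comes from roughly the following (the model path is a placeholder):

from farm.infer import Inferencer

# Load the fine-tuned model for NER inference (path is a placeholder).
inferencer = Inferencer.load("saved_models/german-ner", task_type="ner")
result = inferencer.inference_from_dicts(
    dicts=[{"text": "Ich heiße Peter und wohne in Wilhelmshaven."}]
)
print(result)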

Prediction with the converted model and the Hugging Face pipeline nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True):

[{'entity_group': 'PER', 'score': 0.9998180270195007, 'word': 'Peter', 'start': 10, 'end': 15}, 
{'entity_group': 'X', 'score': 0.9999816417694092, 'word': '##e', 'start': 24, 'end': 25}, 
{'entity_group': 'LOC', 'score': 0.999666690826416, 'word': 'Wilhelms', 'start': 29, 'end': 37}, 
{'entity_group': 'X', 'score': 0.999923825263977, 'word': '##haven.', 'start': 37, 'end': 43}]
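
Spelled out, the Hugging Face side looks like this (a sketch; the model directory is the assumed output of the conversion above):

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_dir = "saved_models/german-ner-hf"  # placeholder: converted model
model = AutoModelForTokenClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# grouped_entities=True merges adjacent tokens with the same label
# (newer transformers versions call this aggregation_strategy).
nlp = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
print(nlp("Ich heiße Peter und wohne in Wilhelmshaven."))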

What could cause such a problem? It looks like the WordPiece tokenization goes wrong (e.g. the stray ##e subword) and the classification is wrong as a result.
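
One way to narrow this down is to tokenize the same sentence with both tokenizers and compare; since the model is uncased, a plausible suspect is a lowercasing mismatch (e.g. do_lower_case not carried over into the converted tokenizer's config). A sketch of such a check, with placeholder paths:

from farm.modeling.tokenization import Tokenizer
from transformers import AutoTokenizer

text = "Ich heiße Peter und wohne in Wilhelmshaven."

farm_tok = Tokenizer.load("saved_models/german-ner")                  # original FARM dir
hf_tok = AutoTokenizer.from_pretrained("saved_models/german-ner-hf")  # converted dir

# The two token sequences should be identical for an exact conversion.
print(farm_tok.tokenize(text))
print(hf_tok.tokenize(text))
# For an uncased model the converted tokenizer should lowercase as well.
print(getattr(hf_tok, "do_lower_case", None))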

Thanks in advance for any help!
Markus

Background: I would prefer the Hugging Face pipeline over the FARM Inferencer because prediction on CPU is faster (roughly 0.5 s with FARM vs. 0.05 s with Hugging Face on my i7).

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.