codebynao closed this 3 years ago
Okay the solution was very simple...
I just had to specify that I don't want to use the tokeniser:
sentence = Sentence(text, use_tokenizer=False)
Now it works as expected:
{
  "text": "Mon adresse mail est naomi@gmail.com",
  "labels": [],
  "entities": [
    {
      "text": "naomi@gmail.com",
      "start_pos": 21,
      "end_pos": 36,
      "labels": [
        {
          "_value": "EMAIL",
          "_score": 0.9972683191299438
        }
      ]
    }
  ]
}
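To see why turning the tokeniser off matters here, this is a minimal sketch in plain Python (no Flair involved). The two helper functions are hypothetical stand-ins: `whitespace_tokens` approximates what I understand `use_tokenizer=False` to do (split on whitespace only), and `punctuation_tokens` is a rough imitation of a punctuation-aware tokeniser, not the one the library actually uses. It only illustrates how the email can end up as three tokens instead of one.

```python
import re

text = "Mon adresse mail est naomi@gmail.com"


def whitespace_tokens(s):
    """Split on whitespace only, keeping (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", s)]


def punctuation_tokens(s):
    """Rough stand-in for a punctuation-aware tokeniser (an assumption, not
    the library's real behaviour): runs of word characters and dots form one
    token, any other symbol such as '@' becomes a token of its own."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"[\w.]+|[^\w\s]", s)]


ws = whitespace_tokens(text)
pt = punctuation_tokens(text)

# Whitespace splitting keeps the address as a single token.
print([tok for tok, _, _ in ws])
# The punctuation-aware split breaks it into three tokens.
print([tok for tok, _, _ in pt])
# Offsets of the last whitespace token, comparable to start_pos/end_pos above.
print(ws[-1])
```

If the training data has the email as one token but the text is split differently at prediction time, the model never sees the token it was trained on, which would be consistent with only part of the address being tagged.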
I am trying to train a model in French with some custom NER labels, but I can't manage to detect emails properly.
My first dataset looked like:
I tested my model with:
Only gmail.com is detected as B-EMAIL.
I also noticed that the email was split into separate tokens:
naomi@gmail.com => naomi @ gmail.com
So on another try I changed my dataset format to the following to see if it would make a difference:
Both training formats resulted in only gmail.com being labelled.
My training file:
I am really new to ML; this is actually my first time trying to create a custom model. I don't really know what I should try next. Is the problem coming from my dataset format, my training, or somewhere else?
Any help or guidance will be highly appreciated!