Well, the model starts out dead certain that the entities you're tagging as BIRTHDAY are DATE entities. You have to unwind that knowledge to retype them, and so the losses are very high. I think this is the explanation anyway -- maybe something else is going on.

Have you considered just adding a second tagging model, that bootstraps off the first one and retypes the DATE entities? It might make the problem easier to learn.
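Roughly something like this (untested sketch, spaCy 2.x API): run the pretrained pipeline as-is and add a second pass that retypes the DATE spans it flags as birthdays. `is_birthday` here is just a placeholder for that second model -- e.g. a binary classifier over the DATE span and its surrounding context.

```python
import spacy
from spacy.tokens import Span

def is_birthday(span):
    # Placeholder: swap in the second model's prediction for this span.
    # The trigger words below are purely illustrative.
    return any(t.lower_ in {"born", "birthday", "dob"} for t in span.sent)

def retype_birthdays(doc):
    # Keep the pretrained entities, but retype DATE spans flagged as birthdays
    doc.ents = [
        Span(doc, ent.start, ent.end, label="BIRTHDAY")
        if ent.label_ == "DATE" and is_birthday(ent)
        else ent
        for ent in doc.ents
    ]
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe(retype_birthdays, after="ner")
```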
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
This issue is similar to #2783, but the resolution there isn't clear to me.
I have roughly 30k documents that I am using to update the large English spaCy model. I am also using these 30k documents to add a new entity label to the NER model. Each document has been fully annotated with the existing entities identified by the large English model as well as with the new entity label I am trying to add. The documents are very long, and any individual document contains a lot of entities (~25).
The entity I am adding is BIRTHDAY, so as part of creating the dataset I have to override the labels of some entities that spaCy tags as DATE and relabel them as BIRTHDAY.
After training, the performance metrics look okay (ents_r = 86), and the entities tagged on the validation set look reasonable.
However, the losses during training are very high (in the millions). What is causing this? I am only adding a single label, and it does not appear very often. I have attached some code snippets below.
Annotating existing entities using spaCy
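In simplified form (illustrative sketch, not the full code), this step runs the pretrained pipeline over each document and stores the character offsets of every entity it finds:

```python
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_lg")

# Character-offset annotations from the pretrained model, keyed by document id
ModelOffsets = {}
for doc_id, text in tqdm(zip(df['id'], df['clean_txt']), total=len(df)):
    doc = nlp(text)
    ModelOffsets[doc_id] = [(ent.start_char, ent.end_char, ent.label_)
                            for ent in doc.ents]
```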
Annotating entities tagged manually
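The manual BIRTHDAY annotations are also character offsets. Roughly, assuming one known birthday string per document (the `birthday_str` column name is illustrative):

```python
# Manual BIRTHDAY offsets, keyed by document id. Assumes one known birthday
# string per document; `birthday_str` is an illustrative column name.
BirthdayOffsets = {}
for doc_id, text, bday in zip(df['id'], df['clean_txt'], df['birthday_str']):
    start = text.find(bday)
    BirthdayOffsets[doc_id] = (
        [(start, start + len(bday), 'BIRTHDAY')] if start != -1 else []
    )
```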
Combining character offsets
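The two sets of offsets are then merged per document, with the manual BIRTHDAY spans taking precedence over any overlapping DATE spans from the model (this is where the DATE labels get overridden). Simplified:

```python
def overlaps(a, b):
    # True if two (start_char, end_char, label) spans overlap
    return a[0] < b[1] and b[0] < a[1]

FinalOffsets = {}
for doc_id, manual in BirthdayOffsets.items():
    # Keep model entities only if they don't clash with a manual BIRTHDAY span
    kept = [ent for ent in ModelOffsets[doc_id]
            if not any(overlaps(ent, b) for b in manual)]
    FinalOffsets[doc_id] = kept + manual
```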
Create training data
TRAIN_DATA = [
    (df.iloc[i, :]['clean_txt'], {'entities': FinalOffsets[df.iloc[i, :]['id']]})
    for i, txt in tqdm(enumerate(df['clean_txt'].tolist()))
]
Train Model
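The training loop is the usual spaCy 2.x update loop, shown here in simplified form (the iteration count, dropout, and batch sizes are illustrative):

```python
import random
from spacy.util import minibatch, compounding

ner = nlp.get_pipe('ner')
ner.add_label('BIRTHDAY')

# Train only the NER component, leaving the rest of the pipeline untouched
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for itn in range(30):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.3,
                       losses=losses)
        print(itn, losses)
```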
I have attached a screenshot of the performance metrics after the first iteration. They improve in later iterations, eventually reaching an ents_r of 86, but the loss remains huge. How is this possible?