Closed FrancescoCasalegno closed 3 years ago
The following resources have been created.
- NER – Collect annotations with Prodigy to train a new entity type – Confluence page with detailed instructions on how to use Prodigy to train a new entity type.
@Stannislav @pafonta @jankrepl Feel free to review those resources and give feedback.
Hello @EmilieDel and @FrancescoCasalegno,
> Feel free to review those resources and give feedback.
Awesome work on coming up with a defined workflow (PDF diagram) and a how-to (Confluence page)!
Regarding the questions in the PDF diagram on `ner.manual` / `ner.correct` vs `ner.teach`: why not distinguish them by their usage? So, when the training is:

- from scratch without any existing model: `ner.manual`;
- from scratch but an existing model gives somehow good suggestions: `ner.correct`;
- continued from an existing model: `ner.teach`.

Besides, I would have the following questions:
1. Maybe we could number the diagram boxes to help us refer to them?
2. If we use `sense2vec` (done in the Confluence counterpart), I think that training / improving such a model for neuroscience should be part of the diagram (PDF).
3. Why make the steps `ner.teach` and `ner.correct` optional before deploying a NER model in production?
4. The box "Can we define an explicit pattern to fix the error with a Rule Based NER step?" has no alternative for `no`. If that's intentional, maybe it doesn't make sense to have that box, then.
Hello @pafonta,
Maybe @FrancescoCasalegno can answer better but I can say:
> So, when the training is:
> - from scratch without any existing model: `ner.manual`;
> - from scratch but an existing model gives somehow good suggestions: `ner.correct`;
> - continued from an existing model: `ner.teach`.

I think that is the implicit idea of it, but maybe it is easier to keep the current logic for the scientists.

1. Yes, it is something we can do.
2. I am not sure I understand your question here. `sense2vec` is used to extract a list of patterns that are going to be pre-highlighted with the recipe `ner.manual` (`sense2vec.teach` and `sense2vec.to-patterns`/`terms.to-patterns`). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).
3. Because if the model is good enough after the first `ner.manual`, or an entity ruler can even do the NER, it is maybe not needed to ask the scientists for more annotations.
4. The alternative for `no` regarding that box is not well defined yet. Indeed, it is part of the questions on the left: should we try to create a new model for this case or a binary classifier, and how? This part is not decided yet.

Hello @EmilieDel!
Thank you for your answers.
> So, when the training is:
> - from scratch without any existing model: `ner.manual`;
> - from scratch but an existing model gives somehow good suggestions: `ner.correct`;
> - continued from an existing model: `ner.teach`.
>
> I think that is the implicit idea of it, but maybe it is easier to keep the current logic for the scientists.
But then, the diagram currently says that extracting a new entity would always require training a model from scratch. Is that intended? I would argue that we would not want to start from scratch each time.
> 1. Yes, it is something we can do.
Thanks!
> 2. I am not sure I understand your question here. `sense2vec` is used to extract a list of patterns that are going to be pre-highlighted with the recipe `ner.manual` (`sense2vec.teach` and `sense2vec.to-patterns`/`terms.to-patterns`). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).
`sense2vec` uses vectors that need to be trained. As said in the Confluence page (i.e. "Can't find seed term 'astrocyte' in vector"), it is expected that the trained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.
> 3. Because if the model is good enough after the first `ner.manual`, or an entity ruler can even do the NER, it is maybe not needed to ask the scientists for more annotations.
I would argue that using `ner.teach` or `ner.correct` is a good (only?) way to make sure the model corresponds to the expectations of the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.
Besides, I would also argue that doing it like this could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.
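One way to make this readiness check concrete is to measure the rejection rate of an annotation session. A minimal sketch, assuming the session export is JSONL records with an `answer` field as Prodigy produces; the 25% threshold and the sample records are illustrative, not a project decision:

```python
import json

def rejection_rate(jsonl_lines):
    """Fraction of model predictions the annotators rejected."""
    answers = [json.loads(line)["answer"] for line in jsonl_lines]
    rejected = sum(1 for a in answers if a == "reject")
    return rejected / len(answers)

# Hypothetical session export: one JSON record per line.
lines = [
    '{"text": "astrocyte", "answer": "accept"}',
    '{"text": "smartphone", "answer": "reject"}',
    '{"text": "neuron", "answer": "accept"}',
    '{"text": "cortex", "answer": "accept"}',
]
rate = rejection_rate(lines)
ready_for_production = rate < 0.25  # threshold is an arbitrary choice
```

A high rate would then block deployment until the model is retrained with more annotations.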
4. The alternative for `no` regarding that box is not well defined yet. Indeed, it is part of the questions on the left. Should we try to create a new model for this case or a binary classifier and how? This part is not decided yet.
Oh. Is it then question 3 in red on the diagram?
Really amazing job! Thanks for trying to write down the exact process.
I have some questions + comments on what I think is missing (if you addressed them already, sorry in advance).
- Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.
- IMO we should also pay attention to how we store the JSONL in some nice and systematic way. Currently, we just dump them all inside of `data_and_models/annotations` and use the filename to encode metadata.
- What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as `entity ruler < our model <= other human`.
Hello @pafonta, @jankrepl,
Thanks for your feedback! (and sorry in advance for my long answer! 😅)
I think you made a similar point concerning starting a model from scratch or from some pretrained one. We can definitely add this possibility to the diagram.
Regarding external resources, I think there are indeed two main levels at which they could help:
> But then, the diagram currently says that extracting a new entity would always require training a model from scratch. Is that intended? I would argue that we would not want to start from scratch each time.
Ideally, if there are models out there, it would be great. But do you think that is going to be the case? I have the feeling we have a bigger chance of finding annotated datasets than trained models. But I agree, we can definitely add this possibility to the diagram.
> `sense2vec` uses vectors that need to be trained. As said in the Confluence page (i.e. "Can't find seed term 'astrocyte' in vector"), it is expected that the trained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.
I am seeing this slightly differently, I think. In my opinion, the first (and biggest) source for creating the desired pattern lists is going to be online resources (ontologies, ...). For me, the `sense2vec` step is really there to help increase the number of patterns.
Moreover, this entire step (=creation of a pattern list) is really here:
However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once or from time to time, not every time we need to train a new entity type. What do you think?
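For reference, a Prodigy patterns file is plain JSONL, so combining ontology terms with `sense2vec` suggestions is mostly list bookkeeping. A sketch with hypothetical terms and label; the `{"lower": ...}` token format is the one Prodigy's pattern files use:

```python
import json

# Patterns from an ontology (primary source) and from sense2vec (extra coverage).
ontology_terms = ["astrocyte", "microglia", "pyramidal neuron"]
sense2vec_terms = ["astrocytes", "glial cell"]

patterns = []
for term in ontology_terms + sense2vec_terms:
    # One token-level pattern per term, pre-highlightable by ner.manual.
    tokens = [{"lower": tok} for tok in term.split()]
    patterns.append({"label": "CELL_TYPE", "pattern": tokens})

jsonl = "\n".join(json.dumps(p) for p in patterns)
```

The resulting file can be passed to `ner.manual` via its `--patterns` argument.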
> I would argue that using `ner.teach` or `ner.correct` is a good (only?) way to make sure the model corresponds to the expectations of the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.
There are pros and cons, in my opinion. If the number of annotations is good enough (always debatable, for sure), splitting the annotations into train and test sets should be enough to have a fair evaluation. I don't think we always need to ask annotators to directly correct the model.
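Such an evaluation only needs a random split of the collected annotations. A minimal sketch; the 80/20 ratio and fixed seed are arbitrary illustrative choices:

```python
import random

def train_test_split(annotations, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then carve off a held-out test set."""
    items = list(annotations)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Hypothetical annotated examples.
annotations = [f"doc-{i}" for i in range(100)]
train, test = train_test_split(annotations)
```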
> Besides, I would also argue that doing it like this could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.
For me, correcting the model through the `ner.teach` and `ner.correct` recipes can be very useful to get a bigger number of annotations (and thus help the training), but it is also already a bit biased, so maybe not the most suitable for creating a test set.
> Oh. Is it then question 3 in red on the diagram?
Yes, it is this question, and maybe from a more general perspective: should we go for a binary classifier or train a new NER model?
> Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.
It is a really good point; we can definitely add a step to integrate this possibility.
> IMO we should also pay attention to how we store the JSONL in some nice and systematic way. Currently, we just dump them all inside of `data_and_models/annotations` and use the filename to encode metadata.
Good observation. I think the decision to go for one entity type per NER model is going to make things easier. But we definitely need to decide on some convention.
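Whatever convention is chosen, it helps if the metadata in the filename stays machine-parseable. A sketch assuming a hypothetical `<entity>_<recipe>_<annotator>.jsonl` scheme, which is not the convention currently used in `data_and_models/annotations`:

```python
from pathlib import Path

def parse_annotation_filename(path):
    """Split a name like 'celltype_ner-manual_alice.jsonl' into metadata.

    The '<entity>_<recipe>_<annotator>.jsonl' scheme is a hypothetical
    convention used only for illustration.
    """
    entity, recipe, annotator = Path(path).stem.split("_")
    return {"entity": entity, "recipe": recipe, "annotator": annotator}

meta = parse_annotation_filename("celltype_ner-manual_alice.jsonl")
```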
> What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as `entity ruler < our model <= other human`.
Yes, that is one of the questions still to be investigated (see question 4 on the diagram). It would be ideal to have this interrater agreement, for sure!
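For the interrater agreement itself, Cohen's kappa over token-level labels is a common starting point. A self-contained sketch; the two annotators' labels are toy data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    n = len(labels_a)
    # Observed agreement: fraction of tokens where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Toy token-level labels from two hypothetical annotators.
a = ["O", "CELL", "O", "O", "CELL", "O"]
b = ["O", "CELL", "O", "CELL", "CELL", "O"]
kappa = cohens_kappa(a, b)
```

The human-human kappa would then serve as the upper bound against which the model and the entity ruler are compared.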
Hello @EmilieDel,
> I would argue that we would not want to start from scratch each time.
> Ideally, if there are models out there, it would be great. But do you think that is going to be the case? I have the feeling we have a bigger chance of finding annotated datasets than trained models. But I agree, we can definitely add this possibility to the diagram.
Most of the available NER models are for our domain, the biomedical domain.
Most of the available annotated datasets are research benchmarks. There is then a high probability that corresponding models are available, even achieving SOTA.
So, yes, I would say that we would have base models in most cases. But of course, that depends on the final list of entities to recognize.
NB: #320 would help us know if an existing model would suit our needs for some entities.
> `sense2vec` uses vectors that need to be trained. As said in the Confluence page (i.e. "Can't find seed term 'astrocyte' in vector"), it is expected that the trained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.
>
> However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once or from time to time, not every time we need to train a new entity type. What do you think?
We would need to re-train each time we add a significant number of papers to the literature database. Indeed, it would help us prevent semantic shift.
Besides, using general pretrained vectors like the ones from Reddit would reinforce the "obvious", while we would want to capture the difficult cases during annotation by experts. One practical side effect is that we could conclude that a rule-based model, using patterns from these general vectors and evaluated on a test set also selected with these vectors, is good for production, while in production it has catastrophic performance. We have seen this, for example, with the need to manually clean up some totally unrelated entities recognized by our models (`data_and_models/annotations/ner/rule_based_patterns.jsonl`), like `smartphone` or `taxi`.
> I would argue that using `ner.teach` or `ner.correct` is a good (only?) way to make sure the model corresponds to the expectations of the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.
>
> There are pros and cons, in my opinion. If the number of annotations is good enough (always debatable, for sure), splitting the annotations into train and test sets should be enough to have a fair evaluation. I don't think we always need to ask annotators to directly correct the model.
> Besides, I would also argue that doing it like this could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.
>
> For me, correcting the model through the `ner.teach` and `ner.correct` recipes can be very useful to get a bigger number of annotations (and thus help the training), but it is also already a bit biased, so maybe not the most suitable for creating a test set.
Not doing `ner.teach` or `ner.correct` but building a test set is a good option too. I was just thinking that, with this option, we would need to put extra effort into building the test set. We can see from the benchmark papers in NLP/NLU that building a good one is not trivial. We had also seen this for the PATHWAY entity in #248, #319, #318.
Biased in which way(s)?
See the new PDF version here: ner_improvementprocess-v1.1.pdf
Check out also the new Confluence page.
This revision will try to address @jankrepl and @pafonta reviews and other points as well.
- The diagram boxes are now numbered (`1.4.2`, etc.).
- Training `sense2vec` vectors is now considered (if most of the seeds are out-of-vocabulary).

@jankrepl and @pafonta, can you have a look to see if we implemented all your requests?
Perfect! Thank you!
Hello @FrancescoCasalegno!
> can you have a look to see if we implemented all your requests?
I feel that my review was taken into account. Thank you!
Not to ask for a change, but more as a comment: I think we could make more use of the active learning feature of Prodigy (`ner.teach`) to get to the best model faster. Indeed, at the moment (workflow v1.1), `ner.teach` is used as a 'last resort' method.
Hi @pafonta,
First of all, thank you again for your feedback.
> we could make more use of the active learning feature of Prodigy (`ner.teach`) to get to the best model faster.
Maybe to better justify our (current) decision to use `ner.teach` only after some annotations have already been collected, I can point out the following:

- `ner.teach` uses active learning in the sense that it chooses samples where the model is not too confident about the prediction; ~~but this model is not updated in the loop as you give your feedback with `ner.teach`,~~ so it probably makes sense to start with a model that already understands something. Edit: correction, see https://github.com/BlueBrain/Search/issues/342#issuecomment-825738132
- `ner.teach` collects only `Yes`/`No` feedback from the user; it is apparently possible to train the model using `prodigy` and this kind of annotations (something like `prodigy train ner --binary ...`), but I am not sure the same kind of `--binary` training is possible with `spaCy` (and we are leaving `prodigy train` in favor of `spacy train`).
- `Yes`/`No` feedback is faster to give for the user, but carries less information than fully annotated samples, so I think it makes sense only once the bulk of the learning has already taken place.
- There is a `ner.silver-to-gold` command to generate fully annotated samples from the binary ones; but if there are too many `No`s, then this step will take a long time.

In any case, consider that `ner.teach` has never been used in the past to collect annotations for Blue Brain Search, so I would say that as soon as we execute this new process with our users, we'll also be able to improve and modify it based on what we find out :)
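For illustration, the sample selection `ner.teach` performs amounts to uncertainty sampling, i.e. preferring predictions whose scores are closest to 0.5. A minimal sketch in plain Python; the texts and scores are made up, not real model outputs:

```python
def most_uncertain(scored_examples, k=2):
    """Pick the k examples whose predicted probability is closest to 0.5."""
    return sorted(scored_examples, key=lambda ex: abs(ex[1] - 0.5))[:k]

# (candidate span, model probability that the span is an entity)
scored = [
    ("astrocyte", 0.97),
    ("axon hillock", 0.51),
    ("taxi", 0.03),
    ("glial scar", 0.45),
]
picked = most_uncertain(scored)
```

Confident predictions (very high or very low scores) are skipped, so every annotator click lands on a case the model actually needs help with.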
Hello @FrancescoCasalegno,
> First of all, thank you again for your feedback.

:)
Thank you for the detailed clarifications!
The points make sense.
I have, however, a different understanding of the two following points.
> `ner.teach` uses active learning in the sense that it chooses samples where the model is not too confident about the prediction; but this model is not updated in the loop as you give your feedback with `ner.teach`, so it probably makes sense to start with a model that already understands something
The model is updated in the loop, according to the documentation and the code of `prodigy.recipes.ner.teach` (i.e. calls to `model.update`).
> I am not sure the same kind of `--binary` training is possible with `spaCy` (and we are leaving `prodigy train` in favor of `spacy train`)
Couldn't `ner.silver-to-gold` convert these Yes/No into regular annotations, which are then usable with `spacy train`? See 'use case' in https://prodi.gy/docs/recipes#ner-silver-to-gold.
Hello @pafonta,
> The model is updated in the loop, according to the documentation and the code of `prodigy.recipes.ner.teach` (i.e. calls to `model.update`).
I think you are right in fact! I will correct my comment above with a strikethrough.
> Couldn't `ner.silver-to-gold` convert these Yes/No into regular annotations [...]?
Yes, absolutely! This is indeed what we do in our process:

But notice that this `ner.silver-to-gold` recipe still requires manual intervention; see more details here.
First version done. Should there be any ideas to improve the process, we'll create a dedicated issue and upgrade the version.
Context
As we have been requested, it is of the highest importance not only that our NER models improve their accuracy, but also that we implement features and define processes that make it as seamless as possible to improve our NER models, by allowing users to address the two following use cases.
Ideas for this process
- `prodigy` process here:
- (`MAMMAL` is a sub-type of `ANIMAL`)
- `EntityRuler`?

Actions