BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

First draft for NER models improvement processes #342

Closed FrancescoCasalegno closed 3 years ago

FrancescoCasalegno commented 3 years ago

Context

As we have been requested, it is of the highest importance not only that our NER models improve in accuracy, but also that we implement features and define processes that make it as seamless as possible to improve them, by allowing users to address the following two use cases.

  1. Add support for new entity types.
  2. Correct errors observed in predictions.

Ideas for this process

Actions

FrancescoCasalegno commented 3 years ago

The following resources have been created.

EmilieDel commented 3 years ago

The following resources have been created.

@Stannislav @pafonta @jankrepl Feel free to review those resources and give feedback.

pafonta commented 3 years ago

Hello @EmilieDel and @FrancescoCasalegno,

Feel free to review those resources and give feedback.

Awesome work on coming up with a defined workflow (PDF diagram) and a how-to (Confluence page)!


Regarding the questions in the PDF diagram on ner.manual / ner.correct vs ner.teach:

Why not distinguish them by their usage?

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.
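
To make the distinction concrete on the spaCy side, here is a minimal sketch (en_core_web_sm is just an example model, not necessarily the one we use):

```python
import spacy

# Case 1 (ner.manual): annotating from scratch only needs tokenization,
# so a blank pipeline is sufficient.
blank_nlp = spacy.blank("en")

# Cases 2 and 3 (ner.correct / ner.teach): both rely on a pipeline that
# already has a trained "ner" component to suggest or score entities.
pretrained_nlp = spacy.load("en_core_web_sm")  # example model (assumption)
assert pretrained_nlp.has_pipe("ner")
```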

Besides, I would have the following questions:

  1. Maybe we could number the diagram boxes to help us refer to them?

  2. If we use sense2vec (done in the Confluence counterpart), I think that training / improving such a model for neuroscience should be part of the diagram (PDF).

  3. Why make the steps ner.teach and ner.correct optional before deploying a NER model in production?

  4. The box "Can we define an explicit pattern to fix the error with a Rule Based NER step?" has no alternative for a no answer. If that's intentional, maybe it doesn't make sense to have that box at all.

EmilieDel commented 3 years ago

Hello @pafonta,

Maybe @FrancescoCasalegno can answer better but I can say:

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.

  1. Yes, it is something we can do.
  2. I am not sure I understand your question here. sense2vec is used to extract a list of patterns that are going to be pre-highlighted with the recipe ner.manual (sense2vec.teach and sense2vec.to-patterns/terms.to-patterns). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).
  3. Because if the model is good enough after the first ner.manual, or if even an entity ruler can handle the NER, it may not be necessary to ask scientists for more annotations (see the entity ruler sketch after this list).
  4. The alternative for no regarding that box is not well defined yet. Indeed, it is part of the questions on the left: should we try to create a new model for this case or a binary classifier, and how? This part is not decided yet.
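
For the entity ruler alternative mentioned in point 3, a minimal spaCy sketch could look like the following (the label and patterns are illustrative assumptions, not our actual entity types):

```python
import spacy

# Rule-based NER: no statistical model, only patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "CELL_TYPE", "pattern": "astrocyte"},  # exact phrase match
    {"label": "CELL_TYPE", "pattern": [{"LOWER": "purkinje"}, {"LOWER": "cell"}]},
])

doc = nlp("The astrocyte density differs from that of a Purkinje cell.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('astrocyte', 'CELL_TYPE'), ('Purkinje cell', 'CELL_TYPE')]
```
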
pafonta commented 3 years ago

Hello @EmilieDel !

Thank you for your answers.

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.

I think that is already the implicit idea, but maybe it is easier to keep the current logic for the scientists.

But then, the diagram currently says that extracting a new entity would always require training a model from scratch.

Is that intended? I would argue that we do not want to start from scratch each time.

1. Yes, it is something we can do.

Thanks!

2. I am not sure I understand your question here. `sense2vec` is used to extract a list of patterns that are going to be pre-highlighted with the recipe `ner.manual` (`sense2vec.teach` and `sense2vec.to-patterns`/`terms.to-patterns`). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one (see the sketch below).
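
A minimal sketch of this issue, assuming the standalone sense2vec package and a downloaded general-domain vectors archive (the path is a placeholder):

```python
from sense2vec import Sense2Vec

# Load pretrained, general-domain sense2vec vectors (e.g. Reddit-based).
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")  # placeholder path

# Check whether a neuroscience seed term is present at all.
best_key = s2v.get_best_sense("astrocyte")
if best_key is None:
    print("Seed term 'astrocyte' is missing from the pretrained vectors.")
else:
    # Even if the term exists, its neighbours may reflect a non-neuroscience meaning.
    print(s2v.most_similar([best_key], n=5))
```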

3. Because if the model is good enough after the first `ner.manual`, or if even an entity ruler can handle the NER, it may not be necessary to ask scientists for more annotations.

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

4. The alternative for `no` regarding that box is not well defined yet. Indeed, it is part of the questions on the left: should we try to create a new model for this case or a binary classifier, and how? This part is not decided yet.

Oh. Is it then the question 3 in red on the diagram?

jankrepl commented 3 years ago

Really amazing job! Thanks for trying to write down the exact process.

I have some questions + comments on what I think is missing (if you addressed them already, sorry in advance).

  1. Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.
  2. IMO we should also pay attention to how we store the JSONL files in some nice and systematic way. Currently, we just dump them all inside data_and_models/annotations and use the filename to encode metadata.
  3. What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as entity ruler < our model <= other human (see the sketch after this list).
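
For point 3, one simple way to estimate interrater agreement would be token-level Cohen's kappa over the entity labels assigned by two annotators (a sketch; the labels below are made up):

```python
from sklearn.metrics import cohen_kappa_score

# Per-token entity labels ("O" = outside any entity) from two annotators
# on the same tokens; the values are made up for illustration.
annotator_1 = ["O", "B-CELL_TYPE", "O", "O", "B-BRAIN_REGION", "I-BRAIN_REGION"]
annotator_2 = ["O", "B-CELL_TYPE", "O", "O", "O", "B-BRAIN_REGION"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Token-level Cohen's kappa: {kappa:.2f}")
```
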
EmilieDel commented 3 years ago

Hello @pafonta, @jankrepl,

Thanks for your feedback! (and sorry in advance for my long answer! 😅)

I think you both made a similar point concerning starting from a blank model vs. from a pretrained one. We can definitely add this possibility to the diagram.

Regarding external resources, I think there are indeed two main levels at which they could help:


But then, the diagram currently says that extracting a new entity would always require training a model from scratch. Is that intended? I would argue that we do not want to start from scratch each time.

Ideally, if there are models available out there, it would be great. But do you think that is going to be the case? I have the feeling we have a better chance of finding annotated datasets than already trained models. But I agree, we can definitely add this possibility to the diagram.

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.

I see this slightly differently, I think. In my opinion, the first (and biggest) source for creating the desired pattern lists is going to be online resources (ontologies, ...). For me, the sense2vec step is really here to help increase the number of patterns. Moreover, this entire step (= creation of a pattern list) is really here:

However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once, or maybe from time to time, not every time we need to train a new entity type. What do you think?

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

There are pros and cons in my opinion. If the number of annotations is large enough (always debatable, for sure), splitting the annotations into train and test sets should be enough for a fair evaluation. I don't think we always need to ask annotators to directly correct the model.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

For me, correcting the model through the ner.teach and ner.correct recipes can be very useful for getting a larger number of annotations (and thus helping the training), but it is also already a bit biased, so maybe not the most suitable way to create a test set.

Oh. Is it then the question 3 in red on the diagram?

Yes, it is this question, and maybe from a more general perspective: should we go for a binary classifier or train a new NER model?


Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.

It is a really good point; we can definitely add a step to integrate this possibility.

IMO we should also pay attention to how we store the JSONL files in some nice and systematic way. Currently, we just dump them all inside data_and_models/annotations and use the filename to encode metadata.

Good observation. I think the decision to go with one entity type per NER model is going to make things easier. But we definitely need to decide on some convention (a possible one is sketched below).
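
One possible convention (purely hypothetical, nothing decided in this thread) would be to organize files per entity type and keep the metadata inside each JSONL record rather than only in the filename:

```python
import json
from pathlib import Path

# Hypothetical re-organization: one folder per entity type, metadata stored
# in each record's "meta" field instead of being encoded in the filename.
src = Path("data_and_models/annotations/annotations.jsonl")  # placeholder input
dst = Path("data_and_models/annotations/cell_type/annotations_v1.jsonl")
dst.parent.mkdir(parents=True, exist_ok=True)

with src.open() as fin, dst.open("w") as fout:
    for line in fin:
        record = json.loads(line)
        record.setdefault("meta", {}).update(
            {"entity_type": "CELL_TYPE", "annotator": "annotator_1", "version": 1}
        )
        fout.write(json.dumps(record) + "\n")
```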

What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as entity ruler < our model <= other human.

Yes, that is one of the questions still to be investigated (see question 4 on the diagram). It would be ideal to have this interrater agreement for sure!

pafonta commented 3 years ago

Hello @EmilieDel,

I would argue that we do not want to start from scratch each time.

Ideally, if there are models available out there, it would be great. But do you think that is going to be the case? I have the feeling we have a better chance of finding annotated datasets than already trained models. But I agree, we can definitely add this possibility to the diagram.

Most of the available NER models are for our domain, the biomedical domain.

Most of the available annotated datasets are research benchmarks. There is therefore a high probability that corresponding models are available, some even achieving SOTA.

So, yes, I would say that we would have base models in most cases. But of course, that depends on the final list of entities to recognize.

NB: #320 would help us know whether an existing model would suit our needs for some entities.

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.

However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once, or maybe from time to time, not every time we need to train a new entity type. What do you think?

We would need to re-train each time we add a significant number of papers to the literature database. Indeed, it would help us prevent semantic shift.

Besides, using general pretrained vectors like the ones from Reddit would reinforce the "obvious", while we would want to capture the difficult cases during annotation by experts. One practical side-effect is that we could conclude that a rule-based model, using patterns from these general vectors and evaluated on a test set also selected with these vectors, is good for production, while in production it has catastrophic performance. We have seen this, for example, with the need to manually clean up some totally unrelated entities recognized by our models (data_and_models/annotations/ner/rule_based_patterns.jsonl), like smartphone or taxi (a clean-up sketch is shown below).
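
A sketch of such a manual clean-up, assuming the file uses the usual spaCy/Prodigy JSONL pattern format and using a made-up blocklist of unrelated terms:

```python
import json
from pathlib import Path

patterns_file = Path("data_and_models/annotations/ner/rule_based_patterns.jsonl")
blocklist = {"smartphone", "taxi"}  # made-up blocklist of unrelated terms

def pattern_text(pattern):
    # A pattern is either a plain string or a list of token specs like {"LOWER": "taxi"}.
    if isinstance(pattern, str):
        return pattern.lower()
    return " ".join(str(next(iter(token.values()), "")).lower() for token in pattern)

kept = []
for line in patterns_file.read_text().splitlines():
    entry = json.loads(line)
    if pattern_text(entry["pattern"]) not in blocklist:
        kept.append(entry)

patterns_file.write_text("\n".join(json.dumps(entry) for entry in kept) + "\n")
```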

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

There are pros and cons in my opinion. If the number of annotations is large enough (always debatable, for sure), splitting the annotations into train and test sets should be enough for a fair evaluation. I don't think we always need to ask annotators to directly correct the model.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

For me, correcting the model through the ner.teach and ner.correct recipes can be very useful for getting a larger number of annotations (and thus helping the training), but it is also already a bit biased, so maybe not the most suitable way to create a test set.

Not doing ner.teach or ner.correct but building a test set is a good option too. I was just thinking that with this option we would need to put extra effort into building the test set. We can see from benchmark papers in NLP/NLU that building a good one is not trivial. We also saw this for the PATHWAY entity in #248, #319, #318.

Biased in which way(s)?

FrancescoCasalegno commented 3 years ago

Version 1.1

New version

See the new PDF version here: ner_improvementprocess-v1.1.pdf

Check out also the new Confluence page.

What's New?

This revision tries to address @jankrepl's and @pafonta's reviews, as well as other points.

FrancescoCasalegno commented 3 years ago

@jankrepl and @pafonta can you have a look to see if we implemented all your requests?

jankrepl commented 3 years ago

@jankrepl and @pafonta can you have a look to see if we implemented all your requests?

Perfect! Thank you!

pafonta commented 3 years ago

Hello @FrancescoCasalegno!

can you have a look to see if we implemented all your requests?

I feel that my review was taken into account. Thank you!

Not asking for a change, more of a comment: I think we could make more use of the active learning feature of Prodigy (ner.teach) to get to the best model faster. Indeed, at the moment (workflow v1.1), ner.teach is used as a 'last resort' method.

FrancescoCasalegno commented 3 years ago

Hi @pafonta,

First of all, thank you again for your feedback.

we could make more use of the active learning feature of Prodigy (ner.teach) to get to the best model faster.

Maybe to better justify our (current) decision to use ner.teach only after some annotations have already been collected, I can highlight the following points.

In any case, consider that ner.teach has never been used in the past to collect annotations for Blue Brain Search, so I would say that as soon as we execute this new process with our users, we'll also be able to improve and modify it based on what we find out :)

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

First of all, thank you again for your feedback.

:)

Thank you for the detailed clarifications!

The points make sense.

I have, however, a different understanding of the following two points.

ner.teach uses active learning in the sense that it chooses samples where the model is not too confident about the prediction; but this model is not updated in the loop as you give your feedback with ner.teach, so it probably makes sense to start with a model that already understands something

The model is updated in the loop, according to the documentation and the code of prodigy.recipes.ner.teach (i.e. calls to model.update).

I am not sure the same kind of --binary training is possible with spaCy (and we are moving away from prodigy train in favor of spacy train)

Couldn't ner.silver-to-gold convert these Yes/No into regular annotations, which are then usable with spacy train? See 'use case' in https://prodi.gy/docs/recipes#ner-silver-to-gold.
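
For reference, a rough sketch of how the gold annotations produced by ner.silver-to-gold could be pulled from the Prodigy database and converted to a DocBin for spacy train (the dataset name is a placeholder, and the exact Prodigy database API may differ between versions):

```python
import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect

nlp = spacy.blank("en")
db = connect()
examples = db.get_dataset("ner_gold_dataset")  # placeholder dataset name

doc_bin = DocBin()
for eg in examples:
    if eg.get("answer") != "accept":
        continue
    doc = nlp(eg["text"])
    spans = [
        doc.char_span(span["start"], span["end"], label=span["label"])
        for span in eg.get("spans", [])
    ]
    doc.ents = [span for span in spans if span is not None]
    doc_bin.add(doc)

# Then: spacy train config.cfg --paths.train train.spacy ...
doc_bin.to_disk("train.spacy")
```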

FrancescoCasalegno commented 3 years ago

Hello @pafonta,

The model is updated in the loop, according to the documentation and the code of prodigy.recipes.ner.teach (i.e. calls to model.update).

I think you are right in fact! I will correct my comment above with a strikethrough.

Couldn't ner.silver-to-gold convert these Yes/No into regular annotations [...]?

Yes absolutely! This is indeed what we do in our Process:

(Screenshot illustrating this step of the Process, 2021-04-26.)

But note that this ner.silver-to-gold recipe still requires manual intervention; see more details here.

FrancescoCasalegno commented 3 years ago

First version: done. Should there be any ideas to improve the process, we'll create a dedicated issue and bump the version.