BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

First draft for NER models improvement processes #342

Closed FrancescoCasalegno closed 3 years ago

FrancescoCasalegno commented 3 years ago

Context

As we have been requested, it is of the highest importance not only that our NER models improve in accuracy, but also that we implement features and define processes that make it as seamless as possible to improve them, by allowing users to address the following two use cases.

  1. Add support for new entity types.
  2. Correct errors observed in predictions.

Ideas for this process

Actions

FrancescoCasalegno commented 3 years ago

The following resources have been created.

EmilieDel commented 3 years ago

The following resources have been created.

@Stannislav @pafonta @jankrepl Feel free to review those resources and give feedback.

pafonta commented 3 years ago

Hello @EmilieDel and @FrancescoCasalegno,

Feel free to review those resources and give feedback.

Awesome work on coming up with a defined workflow (PDF diagram) and a how-to (Confluence page)!


Regarding the questions in the PDF diagram on ner.manual / ner.correct vs ner.teach:

Why not distinguish them by their usage?

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.
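
To make the distinction concrete on the spaCy side, here is a minimal sketch (en_core_web_sm is just an example model, not necessarily the one we use):

```python
import spacy

# Case 1 (ner.manual): annotating from scratch only needs tokenization,
# so a blank pipeline is sufficient.
blank_nlp = spacy.blank("en")

# Cases 2 and 3 (ner.correct / ner.teach): both rely on a pipeline that
# already has a trained "ner" component to suggest or score entities.
pretrained_nlp = spacy.load("en_core_web_sm")  # example model (assumption)
assert pretrained_nlp.has_pipe("ner")
```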

Besides, I would have the following questions:

  1. Maybe we could number the diagram boxes to help us refer to them?

  2. If we use sense2vec (done in the Confluence counterpart), I think that training / improving such a model for neuroscience should be part of the diagram (PDF).

  3. Why make the steps ner.teach and ner.correct optional before deploying a NER model in production?

  4. The box "Can we define an explicit pattern to fix the error with a Rule Based NER step?" has no alternative for a no answer. If that's intentional, maybe it doesn't make sense to have that box at all.

EmilieDel commented 3 years ago

Hello @pafonta,

Maybe @FrancescoCasalegno can answer better but I can say:

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.

  1. Yes, it is something we can do.
  2. I am not sure I understand your question here. sense2vec is used to extract a list of patterns that are going to be pre-highlighted with the recipe ner.manual (sense2vec.teach and sense2vec.to-patterns/terms.to-patterns). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).
  3. Because if the model is good enough after the first ner.manual, or if even an entity ruler can handle the NER, it may not be necessary to ask scientists for more annotations (see the entity ruler sketch after this list).
  4. The alternative for no regarding that box is not well defined yet. Indeed, it is part of the questions on the left: should we try to create a new model for this case or a binary classifier, and how? This part is not decided yet.
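
For the entity ruler alternative mentioned in point 3, a minimal spaCy sketch could look like the following (the label and patterns are illustrative assumptions, not our actual entity types):

```python
import spacy

# Rule-based NER: no statistical model, only patterns.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "CELL_TYPE", "pattern": "astrocyte"},  # exact phrase match
    {"label": "CELL_TYPE", "pattern": [{"LOWER": "purkinje"}, {"LOWER": "cell"}]},
])

doc = nlp("The astrocyte density differs from that of a Purkinje cell.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('astrocyte', 'CELL_TYPE'), ('Purkinje cell', 'CELL_TYPE')]
```
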
pafonta commented 3 years ago

Hello @EmilieDel !

Thank you for your answers.

So, when the training is:

  1. from scratch without any existing model: ner.manual.
  2. from scratch, but an existing model gives reasonably good suggestions: ner.correct.
  3. continued from an existing model: ner.teach.

I think that is already the implicit idea, but maybe it is easier to keep the current logic for the scientists.

But then, the diagram currently says that extracting a new entity would always require training a model from scratch.

Is that intended? I would argue that we do not want to start from scratch each time.

1. Yes, it is something we can do.

Thanks!

2. I am not sure I understand your question here. `sense2vec` is used to extract a list of patterns that are going to be pre-highlighted with the recipe `ner.manual` (`sense2vec.teach` and `sense2vec.to-patterns`/`terms.to-patterns`). There is no real training in this step. Moreover, this step is also represented in the diagram (it is the first blue block of the first use case).

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one (see the sketch below).
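
A minimal sketch of this issue, assuming the standalone sense2vec package and a downloaded general-domain vectors archive (the path is a placeholder):

```python
from sense2vec import Sense2Vec

# Load pretrained, general-domain sense2vec vectors (e.g. Reddit-based).
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")  # placeholder path

# Check whether a neuroscience seed term is present at all.
best_key = s2v.get_best_sense("astrocyte")
if best_key is None:
    print("Seed term 'astrocyte' is missing from the pretrained vectors.")
else:
    # Even if the term exists, its neighbours may reflect a non-neuroscience meaning.
    print(s2v.most_similar([best_key], n=5))
```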

3. Because if the model is good enough after the first `ner.manual`, or if even an entity ruler can handle the NER, it may not be necessary to ask scientists for more annotations.

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

4. The alternative for `no` regarding that box is not well defined yet. Indeed, it is part of the questions on the left: should we try to create a new model for this case or a binary classifier, and how? This part is not decided yet.

Oh. Is it then the question 3 in red on the diagram?

jankrepl commented 3 years ago

Really amazing job! Thanks for trying to write down the exact process.

I have some questions + comments on what I think is missing (if you addressed them already, sorry in advance).

  1. Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.
  2. IMO we should also pay attention to how we store the JSONL files in some nice and systematic way. Currently, we just dump them all inside data_and_models/annotations and use the filename to encode metadata.
  3. What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as entity ruler < our model <= other human (see the sketch after this list).
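
For point 3, one simple way to estimate interrater agreement would be token-level Cohen's kappa over the entity labels assigned by two annotators (a sketch; the labels below are made up):

```python
from sklearn.metrics import cohen_kappa_score

# Per-token entity labels ("O" = outside any entity) from two annotators
# on the same tokens; the values are made up for illustration.
annotator_1 = ["O", "B-CELL_TYPE", "O", "O", "B-BRAIN_REGION", "I-BRAIN_REGION"]
annotator_2 = ["O", "B-CELL_TYPE", "O", "O", "O", "B-BRAIN_REGION"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Token-level Cohen's kappa: {kappa:.2f}")
```
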
EmilieDel commented 3 years ago

Hello @pafonta, @jankrepl,

Thanks for your feedback! (and sorry in advance for my long answer! 😅)

I think you both made a similar point concerning starting from a blank model vs. from a pretrained one. We can definitely add this possibility to the diagram.

Regarding external resources, I think there are indeed two main levels at which they could help:


But then, the diagram currently says that extracting a new entity would always require training a model from scratch. Is that intended? I would argue that we do not want to start from scratch each time.

Ideally, if there are models available out there, it would be great. But do you think that is going to be the case? I have the feeling we have a better chance of finding annotated datasets than already trained models. But I agree, we can definitely add this possibility to the diagram.

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.

I see this slightly differently, I think. In my opinion, the first (and biggest) source for creating the desired pattern lists is going to be online resources (ontologies, ...). For me, the sense2vec step is really here to help increase the number of patterns. Moreover, this entire step (= creation of a pattern list) is really here:

However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once, or maybe from time to time, not every time we need to train a new entity type. What do you think?

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

There are pros and cons in my opinion. If the number of annotations is large enough (always debatable, for sure), splitting the annotations into train and test sets should be enough for a fair evaluation. I don't think we always need to ask annotators to directly correct the model.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

For me, correcting the model through the ner.teach and ner.correct recipes can be very useful for getting a larger number of annotations (and thus helping the training), but it is also already a bit biased, so maybe not the most suitable way to create a test set.

Oh. Is it then the question 3 in red on the diagram?

Yes, it is this question, and maybe from a more general perspective: should we go for a binary classifier or train a new NER model?


Maybe unrealistic, but when we are given a new entity type, manual annotation is not the only option. We can actually check online whether there are supervised datasets publicly available containing that entity type. Note that this is very much related to starting from some pretrained model vs a blank model.

It is a really good point; we can definitely add a step to integrate this possibility.

IMO we should also pay attention to how we store the JSONL files in some nice and systematic way. Currently, we just dump them all inside data_and_models/annotations and use the filename to encode metadata.

Good observation. I think the decision to go with one entity type per NER model is going to make things easier. But we definitely need to decide on some convention (a possible one is sketched below).
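
One possible convention (purely hypothetical, nothing decided in this thread) would be to organize files per entity type and keep the metadata inside each JSONL record rather than only in the filename:

```python
import json
from pathlib import Path

# Hypothetical re-organization: one folder per entity type, metadata stored
# in each record's "meta" field instead of being encoded in the filename.
src = Path("data_and_models/annotations/annotations.jsonl")  # placeholder input
dst = Path("data_and_models/annotations/cell_type/annotations_v1.jsonl")
dst.parent.mkdir(parents=True, exist_ok=True)

with src.open() as fin, dst.open("w") as fout:
    for line in fin:
        record = json.loads(line)
        record.setdefault("meta", {}).update(
            {"entity_type": "CELL_TYPE", "annotator": "annotator_1", "version": 1}
        )
        fout.write(json.dumps(record) + "\n")
```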

What about the interrater agreement? I guess it would be nice to have an upper bound on what the performance of the model could be. Also, it would be nice to present our results as entity ruler < our model <= other human.

Yes, that is one of the questions still to be investigated (see question 4 on the diagram). It would be ideal to have this interrater agreement for sure!

pafonta commented 3 years ago

Hello @EmilieDel,

I would argue that we do not want to start from scratch each time.

Ideally, if there are models available out there, it would be great. But do you think that is going to be the case? I have the feeling we have a better chance of finding annotated datasets than already trained models. But I agree, we can definitely add this possibility to the diagram.

Most of the available NER models are for our domain, the biomedical domain.

Most of the available annotated datasets are research benchmarks. There is therefore a high probability that corresponding models are available, some even achieving SOTA.

So, yes, I would say that we would have base models in most cases. But of course, that depends on the final list of entities to recognize.

NB: #320 would help us know whether an existing model would suit our needs for some entities.

sense2vec uses vectors that need to be trained. As mentioned on the Confluence page (i.e. Can't find seed term 'astrocyte' in vector), it is expected that the pretrained vectors are not relevant for our use case. Indeed, some words could be missing from the available vectors or have a meaning different from their neuroscience one.

However, it is a really good point that training the vectors could be needed/useful and should be considered. But I imagine it is going to take much longer and be done once, or maybe from time to time, not every time we need to train a new entity type. What do you think?

We would need to re-train each time we add a significant number of papers to the literature database. Indeed, it would help us prevent semantic shift.

Besides, using general pretrained vectors like the ones from Reddit would reinforce the "obvious", while we would want to capture the difficult cases during annotation by experts. One practical side-effect is that we could conclude that a rule-based model, using patterns from these general vectors and evaluated on a test set also selected with these vectors, is good for production, while in production it has catastrophic performance. We have seen this, for example, with the need to manually clean up some totally unrelated entities recognized by our models (data_and_models/annotations/ner/rule_based_patterns.jsonl), like smartphone or taxi (a clean-up sketch is shown below).
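
A sketch of such a manual clean-up, assuming the file uses the usual spaCy/Prodigy JSONL pattern format and using a made-up blocklist of unrelated terms:

```python
import json
from pathlib import Path

patterns_file = Path("data_and_models/annotations/ner/rule_based_patterns.jsonl")
blocklist = {"smartphone", "taxi"}  # made-up blocklist of unrelated terms

def pattern_text(pattern):
    # A pattern is either a plain string or a list of token specs like {"LOWER": "taxi"}.
    if isinstance(pattern, str):
        return pattern.lower()
    return " ".join(str(next(iter(token.values()), "")).lower() for token in pattern)

kept = []
for line in patterns_file.read_text().splitlines():
    entry = json.loads(line)
    if pattern_text(entry["pattern"]) not in blocklist:
        kept.append(entry)

patterns_file.write_text("\n".join(json.dumps(entry) for entry in kept) + "\n")
```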

I would argue that using ner.teach or ner.correct is a good (only?) way to make sure the model corresponds to the expectations from the scientists. Indeed, if the scientists have to reject a lot of the model predictions, then the model is not ready to be deployed in production.

There are pros and cons in my opinion. If the number of annotations is large enough (always debatable, for sure), splitting the annotations into train and test sets should be enough for a fair evaluation. I don't think we always need to ask annotators to directly correct the model.

Besides, I would also argue that doing it this way could bring a neat solution to part of the questions about the test set (questions 4 and 5 in red on the diagram) while improving the underlying model.

For me, correcting the model through the ner.teach and ner.correct recipes can be very useful for getting a larger number of annotations (and thus helping the training), but it is also already a bit biased, so maybe not the most suitable way to create a test set.

Not doing ner.teach or ner.correct but building a test set is a good option too. I was just thinking that with this option we would need to put extra effort into building the test set. We can see from benchmark papers in NLP/NLU that building a good one is not trivial. We also saw this for the PATHWAY entity in #248, #319, #318.

Biased in which way(s)?

FrancescoCasalegno commented 3 years ago

Version 1.1

New version

See the new PDF version here: ner_improvementprocess-v1.1.pdf

Check out also the new Confluence page.

What's New?

This revision tries to address @jankrepl's and @pafonta's reviews, as well as other points.

FrancescoCasalegno commented 3 years ago

@jankrepl and @pafonta can you have a look to see if we implemented all your requests?

jankrepl commented 3 years ago

@jankrepl and @pafonta can you have a look to see if we implemented all your requests?

Perfect! Thank you!

pafonta commented 3 years ago

Hello @FrancescoCasalegno!

can you have a look to see if we implemented all your requests?

I feel that my review was taken into account. Thank you!

Not asking for a change, more of a comment: I think we could make more use of the active learning feature of Prodigy (ner.teach) to get to the best model faster. Indeed, at the moment (workflow v1.1), ner.teach is used as a 'last resort' method.

FrancescoCasalegno commented 3 years ago

Hi @pafonta,

First of all, thank you again for your feedback.

we could make more use of the active learning feature of Prodigy (ner.teach) to get to the best model faster.

Maybe to better justify our (current) decision to use ner.teach only after some annotations have already been collected, I can highlight the following points.

In any case, consider that ner.teach has never been used in the past to collect annotations for Blue Brain Search, so I would say that as soon as we execute this new process with our users, we'll also be able to improve and modify it based on what we find out :)

pafonta commented 3 years ago

Hello @FrancescoCasalegno,

First of all, thank you again for your feedback.

:)

Thank you for the detailed clarifications!

The points make sense.

I have, however, a different understanding of the following two points.

ner.teach uses active learning in the sense that it chooses samples where the model is not too confident about the prediction; but this model is not updated in the loop as you give your feedback with ner.teach, so it probably makes sense to start with a model that already understands something

The model is updated in the loop, according to the documentation and the code of prodigy.recipes.ner.teach (i.e. calls to model.update).

I am not sure the same kind of --binary training is possible with spaCy (and we are moving away from prodigy train in favor of spacy train)

Couldn't ner.silver-to-gold convert these Yes/No into regular annotations, which are then usable with spacy train? See 'use case' in https://prodi.gy/docs/recipes#ner-silver-to-gold.
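
For reference, a rough sketch of how the gold annotations produced by ner.silver-to-gold could be pulled from the Prodigy database and converted to a DocBin for spacy train (the dataset name is a placeholder, and the exact Prodigy database API may differ between versions):

```python
import spacy
from spacy.tokens import DocBin
from prodigy.components.db import connect

nlp = spacy.blank("en")
db = connect()
examples = db.get_dataset("ner_gold_dataset")  # placeholder dataset name

doc_bin = DocBin()
for eg in examples:
    if eg.get("answer") != "accept":
        continue
    doc = nlp(eg["text"])
    spans = [
        doc.char_span(span["start"], span["end"], label=span["label"])
        for span in eg.get("spans", [])
    ]
    doc.ents = [span for span in spans if span is not None]
    doc_bin.add(doc)

# Then: spacy train config.cfg --paths.train train.spacy ...
doc_bin.to_disk("train.spacy")
```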

FrancescoCasalegno commented 3 years ago

Hello @pafonta,

The model is updated in the loop, according to the documentation and the code of prodigy.recipes.ner.teach (i.e. calls to model.update).

I think you are right in fact! I will correct my comment above with a strikethrough.

Couldn't ner.silver-to-gold convert these Yes/No into regular annotations [...]?

Yes absolutely! This is indeed what we do in our Process:

(Screenshot illustrating this step of the Process, 2021-04-26.)

But note that this ner.silver-to-gold recipe still requires manual intervention; see more details here.

FrancescoCasalegno commented 3 years ago

First version: done. Should there be any ideas to improve the process, we'll create a dedicated issue and bump the version.