FrancescoCasalegno commented 2 years ago

Context

In #605 we tried to improve our NER model by leveraging the NER-annotated sentences generated from the Ontology.
However, that didn't work because the quality of the annotations was too poor.
Instead, we can think of using any of the publicly available datasets for NER with biological entity types.
For instance, we can look at (but are not limited to) the corpora used by SciSpaCy to train their models: CRAFT, JNLPBA, BC5CDR, BIONLP13CG.

Actions

[ ] Find publicly available annotated NER datasets that cover some of the entity types we also want.
[ ] Think about how to handle the fact that some of those datasets may not have annotations for all our entity types, and at the same time they may have annotations for entity types we do not care about.
- use label -100 to mask tokens so the torch loss function ignores them?
- train one NER model per entity type? but then how to resolve conflicts?
[ ] See if learning curves show higher performance (e.g. better intercept + same slope) than what we got in #601 and #602. To have comparable results, we could do the following (?):
- train on 1/8 of our data + all external datasets
- train on 2/8 of our data + all external datasets
- train on 4/8 of our data + all external datasets
- train on 8/8 of our data + all external datasets
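
The four training sets listed above could be built as nested subsets, so that each larger fraction contains the smaller ones and the points of the learning curve stay comparable. A minimal sketch, assuming samples are held in plain Python lists; `build_learning_curve_sets` and its parameters are hypothetical names, not something taken from the repo:

```python
import random


def build_learning_curve_sets(internal, external, eighths=(1, 2, 4, 8), seed=0):
    """Return {k: training list} using k/8 of internal data + all external data."""
    shuffled = list(internal)
    # Shuffle once so every fraction is a prefix of the same ordering.
    random.Random(seed).shuffle(shuffled)
    sets = {}
    for k in eighths:
        n_internal = len(shuffled) * k // 8
        sets[k] = shuffled[:n_internal] + list(external)
    return sets


# Toy data standing in for the real samples.
internal = [f"int_{i}" for i in range(16)]
external = [f"ext_{i}" for i in range(5)]
sets = build_learning_curve_sets(internal, external)
print({k: len(v) for k, v in sets.items()})
```

Fixing the seed keeps the internal subsets identical across runs, which matters if the curves are to be compared with those from #601 and #602.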

jankrepl commented 2 years ago

List of relevant datasets: https://corposaurus.github.io/corpora/
Literature on how to handle partially annotated datasets in NER:
- They basically treat the problem as multi-label classification (each entity type will have a separate binary classifier) arXiv researchgate
- More theoretical, they assume that we have a CRF model (which we don't): https://arxiv.org/abs/2005.00502
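
To make the multi-label framing concrete, here is a rough sketch of how per-token targets might look when each entity type gets its own binary classifier. Masking the entity types that a dataset does not annotate is an assumption on our side about how this would apply to partial annotations, not a description of what the cited papers do; `ENTITY_TYPES`, `IGNORE` and `multilabel_targets` are hypothetical names.

```python
IGNORE = -100  # placeholder meaning "no supervision for this entity type"
ENTITY_TYPES = ["GENE", "CELL_TYPE", "ORGANISM"]  # illustrative subset only


def multilabel_targets(token_types, annotated_types):
    """Build one binary target per entity type for every token.

    token_types: entity type of each token, or None for unlabelled tokens.
    annotated_types: entity types that this dataset actually annotates.
    """
    targets = []
    for token_type in token_types:
        row = []
        for entity_type in ENTITY_TYPES:
            if entity_type not in annotated_types:
                # The dataset says nothing about this type: mask it.
                row.append(IGNORE)
            else:
                row.append(1 if token_type == entity_type else 0)
        targets.append(row)
    return targets


# A GENE-only dataset: only the GENE column carries supervision.
print(multilabel_targets(["GENE", None], annotated_types={"GENE"}))
```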

We tried to summarize different approaches in a sketch (Screenshot 2022-07-13 at 14 45 30).

Here are some examples (Screenshot 2022-07-13 at 14 48 26).

It would be good to hear your thoughts.

EmilieDel commented 2 years ago

bionlp13cg has 16 entity types. AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE

We tried a simple 1:1 correspondence between our entity types and the entity types of the model:

Our entity type	Model entity type
GENE	GENE_OR_GENE_PRODUCT
CELL_TYPE	CELL
BRAIN_REGION	ANATOMICAL_SYSTEM
CELL_COMPARTMENT	CELLULAR_COMPONENT
ORGANISM	ORGANISM
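
The correspondence above can be applied to the model's predictions before scoring. A small sketch, assuming the model emits IOB-style tags such as `B-GENE_OR_GENE_PRODUCT` (the tag scheme is an assumption, and `remap_tag` is a hypothetical helper): mapped types are renamed and every other type falls back to `O`.

```python
# Model entity type -> our entity type, copied from the correspondence above.
MODEL_TO_OURS = {
    "GENE_OR_GENE_PRODUCT": "GENE",
    "CELL": "CELL_TYPE",
    "ANATOMICAL_SYSTEM": "BRAIN_REGION",
    "CELLULAR_COMPONENT": "CELL_COMPARTMENT",
    "ORGANISM": "ORGANISM",
}


def remap_tag(tag):
    """Rename a mapped entity type; send unmapped types to "O"."""
    if tag == "O":
        return tag
    # Split on the first "-" only, since some types contain hyphens.
    prefix, _, entity_type = tag.partition("-")
    ours = MODEL_TO_OURS.get(entity_type)
    return f"{prefix}-{ours}" if ours is not None else "O"


predicted = ["B-GENE_OR_GENE_PRODUCT", "I-GENE_OR_GENE_PRODUCT", "O", "B-CANCER", "B-CELL"]
remapped = [remap_tag(tag) for tag in predicted]
print(remapped)
```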

Here are the results we obtain without any fine-tuning:

	precision	recall	f1-score	support
BRAIN_REGION	0.23	0.18	0.20	345
CELL_COMPARTMENT	0.17	0.45	0.25	177
CELL_TYPE	0.28	0.48	0.35	677
GENE	0.55	0.67	0.60	1469
ORGANISM	0.28	0.46	0.34	279

Note: I tried to find the raw dataset (in NER format) but it seems hard to find.

jankrepl commented 2 years ago

We experimented with the M1 approach (replacing O in partially annotated datasets with IGNORE) and used the following external datasets:

https://huggingface.co/datasets/bc2gm_corpus - GENE
https://huggingface.co/datasets/jnlpba - CELL_TYPE + GENE
https://huggingface.co/datasets/species_800 - ORGANISM
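
The M1 relabeling can be sketched as follows. This assumes integer label ids with `0` standing for `O` (both the id and the `apply_m1` helper are hypothetical); `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so tokens carrying it do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
OUTSIDE_ID = 0  # assumed id of the "O" label


def apply_m1(label_ids, fully_annotated):
    """Replace "O" with IGNORE in partially annotated samples only."""
    if fully_annotated:
        # Internal samples keep "O": unlabelled tokens really are outside.
        return list(label_ids)
    return [IGNORE_INDEX if i == OUTSIDE_ID else i for i in label_ids]


print(apply_m1([0, 1, 2, 0], fully_annotated=False))
print(apply_m1([0, 1, 2, 0], fully_annotated=True))
```

Under this scheme the external samples only ever supply entity tokens to the loss, while every `O` token seen during training comes from the internal dataset.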

We took a random stratified split of our fully-annotated dataset. See below the definition of each of the datasets/models:

internal - only trained on train samples of our internal fully-annotated dataset
external_2 - train samples of our internal fully-annotated dataset + bc2gm + jnlpba
external_3 - train samples of our internal fully-annotated dataset + bc2gm + jnlpba + species_800

Test set performance

Train set (internal) performance

Tricky points/issues

It seems like overfitting the internal training set is actually not a terrible strategy to get good results on the internal test set. This IMO suggests that it is really hard to draw any conclusions about generalization
The M1 scheme effectively introduces a huge class imbalance
Our internal fully-annotated dataset (~200 training samples) is tiny compared to the external ones (50,000+ samples). We did not assign bigger sample weights to our internal samples and IMO the model might not care about them that much during training
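
One option that was not tried here is to up-weight the internal samples. A minimal sketch, assuming per-sample weights chosen so that the internal and external parts carry the same total weight; `balanced_weights` is a hypothetical helper and the sizes are only the rough figures quoted above.

```python
def balanced_weights(n_internal, n_external):
    """Per-sample weights so that both parts sum to the same total weight."""
    total = n_internal + n_external
    w_internal = total / (2 * n_internal)
    w_external = total / (2 * n_external)
    return w_internal, w_external


w_int, w_ext = balanced_weights(n_internal=200, n_external=50_000)
print(w_int, w_ext)
# Both parts now contribute the same total weight.
print(200 * w_int, 50_000 * w_ext)
```

Weights like these could either scale the per-sample loss or be handed to a weighted sampler such as `torch.utils.data.WeightedRandomSampler`; which of the two behaves better here is an open question.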

EmilieDel commented 2 years ago

Discussed during the 26-07 meeting. TO DOs:

[x] k-fold on the internal dataset and compute means of experiments
[ ] Train on the external (partially annotated) dataset and then "fine-tune" on the internal (fully annotated) dataset
[ ] (Less important) Training phase 1 with M2 approach

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds
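
For reference, the k-fold procedure amounts to something like the sketch below. It is a plain (non-stratified) split written with the standard library only; `k_fold_indices`, `cross_validate` and the `evaluate` callback are hypothetical names, and the callback stands in for "train on the train indices, return the F1 score on the test indices".

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the i-th one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate, k=5):
    """Average the score returned by evaluate(train_idx, test_idx) over folds."""
    return mean(evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k))


# Stub that returns the test fraction instead of a real F1 score.
def fake_evaluate(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))


print(cross_validate(200, fake_evaluate))
```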

F1 score

Test

Train

FrancescoCasalegno commented 2 years ago

Planning 2022-08-02

[x] Look at results of train + eval (k-fold cross-validation) after #607 fixes the annotations in the "ground truth"
[x] Try to "pre-train" on the external (partially annotated) NER dataset and then "fine-tune" on the internally (fully annotated) NER datasets

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds using the original (not corrected) annotations.

external data = bc2gm_corpus and jnlpba

internal - only trained on fully annotated data
external_simul - fully annotated data + external data were concatenated and the network was trained on this dataset
- The reason why the performance is worse than what was shown in the previous post is that this time the validation set consisted of both external data and internal data (before it was just the internal data)
external_seq - we first trained on external data and then trained on fully annotated internal data (sequential logic)
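
The difference between the two external regimes can be summarized schematically. This is only a sketch: `train_fn` is a hypothetical callback that takes a model and a list of samples and returns the updated model, and the stub below merely records what it was trained on.

```python
def train_simul(train_fn, model, internal, external):
    """external_simul: one training run on the concatenated data."""
    return train_fn(model, list(internal) + list(external))


def train_seq(train_fn, model, internal, external):
    """external_seq: train on external data first, then on internal data."""
    model = train_fn(model, list(external))
    return train_fn(model, list(internal))


# Stub trainer: the "model" is just a log of the training runs, in order.
def record_training(model, data):
    return model + [tuple(data)]


print(train_simul(record_training, [], ["int_a"], ["ext_a", "ext_b"]))
print(train_seq(record_training, [], ["int_a"], ["ext_a", "ext_b"]))
```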

F1 score

Test

Screenshot 2022-08-12 at 11 19 05

Train

Screenshot 2022-08-12 at 11 23 05

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds using the corrected annotation. The rest is the same as above

F1-score

Train

Screenshot 2022-08-15 at 13 35 51

Test

Screenshot 2022-08-15 at 13 35 43

FrancescoCasalegno commented 2 years ago

Update 2022-08-16

Based on the results shown in https://github.com/BlueBrain/Search/issues/608#issuecomment-1212909088, it seems that merging the partially annotated NER samples with the fully annotated ones from BBP gives bad results. Possibly, this is because the partially annotated ones outnumber the fully annotated, high-quality ones.
Based on the results shown in https://github.com/BlueBrain/Search/issues/608#issuecomment-1214915372, pre-training on the partially annotated NER samples neither decreases nor significantly improves the accuracy of the final NER model.

Decision

For the time being, it does not seem like we can leverage (partially annotated) publicly available NER datasets to improve the performance of our NER models.

BlueBrain / Search

Try improving NER model performance by using publicly available NER datasets #608
