BlueBrain / Search

Blue Brain text mining toolbox for semantic search and structured information extraction
https://blue-brain-search.readthedocs.io
GNU Lesser General Public License v3.0

Try improving NER model performance by using publicly available NER datasets #608

Closed FrancescoCasalegno closed 2 years ago

FrancescoCasalegno commented 2 years ago

Context

Actions

jankrepl commented 2 years ago

We tried to summarize the different approaches in a sketch:
[Screenshot 2022-07-13 at 14 45 30]

Here are some examples:
[Screenshot 2022-07-13 at 14 48 26]

It would be good to hear your thoughts.

EmilieDel commented 2 years ago

bionlp13cg has 16 entity types: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE.

We tried a simple 1:1 correspondence between our entity types and the entity types of the model:

Our entity type      Model entity type
GENE                 GENE_OR_GENE_PRODUCT
CELL_TYPE            CELL
BRAIN_REGION         ANATOMICAL_SYSTEM
CELL_COMPARTMENT     CELLULAR_COMPONENT
ORGANISM             ORGANISM

Here are the results we obtain without any fine-tuning:

                     precision   recall   f1-score   support
BRAIN_REGION         0.23        0.18     0.20       345
CELL_COMPARTMENT     0.17        0.45     0.25       177
CELL_TYPE            0.28        0.48     0.35       677
GENE                 0.55        0.67     0.60       1469
ORGANISM             0.28        0.46     0.34       279
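
For reference, the 1:1 correspondence above could be applied to the model predictions roughly as follows. This is only a sketch: it assumes the model emits IOB-style tags such as `B-CELL`, and the names `MODEL_TO_OURS` and `remap_tags` are hypothetical, not taken from the repository. Model entity types without a counterpart in our label set are treated as `O`.

```python
# Hypothetical helper names; the mapping itself is the 1:1 correspondence
# listed in the comment above.
MODEL_TO_OURS = {
    "GENE_OR_GENE_PRODUCT": "GENE",
    "CELL": "CELL_TYPE",
    "ANATOMICAL_SYSTEM": "BRAIN_REGION",
    "CELLULAR_COMPONENT": "CELL_COMPARTMENT",
    "ORGANISM": "ORGANISM",
}


def remap_tags(model_tags):
    """Translate IOB tags predicted by the model into our label set.

    Assumption: tags look like "B-<TYPE>" / "I-<TYPE>" / "O".
    Entity types that have no counterpart in our schema become "O".
    """
    remapped = []
    for tag in model_tags:
        if tag == "O":
            remapped.append("O")
            continue
        # Split on the first "-" only, so that types containing a hyphen
        # (e.g. MULTI-TISSUE_STRUCTURE) stay intact.
        prefix, _, entity = tag.partition("-")
        ours = MODEL_TO_OURS.get(entity)
        remapped.append(f"{prefix}-{ours}" if ours else "O")
    return remapped


predicted = ["B-CELL", "I-CELL", "O", "B-CANCER", "B-GENE_OR_GENE_PRODUCT"]
print(remap_tags(predicted))
```

The remapped tags can then be compared against our annotations with any standard NER metric to get a per-entity report like the one above.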

Note: I tried to find the raw dataset (in NER format), but it seems to be hard to find.

jankrepl commented 2 years ago

We experimented with the M1 approach (replacing O in partially annotated datasets with IGNORE) and used the following external datasets
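
As an illustration of the M1 idea (not the actual implementation), the relabeling could look like the sketch below. The assumption is that a partially annotated dataset only marks some entity types, so an `O` there does not guarantee "no entity" and should not contribute to the loss. The names `IGNORE`, `apply_m1` and `loss_mask` are hypothetical.

```python
IGNORE = "IGNORE"  # placeholder label, name assumed for illustration


def apply_m1(tags, partially_annotated):
    """Replace "O" with IGNORE for samples from partially annotated datasets.

    Fully annotated samples are returned unchanged, since their "O" tags
    are trustworthy negatives.
    """
    if not partially_annotated:
        return list(tags)
    return [IGNORE if tag == "O" else tag for tag in tags]


def loss_mask(tags):
    """Return True for tokens that should contribute to the training loss."""
    return [tag != IGNORE for tag in tags]


external = apply_m1(["O", "B-GENE", "I-GENE", "O"], partially_annotated=True)
internal = apply_m1(["O", "B-GENE", "I-GENE", "O"], partially_annotated=False)
print(external, loss_mask(external))
print(internal, loss_mask(internal))
```

With a PyTorch cross-entropy loss, the same effect is usually obtained by mapping the ignored tokens to the `ignore_index` label id (which defaults to -100) instead of building an explicit mask.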

We took a random stratified split of our fully-annotated dataset. See below the definition of each of the datasets/models

Test set performance

Screenshot 2022-07-22 at 09 42 15

Train set (internal) performance

Screenshot 2022-07-22 at 09 22 18

Tricky points/issues

EmilieDel commented 2 years ago

Discussed during the 26-07 meeting. TO DOs:

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds
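
For readers unfamiliar with the setup, a 5-fold split can be sketched as below. This is a minimal, dependency-free illustration and not the code used for these experiments; `k_fold_indices` is a hypothetical helper and the seed is arbitrary. Each sample ends up in the test set of exactly one fold and in the training set of the other four.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for fold_id, (train, test) in enumerate(k_fold_indices(23)):
    print(fold_id, len(train), len(test))
```

The reported score is then typically the mean (and spread) of the per-fold F1 scores.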

F1 score

Test

Screenshot 2022-08-02 at 09 38 32

Train

Screenshot 2022-08-02 at 09 38 51

FrancescoCasalegno commented 2 years ago

Planning 2022-08-02

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds using the original (not corrected) annotations.

external data = bc2gm_corpus and jnlpba

F1 score

Test

Screenshot 2022-08-12 at 11 19 05

Train

Screenshot 2022-08-12 at 11 23 05

jankrepl commented 2 years ago

K-fold cross-validation with 5 folds using the corrected annotations. The rest is the same as above.

F1 score

Train

Screenshot 2022-08-15 at 13 35 51

Test

Screenshot 2022-08-15 at 13 35 43

FrancescoCasalegno commented 2 years ago

Update 2022-08-16

Decision