Closed FrancescoCasalegno closed 2 years ago
We tried to summarize the different approaches in a sketch. Here are some examples; it would be good to hear your thoughts.
`bionlp13cg` has 16 entity types: AMINO_ACID, ANATOMICAL_SYSTEM, CANCER, CELL, CELLULAR_COMPONENT, DEVELOPING_ANATOMICAL_STRUCTURE, GENE_OR_GENE_PRODUCT, IMMATERIAL_ANATOMICAL_ENTITY, MULTI-TISSUE_STRUCTURE, ORGAN, ORGANISM, ORGANISM_SUBDIVISION, ORGANISM_SUBSTANCE, PATHOLOGICAL_FORMATION, SIMPLE_CHEMICAL, TISSUE.
We tried a simple 1:1 correspondence between our entity types and the entity types of the model:

| Our entity type | Model entity type |
|---|---|
| GENE | GENE_OR_GENE_PRODUCT |
| CELL_TYPE | CELL |
| BRAIN_REGION | ANATOMICAL_SYSTEM |
| CELL_COMPARTMENT | CELLULAR_COMPONENT |
| ORGANISM | ORGANISM |
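The mapping above can be expressed as a small lookup table; a minimal sketch (function and variable names here are illustrative, not our actual code):

```python
# 1:1 mapping of our entity types to the closest bionlp13cg entity types,
# as in the table above.
OURS_TO_MODEL = {
    "GENE": "GENE_OR_GENE_PRODUCT",
    "CELL_TYPE": "CELL",
    "BRAIN_REGION": "ANATOMICAL_SYSTEM",
    "CELL_COMPARTMENT": "CELLULAR_COMPONENT",
    "ORGANISM": "ORGANISM",
}
MODEL_TO_OURS = {v: k for k, v in OURS_TO_MODEL.items()}


def remap_prediction(label: str) -> str:
    """Translate a model IOB label (e.g. 'B-CELL') to our schema ('B-CELL_TYPE').

    Labels for model entity types we do not use are collapsed to 'O'.
    """
    if label == "O":
        return "O"
    prefix, _, entity = label.partition("-")
    ours = MODEL_TO_OURS.get(entity)
    return f"{prefix}-{ours}" if ours else "O"
```

For example, `remap_prediction("B-CELL")` returns `"B-CELL_TYPE"`, while `remap_prediction("B-SIMPLE_CHEMICAL")` returns `"O"` since we have no corresponding entity type.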
Here are the results we obtain without any fine-tuning:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| BRAIN_REGION | 0.23 | 0.18 | 0.20 | 345 |
| CELL_COMPARTMENT | 0.17 | 0.45 | 0.25 | 177 |
| CELL_TYPE | 0.28 | 0.48 | 0.35 | 677 |
| GENE | 0.55 | 0.67 | 0.60 | 1469 |
| ORGANISM | 0.28 | 0.46 | 0.34 | 279 |
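For reference, per-type scores like these come from entity-level matching; a minimal sketch of the metric (the actual evaluation presumably used a library such as `seqeval`):

```python
def prf(gold: set, pred: set) -> tuple:
    """Entity-level precision/recall/F1, where each entity is a
    (start, end, type) tuple and only exact matches count as true positives."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, with two gold entities and two predictions sharing one exact match, `prf` returns (0.5, 0.5, 0.5).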
Note: I tried to find the raw dataset (in NER format), but it seems hard to find.
We experimented with the M1 approach (replacing `O` in partially annotated datasets with `IGNORE`) and used external datasets covering the following entity types:

- GENE
- CELL_TYPE + GENE
- ORGANISM
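The M1 relabeling can be sketched as follows: in a partially annotated dataset, `O` is not a trustworthy negative for entity types the dataset does not annotate, so it is replaced with `IGNORE` (names here are illustrative):

```python
def apply_m1(labels: list) -> list:
    """Replace 'O' with 'IGNORE' in a partially annotated label sequence.

    In a dataset that only annotates, say, GENE, an 'O' token may still be
    a CELL_TYPE or ORGANISM mention, so its label cannot be trusted as a
    negative example for those types.
    """
    return ["IGNORE" if label == "O" else label for label in labels]
```

For example, `apply_m1(["O", "B-GENE", "I-GENE", "O"])` returns `["IGNORE", "B-GENE", "I-GENE", "IGNORE"]`.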
We took a stratified random split of our fully-annotated dataset. See below the definition of each of the datasets/models:

- `internal`: only trained on `train` samples of our internal fully-annotated dataset
- `external_2`: `train` samples of our internal fully-annotated dataset + `bc2gm` + `jnlpba`
- `external_3`: `train` samples of our internal fully-annotated dataset + `bc2gm` + `jnlpba` + `species_800`
Test set performance
Train set (internal) performance
Tricky points/issues
Discussed during the 26-07 meeting. TO DOs:

- K-fold cross-validation with 5 folds using the original (not corrected) annotations.
  - external data = `bc2gm_corpus` and `jnlpba`
  - `internal`: only trained on fully annotated data
  - `external_simul`: fully annotated data and external data were concatenated, and the network was trained on this combined dataset
  - `external_seq`: we first trained on external data and then trained on fully annotated internal data (sequential logic)
- K-fold cross-validation with 5 folds using the corrected annotations; the rest is the same as above.
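A minimal sketch of the 5-fold split itself, assuming contiguous folds over sample indices (in practice one would likely use scikit-learn's `KFold`; the three training regimes would then be run once per fold):

```python
def five_fold_indices(n_samples: int, n_folds: int = 5):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        start = k * fold_size
        # The last fold absorbs the remainder when n_samples % n_folds != 0.
        end = n_samples if k == n_folds - 1 else start + fold_size
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx
```

Each sample appears in exactly one test fold, so per-fold metrics can be averaged across the 5 runs.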
Decision
Context
Actions
Use `-100` to mask tokens so that the `torch` loss function ignores them?
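On that last point: PyTorch's `CrossEntropyLoss` skips any target equal to its `ignore_index`, which defaults to `-100`, so mapping `IGNORE` to `-100` at label-encoding time would do exactly that. A minimal sketch of the encoding step, without the torch dependency (the label set here is illustrative):

```python
# Illustrative label vocabulary; the real one covers all our entity types.
LABEL_TO_ID = {"O": 0, "B-GENE": 1, "I-GENE": 2}


def encode_labels(labels: list) -> list:
    """Map string labels to integer ids.

    'IGNORE' becomes -100, so torch.nn.CrossEntropyLoss (whose default
    ignore_index is -100) excludes those tokens from the loss.
    """
    return [-100 if label == "IGNORE" else LABEL_TO_ID[label] for label in labels]
```

For example, `encode_labels(["IGNORE", "B-GENE", "I-GENE", "O"])` returns `[-100, 1, 2, 0]`.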