BMDSoftware / neji

Flexible and powerful platform for biomedical information extraction from text

Trained Model Recognizes Nothing #9

Closed wangmj17 closed 7 years ago

wangmj17 commented 7 years ago

Hi,

I just tried to train a model with "./nejiTrain.sh -a example/train/annotations -c example/train/sentences -f example/train/bw_o2_windows.config -if BC2 -m mymodel -o mymodel -t 11". However, when I tried to use the trained model to annotate with "./neji.sh -i example/annotate/in/ -o example/annotate/out/ -if RAW -of A1 -m mymodel/mymodel", the model didn't recognize anything (the output is empty). Could you help me figure out why?

Thanks

davidcampos commented 7 years ago

@wangmj17 By default Neji does not provide annotations without associated identifiers. Since you are not providing normalization dictionaries for the ML model, no annotations are provided. Thus, please use the option "-noids" to obtain the annotations without identifiers.

wangmj17 commented 7 years ago

@davidcampos Thanks so much. I tried to use my own training data to train the model. During the training phase everything was fine, and an F-score of 0.999+ was achieved. However, when I applied the trained model to annotate the training data (to check whether the trained model works), the annotations were totally wrong (or empty). I attached my training data and config file here. Could you help me find out why?

train.zip

bw_o2_windows.config.zip

(I changed every setting in the config file to "xxxx=1" so as to achieve a high f-score.)

The commands I used:

"./nejiTrain.sh -a train/annotations -c train/sentences -f bw_o2_windows.config -if BC2 -m mymodel -o mymodel -t 11"

"./neji.sh -i input -o output -if RAW -of A1 -m mymodel/mymodel -noids -t 5"

Thanks in advance!

aleixomatos commented 7 years ago

It may be that you are supplying too little training data and your model, which uses many features, is overfitted.

wangmj17 commented 7 years ago

@aleixomatos I am applying the trained model to annotate training data, not test data. So even if there is overfitting, the annotation on training data should be correct.

davidcampos commented 7 years ago

@wangmj17 I believe that using all the features will not deliver the best results. Please have a look at the paper about Gimli (the ML part of Neji), where this subject is discussed in the "Feature set" section: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-54

Using the data that you provided, I tried training a model with the following configuration (order 2 will deliver better results):

token=1
stem=1
lemma=1
pos=1
chunk=1
nlp=0
capitalization=1
counting=1
symbols=1
ngrams=1
suffix=1
prefix=1
morphology=1
greek=1
roman=1
prge=0
concepts=0
verbs=1
window=1
conjunctions=0
order=1
parsing=FW
entity=ENTITY

I observed problems in the training phase, with the following evaluation results:

INFO: OVERALL
Feb 26, 2017 10:32:57 AM cc.mallet.fst.MultiSegmentationEvaluator evaluateInstanceList
INFO:  train segments true=3317 pred=1873 correct=1774 misses=1543 alarms=99
Feb 26, 2017 10:32:57 AM cc.mallet.fst.MultiSegmentationEvaluator evaluateInstanceList
INFO:  train precision=0.9471 recall=0.5348 f1=0.6836

An F1 of 0.6836 on the training data may indicate problems in the annotations. Could you please double-check that the annotation character positions are provided correctly to the training phase?
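For reference, the scores in the log can be reproduced from the segment counts on the line above them. The sketch below uses the standard precision, recall, and F1 definitions; the function name is illustrative and not part of Neji or MALLET.

```python
def segment_scores(true_count: int, pred_count: int, correct: int):
    """Return (precision, recall, f1) from segment counts.

    Illustrative helper, not a Neji or MALLET function.
    """
    precision = correct / pred_count   # share of predicted segments that are right
    recall = correct / true_count      # share of gold segments that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the training log: true=3317 pred=1873 correct=1774
p, r, f1 = segment_scores(true_count=3317, pred_count=1873, correct=1774)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

The low recall (1543 of 3317 gold segments missed, on the training data itself) is what points towards a mismatch between the annotations and the text rather than ordinary overfitting.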

Nevertheless, I annotated the example files using this model and I was able to get annotations in the output:

T0  ENTITY 83 102   4 chronic hepatitis
N0  Reference T0 :::ENTITY  4 chronic hepatitis

wangmj17 commented 7 years ago

@davidcampos I double-checked the annotations and the character positions are correct. Actually, the training data is provided by GNormPlus (https://www.researchgate.net/publication/282038998_GNormPlus_An_Integrative_Approach_for_Tagging_Genes_Gene_Families_and_Protein_Domains) and is used to train their model (also based on CRF).

davidcampos commented 7 years ago

@wangmj17 Following the information provided in https://github.com/BMDSoftware/neji/wiki/Formats#bc2:

The annotations file should contain one annotation per line, which follows the following format: SENTENCE_ID|FIRST_CHAR LAST_CHAR|TEXT. The character counting used for the FIRST_CHAR and LAST_CHAR, must be performed discarding white spaces.

Are you discarding white spaces when counting the character positions?
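A minimal sketch of that conversion, in case the source annotations use ordinary character offsets. It assumes the input offsets are zero-based with an exclusive end, that the mention has no leading or trailing whitespace, and that FIRST_CHAR and LAST_CHAR are zero-based and inclusive, as in the BioCreative II gene mention data. The function names and the sentence ID are hypothetical and not part of Neji.

```python
from typing import Tuple


def to_bc2_offsets(sentence: str, start: int, end: int) -> Tuple[int, int]:
    """Convert [start, end) offsets that count every character into
    (first, last) offsets that skip whitespace.

    Assumes zero-based input, exclusive end, and inclusive output.
    """
    mention = sentence[start:end]
    if not mention or mention != mention.strip():
        raise ValueError("mention must be non-empty and trimmed")
    # Whitespace characters that precede the mention shift both offsets.
    ws_before = sum(ch.isspace() for ch in sentence[:start])
    # Whitespace inside the mention shifts only the last offset.
    ws_inside = sum(ch.isspace() for ch in mention)
    first = start - ws_before
    last = (end - 1) - ws_before - ws_inside
    return first, last


def format_bc2_line(sentence_id: str, sentence: str, start: int, end: int) -> str:
    """Build one annotation line: SENTENCE_ID|FIRST_CHAR LAST_CHAR|TEXT."""
    first, last = to_bc2_offsets(sentence, start, end)
    return f"{sentence_id}|{first} {last}|{sentence[start:end]}"


sentence = "Patients with chronic hepatitis were enrolled."
start = sentence.index("chronic hepatitis")
end = start + len("chronic hepatitis")
print(format_bc2_line("S1", sentence, start, end))
```

Here the ordinary offsets 14 to 31 become 12 and 27, because two spaces precede the mention and one lies inside it.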

wangmj17 commented 7 years ago

@davidcampos That was exactly the problem. After discarding whitespace when counting, it works now. Thank you very much!

davidcampos commented 7 years ago

@wangmj17 Happy to help!