dave-s477 / SoMeNLP

Information Extraction for Software Mentions in Scientific articles

Using a pretrained BERTMultiTaskOpt2 model to do NER #4

Open · MarcoAS99 opened this issue 2 years ago

MarcoAS99 commented 2 years ago

I've trained a model following the guidelines provided in the README, in particular using `--model-config configurations/PMC/NER/gold_multi_opt2_SciBERT.json`.

My problem is that once I have the model saved in the corresponding location, I don't know how to apply it to a text.

So my question is: how do I use a pretrained model to do named entity recognition on a given text?

dave-s477 commented 2 years ago

There is a script for this purpose under bin/predict. It expects either a list of files (as JSON) or a path to a folder, as well as a suitable configuration file.

I previously ran it with the following parameters: `predict --file-list list.json --prepro False --out-path /location/to/write --model-config configurations/PMC/NER/pred_multi_opt2_SciBERT.json --bio-pred --offset 0 --limit 100`
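For reference, the file list is a plain JSON array of file paths; something like this should work (the paths are of course placeholders):

```json
[
    "/path/to/article_1.txt",
    "/path/to/article_2.txt"
]
```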

The model checkpoint path in the configuration file has to be adjusted to your new location. There is also a pre-trained model available from Zenodo (https://zenodo.org/record/5780121), which we used for predictions: M_SB_sw_info_opt.pth

Thanks for pointing out that this is missing from the README. I will re-test the script and add an entry on running predictions.

MarcoAS99 commented 2 years ago

I've tried what you suggested, but I can't run a prediction. I changed the save function of BERTMultiTaskOpt2 so that it saves in the transformers format, because initially I was trying to predict on a text manually. My save folder now contains the following files: [screenshot]

When I try the predict command, it gives me the following error: [screenshot]

On the other hand, with the manual approach, what I currently have is this: [screenshots]

But the output I'm trying to achieve is like the content of the .ann files; I would like to have something like: "Developer 31 40 Microsoft"

Do you have any idea how I can achieve this?

dave-s477 commented 2 years ago

The error seems to be caused by your configuration file. The code throwing the error should not be executed if an existing encoding is provided in the configuration file; the responsible entry sits at the JSON path general/checkpoint/save_dir.
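For orientation, the relevant part of the configuration should look roughly like this (the exact surrounding keys are from memory, so double-check against the shipped pred_multi_opt2_SciBERT.json):

```json
{
    "general": {
        "checkpoint": {
            "save_dir": "/path/to/your/saved/model"
        }
    }
}
```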

Additionally, you will have to change the model loading (function ModelWrapper.load() in somenlp/NER/model_wrapper.py) if you have changed the model saving, for instance along the lines of the sketch below.
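As a rough sketch (not the actual code; attribute and argument names here are illustrative placeholders), mirroring a transformers-style save could look like this:

```python
import os
import torch

def load(self, checkpoint):
    # Illustrative sketch only -- mirror whatever you changed in save().
    if os.path.isdir(checkpoint):
        # transformers-style checkpoint written with model.save_pretrained(...)
        # (works if the model class inherits from transformers.PreTrainedModel)
        self.model = type(self.model).from_pretrained(checkpoint)
    else:
        # original-style checkpoint written with torch.save(...)
        self.model.load_state_dict(torch.load(checkpoint, map_location="cpu"))
    self.model.eval()
```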

The output is in IOB2 format, but I have written a script to transform it to the BRAT stand-off format (corresponding to the .ann files) in another package: https://github.com/dave-s477/articlenizer, where it is available as bin/bio_to_brat. I have applied it to output from the software prediction before.
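To illustrate the two formats with a made-up example, IOB2 output is one labelled token per line:

```
Microsoft	B-Developer
released	O
Excel	B-Application
```

and the BRAT stand-off entries, as in the .ann files, reference character offsets in the text instead:

```
T1	Developer 0 9	Microsoft
T2	Application 19 24	Excel
```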

Important addition: there are steps missing in between. bin/predict generates separate output files for all classification targets, and its output still contains the WordPiece token splits from the BERT tokenizer (e.g. ORSEE -> OR + ##SE + ##E). bin/combine_annotations can be run to combine the files and merge the BERT token splits, as illustrated below.
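To make explicit what the merging does (this is an illustration, not the actual combine_annotations code):

```python
def merge_wordpieces(tokens, labels):
    """Merge BERT WordPiece sub-tokens ('##...') back into full tokens,
    keeping the label of the first piece. Illustration only."""
    merged_tokens, merged_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and merged_tokens:
            merged_tokens[-1] += tok[2:]  # glue the piece onto the previous token
        else:
            merged_tokens.append(tok)
            merged_labels.append(lab)
    return merged_tokens, merged_labels

# merge_wordpieces(["OR", "##SE", "##E"], ["B-Application", "I-Application", "I-Application"])
# -> (["ORSEE"], ["B-Application"])
```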

You may further consider running the relation extraction step bin/predict_relext on the combined output to also get relations into the final BRAT output, though this is not strictly necessary.