dmis-lab / biobert-pytorch

PyTorch Implementation of BioBERT
http://doi.org/10.1093/bioinformatics/btz682

Predictions in raw data #5

Open GuillermoJaca opened 3 years ago

GuillermoJaca commented 3 years ago

Hello, I am wondering how to run predictions on raw data. This is not documented anywhere, and I think it is the primary use of the model.

jhyuklee commented 3 years ago

Hi @GuillermoJaca, what do you mean by the raw data? I think the pre-processing will depend on the type of task you want.

GuillermoJaca commented 3 years ago

I mean ordinary biomedical text. The issue is that there is no .predict function, so run_ner.py has to be customized. What is the best way to do that? Which preprocessing should I use to get the best possible performance from the model, given that my task is NER?

mgavish commented 3 years ago

Instructions for using the repo for inference are in the README under the NER section: https://github.com/dmis-lab/biobert#user-content-named-entity-recognition-ner:~:text=You%20can%20change%20the%20arguments%20as,using%20%2D%2Ddo_train%3Dfalse%20%2D%2Ddo_predict%3Dtrue%20for%20evaluating%20test.tsv.

The bigger challenge is running inference without the repo, i.e., without repo-specific functions and methods.
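For the README route, raw text first has to be turned into the one-token-per-line file that the prediction step reads. The sketch below is only an illustration: the function name, the regex-based sentence and token splitting, the space separator, and the dummy "O" label are all assumptions, so check them against the sample files shipped with the repo before relying on it.

```python
import re


def raw_text_to_token_lines(text, dummy_label="O", sep=" "):
    """Turn raw text into 'token<sep>label' lines, one token per line.

    Hypothetical helper: the separator and label column are assumptions
    about the input format, not taken from the repo.
    """
    # Naive sentence split after ., ! or ? followed by whitespace.
    # A biomedical sentence splitter would likely do better.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Words and punctuation marks become separate tokens.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        if not tokens:
            continue
        lines.extend(f"{token}{sep}{dummy_label}" for token in tokens)
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)


print(raw_text_to_token_lines("He has diabetes. It is severe."))
```

The dummy label only fills the label column; it carries no information and the predictions replace it.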

abhibisht89 commented 3 years ago

@GuillermoJaca for prediction you can use your fine-tuned model directly in a Hugging Face transformers pipeline. Some sample code for your reference:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Placeholder: replace with the directory that holds your fine-tuned model.
model_path = "finetuned_model_path"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

# Group tokens that belong to the same entity into a single result.
nlp = pipeline(task="ner", model=model, tokenizer=tokenizer, grouped_entities=True, ignore_subwords=True)
text = "he is feeling very sick"
output = nlp(text)
print(output)

Read more about the Hugging Face pipeline here: https://huggingface.co/transformers/main_classes/pipelines.html

nowhyun commented 3 years ago

@abhibisht89 Thank you for your reply.

However, if the tokenizer is specified as 'dmis-lab/biobert-v1.1', the ignore_subwords option cannot be set to True.

Is there any other way?
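One possible workaround, sketched under assumptions rather than confirmed in this thread, is to run the pipeline without subword handling and merge the word pieces afterwards. The helper name below is hypothetical. It assumes each prediction is a dict with "word", "entity" and "score" keys, that continuation pieces start with "##", and that no tokens were filtered out before merging. The example labels and scores are made up for illustration.

```python
def merge_wordpieces(token_predictions):
    """Merge WordPiece continuations ('##...') into whole words.

    Each merged word keeps the label and score of its first piece.
    Assumes every token is present, in order, in token_predictions.
    """
    words = []
    for pred in token_predictions:
        piece = pred["word"]
        if piece.startswith("##") and words:
            # Continuation piece: extend the previous word, ignore its label.
            words[-1]["word"] += piece[2:]
        else:
            words.append(
                {"word": piece, "entity": pred["entity"], "score": pred["score"]}
            )
    return words


# Made-up token-level predictions, shaped like ungrouped pipeline output.
example = [
    {"word": "hyper", "entity": "B", "score": 0.98},
    {"word": "##tension", "entity": "I", "score": 0.91},
    {"word": "is", "entity": "O", "score": 0.99},
]
print(merge_wordpieces(example))
```

Newer transformers releases also expose an aggregation_strategy argument on the token classification pipeline, which may make a manual merge like this unnecessary.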

cutejue commented 3 years ago

Hello, I wonder why the NER task uses simple BIO labels when the raw dataset (e.g. NCBI) has labels such as SpecificDisease, Modifier, and so on.
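One plausible reading, not confirmed in this thread, is that the preprocessed NER files treat every annotated mention as a single entity type, so the category name is dropped and only the B/I/O position remains. The sketch below illustrates that collapse. The function name, the span format (start token, exclusive end token, category), and the example sentence are assumptions made for illustration.

```python
def spans_to_bio(tokens, spans):
    """Tag tokens with untyped B/I/O labels.

    spans holds (start_token, end_token_exclusive, category) triples;
    the category is deliberately discarded.
    """
    tags = ["O"] * len(tokens)
    for start, end, _category in spans:
        tags[start] = "B"  # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I"  # remaining tokens of the mention
    return tags


tokens = ["Patients", "with", "colorectal", "cancer", "were", "studied"]
spans = [(2, 4, "SpecificDisease")]  # hypothetical annotation
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```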