geethaRam opened this issue 4 years ago
Same initial question: how do I run NER on raw sentences? Which preprocessing scripts should I use to get my data into a format similar to test.tsv?
I'm also facing the same issue. Any pointers on this? Did you find a workaround? @geethaRam
@jhyuklee @wonjininfo Any help? This should be better documented; it is very difficult to figure out how to use BioBERT for anything other than benchmarking on the datasets you provided.
@Rkubinski @gserb-datascientist
What I did was adapt the run_ner.py script for model inference. I had to work around the TFRecordDataset-based input parameters and use raw tensor slices.
@wenshutang I think our question is exactly about how to generate those tokenized raw tensor slices in a way that works with run_ner.py. How did you tokenize your text?
@Rkubinski We've been working on a PyTorch version of BioBERT, which should be easier to modify for your datasets. You can find it at https://github.com/dmis-lab/biobert-pytorch. Thanks.
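If you go the PyTorch route, inference with Transformers is only a few lines. A minimal sketch, assuming you already have a fine-tuned token-classification checkpoint on disk (the `./biobert-ner` path is a placeholder, not an official release):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder path to your own fine-tuned biobert-pytorch NER checkpoint.
model_dir = "./biobert-ner"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)
model.eval()

sentence = "This is a sample test to test diseases and disorders"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per wordpiece, including [CLS]/[SEP] positions.
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])
```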
@Rkubinski Sorry about the delayed reply, in case you are still wondering. I modified the input function builder to create tensor slices for inference; see the snippet below.
biobert-pytorch looks great 👍, super helpful.
```python
import tensorflow as tf


def input_fn_builder(features, seq_length):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""
  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_ids)

  def input_fn(params):
    batch_size = params["batch_size"]
    num_examples = len(features)

    # Build the dataset from in-memory tensor slices instead of a
    # TFRecordDataset, so no TFRecord file is needed at predict time.
    d = tf.data.Dataset.from_tensor_slices({
        # "unique_ids":
        #     tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
        "input_ids":
            tf.constant(all_input_ids, shape=[num_examples, seq_length], dtype=tf.int64),
        "input_mask":
            tf.constant(all_input_mask, shape=[num_examples, seq_length], dtype=tf.int64),
        "segment_ids":
            tf.constant(all_segment_ids, shape=[num_examples, seq_length], dtype=tf.int64),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples, seq_length], dtype=tf.int64),
    })
    d = d.batch(batch_size=batch_size, drop_remainder=False)
    return d

  return input_fn
```
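For reference, here is roughly how the builder above can be wired into the predict path. This is a sketch only: `features`, `estimator`, and `FLAGS.max_seq_length` are assumed to come from run_ner.py's existing setup, and helper names vary between forks.

```python
# Sketch: `features` is assumed to be the list of InputFeatures that
# run_ner.py already builds for the predict examples, and `estimator`
# the TPUEstimator it constructs.
predict_input_fn = input_fn_builder(features=features,
                                    seq_length=FLAGS.max_seq_length)

for prediction in estimator.predict(input_fn=predict_input_fn):
    # One label id per wordpiece position; map ids back to label strings
    # and drop the [PAD]/[CLS]/[SEP]/X positions when reassembling words.
    print(prediction)
```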
@jhyuklee @wenshutang Thank you guys, I appreciate it!
I was able to train and fine-tune the BioBERT model for the NER task, and validation also worked. Now I'm looking to use this fine-tuned model for batch/real-time inference.
run_ner.py always expects to run prediction from test.tsv only.
Can you provide additional instructions on getting an input sequence (e.g., sequence="This is a sample test to test diseases and disorders") into the test.tsv format? Or share a script that tokenizes and formats input sequences into the format run_ner.py expects for prediction?
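In case it helps while waiting for an official script, a minimal sketch of that conversion, assuming the one-token-per-line, tab-separated layout of the provided NERdata files with a blank line between sentences (check your copy of the data, since some forks use a space separator). NLTK stands in here for whatever word tokenizer you prefer; run_ner.py applies WordPiece on top of these tokens itself.

```python
# Write raw sentences as test.tsv: one token per line with a dummy "O"
# label, blank line between sentences. Separator is assumed to be a tab.
from nltk.tokenize import word_tokenize

sentences = ["This is a sample test to test diseases and disorders"]

with open("test.tsv", "w") as f:
    for sentence in sentences:
        for token in word_tokenize(sentence):
            f.write(f"{token}\tO\n")
        f.write("\n")  # sentence boundary
```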
Also, another clarification: the label_list specified here, https://github.com/dmis-lab/biobert/blob/master/run_ner.py#L192, is ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"], so num_labels is 7, correct? I'm asking because when I load BioBERT's fine-tuned model into Hugging Face Transformers' AutoModelForTokenClassification, it only predicts binary labels. No luck loading the model into the Transformers library as a pretrained model.
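Regarding num_labels: that list has seven entries, so num_labels=7. Binary predictions are what you would get if the Transformers config fell back to its default of num_labels=2. A hedged sketch of loading a converted checkpoint with the label set made explicit (`./biobert-ner` is a placeholder path, and the 0-based id order is an assumption matching the list linked above):

```python
from transformers import AutoConfig, AutoModelForTokenClassification

# Label order assumed to match run_ner.py's label_list (0-based ids).
labels = ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"]

config = AutoConfig.from_pretrained(
    "./biobert-ner",                # placeholder: your converted checkpoint
    num_labels=len(labels),         # 7, instead of the default 2
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)
model = AutoModelForTokenClassification.from_pretrained("./biobert-ner",
                                                        config=config)
```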
Any help is appreciated.