geethaRam opened this issue 4 years ago
Same initial question: how do I run NER on raw sentences? Which preprocessing scripts should I use to get my data into a format similar to test.tsv?
I'm also facing the same issue. Any pointers on this? Did you find a workaround? @geethaRam
@jhyuklee @wonjininfo Any help? This should be better documented; it is very difficult to figure out how to use BioBERT for anything other than benchmarking on the datasets you provided.
@Rkubinski @gserb-datascientist
What I did was adapt the run_ner.py script for model inference. I had to work around the TFRecordDataset-based input parameters and use raw tensor slices.
@wenshutang I think our question is exactly about how to generate those tokenized raw tensor slices in a way that works with run_ner.py. How did you tokenize your text?
@Rkubinski We've been working on a PyTorch version of BioBERT, which should be easier to modify for your datasets. You can find it at https://github.com/dmis-lab/biobert-pytorch. Thanks.
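If you go the PyTorch route, inference with Transformers is only a few lines. A minimal sketch, assuming you already have a fine-tuned token-classification checkpoint on disk (the `./biobert-ner` path is a placeholder, not an official release):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder path to your own fine-tuned biobert-pytorch NER checkpoint.
model_dir = "./biobert-ner"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)
model.eval()

sentence = "This is a sample test to test diseases and disorders"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per wordpiece, including [CLS]/[SEP] positions.
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])
```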
@Rkubinski Sorry about the delayed reply, in case you are still wondering. I modified the input function builder to create tensor slices for inference; see the snippet below.
biobert-pytorch looks great 👍, super helpful.
```python
import tensorflow as tf


def input_fn_builder(features, seq_length):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""
  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_ids)

  def input_fn(params):
    batch_size = params["batch_size"]
    num_examples = len(features)

    # Build the dataset from in-memory tensor slices instead of a
    # TFRecordDataset, so no TFRecord file is needed at predict time.
    d = tf.data.Dataset.from_tensor_slices({
        # "unique_ids":
        #     tf.constant(all_unique_ids, shape=[num_examples], dtype=tf.int32),
        "input_ids":
            tf.constant(all_input_ids, shape=[num_examples, seq_length], dtype=tf.int64),
        "input_mask":
            tf.constant(all_input_mask, shape=[num_examples, seq_length], dtype=tf.int64),
        "segment_ids":
            tf.constant(all_segment_ids, shape=[num_examples, seq_length], dtype=tf.int64),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples, seq_length], dtype=tf.int64),
    })
    d = d.batch(batch_size=batch_size, drop_remainder=False)
    return d

  return input_fn
```
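For reference, here is roughly how the builder above can be wired into the predict path. This is a sketch only: `features`, `estimator`, and `FLAGS.max_seq_length` are assumed to come from run_ner.py's existing setup, and helper names vary between forks.

```python
# Sketch: `features` is assumed to be the list of InputFeatures that
# run_ner.py already builds for the predict examples, and `estimator`
# the TPUEstimator it constructs.
predict_input_fn = input_fn_builder(features=features,
                                    seq_length=FLAGS.max_seq_length)

for prediction in estimator.predict(input_fn=predict_input_fn):
    # One label id per wordpiece position; map ids back to label strings
    # and drop the [PAD]/[CLS]/[SEP]/X positions when reassembling words.
    print(prediction)
```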
@jhyuklee @wenshutang Thank you guys, I appreciate it!
I was able to train and fine-tune the BioBERT model for the NER task, and validation also worked. Now I'm looking to use this fine-tuned model for batch/real-time inference.
run_ner.py always expects to run prediction from test.tsv only.
Can you provide additional instructions on getting an input sequence (e.g., sequence="This is a sample test to test diseases and disorders") into the test.tsv format? Or share a script that tokenizes and formats input sequences into the format run_ner.py expects for prediction?
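In case it helps while waiting for an official script, a minimal sketch of that conversion, assuming the one-token-per-line, tab-separated layout of the provided NERdata files with a blank line between sentences (check your copy of the data, since some forks use a space separator). NLTK stands in here for whatever word tokenizer you prefer; run_ner.py applies WordPiece on top of these tokens itself.

```python
# Write raw sentences as test.tsv: one token per line with a dummy "O"
# label, blank line between sentences. Separator is assumed to be a tab.
from nltk.tokenize import word_tokenize

sentences = ["This is a sample test to test diseases and disorders"]

with open("test.tsv", "w") as f:
    for sentence in sentences:
        for token in word_tokenize(sentence):
            f.write(f"{token}\tO\n")
        f.write("\n")  # sentence boundary
```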
Also, another clarification: the label_list specified here, https://github.com/dmis-lab/biobert/blob/master/run_ner.py#L192, is ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"], so num_labels is 7, correct? I'm asking because when I load BioBERT's fine-tuned model into Hugging Face Transformers' AutoModelForTokenClassification, it only predicts binary labels. No luck loading the model into the Transformers library as a pretrained model.
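Regarding num_labels: that list has seven entries, so num_labels=7. Binary predictions are what you would get if the Transformers config fell back to its default of num_labels=2. A hedged sketch of loading a converted checkpoint with the label set made explicit (`./biobert-ner` is a placeholder path, and the 0-based id order is an assumption matching the list linked above):

```python
from transformers import AutoConfig, AutoModelForTokenClassification

# Label order assumed to match run_ner.py's label_list (0-based ids).
labels = ["[PAD]", "B", "I", "O", "X", "[CLS]", "[SEP]"]

config = AutoConfig.from_pretrained(
    "./biobert-ner",                # placeholder: your converted checkpoint
    num_labels=len(labels),         # 7, instead of the default 2
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)
model = AutoModelForTokenClassification.from_pretrained("./biobert-ner",
                                                        config=config)
```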
Any help is appreciated.