ThilinaRajapakse / pytorch-transformers-classification

Based on the Pytorch-Transformers library by HuggingFace. To be used as a starting point for employing Transformer models in text classification tasks. Contains code to easily train BERT, XLNet, RoBERTa, and XLM models for text classification.
Apache License 2.0

How to use model for making predictions? #6

Open adityakapri opened 5 years ago

adityakapri commented 5 years ago

Once the model has been trained, how do I make predictions with it? I have examples with no labels, and I need to find all the predicted labels.

ThilinaRajapakse commented 5 years ago

The easiest way to do it would probably be something like this. I am setting the label to 0 for all the examples, but the labels will not be used.

# These imports (and the import path for InputExample/convert_examples_to_features)
# are assumptions; adjust them to match your script.
import numpy as np
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

from utils import InputExample, convert_examples_to_features

# max_seq_len, tokenizer, output_mode, eval_batch_size, and model_type are assumed
# to be defined with the same values used during training.

def tokenize(all_data):
    # all_data is a list of raw text strings; the label is a dummy '0' and is never used
    test_examples = [InputExample(0, sentence, None, '0') for sentence in all_data]
    label_list = ["0", "1"]

    num_labels = len(label_list)
    test_examples_len = len(test_examples)
    label_map = {label: i for i, label in enumerate(label_list)}

    test_features = convert_examples_to_features(test_examples, label_list, max_seq_len, tokenizer, output_mode,
        cls_token_at_end=bool(model_type == 'xlnet'),            # xlnet has a cls token at the end
        cls_token=tokenizer.cls_token,
        cls_token_segment_id=2 if model_type == 'xlnet' else 0,
        sep_token=tokenizer.sep_token,
        sep_token_extra=bool(model_type == 'roberta'),
        pad_on_left=bool(model_type == 'xlnet'),                 # pad on the left for xlnet
        pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
        pad_token_segment_id=4 if model_type == 'xlnet' else 0)

    all_input_ids = torch.tensor([f.input_ids for f in test_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in test_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in test_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in test_features], dtype=torch.long)

    test_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids)
    return test_data

def get_predictions(model, test_data):
    model.eval()
    test_sampler = SequentialSampler(test_data)
    eval_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=eval_batch_size)
    preds = None
    for batch in eval_dataloader:
        with torch.no_grad():
            batch = tuple(t for t in batch)   # add .to(device) here if running on GPU
            inputs = {'input_ids':      batch[0],
                      'attention_mask': batch[1],
                      'token_type_ids': batch[2],
                      'labels':         batch[3]}

            outputs = model(**inputs)
            _, logits = outputs[:2]
        if preds is None:
            preds = logits.detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)

    # take the argmax once, after all batches have been collected
    preds = np.argmax(preds, axis=1)

    return preds

You can use the tokenize() function to prepare the data, pass the result to get_predictions(), and collect the predictions.
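
Usage would look roughly like this (a sketch; it assumes the fine-tuned model and tokenizer have already been loaded from the saved checkpoint, and the sentences are just placeholders):

sentences = ["This was a great purchase.", "Not worth the money."]

test_data = tokenize(sentences)                  # all_data is just a list of raw strings
predictions = get_predictions(model, test_data)
print(predictions)                               # e.g. array([1, 0]) of predicted label indices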

There may be cleaner ways of doing this, but it didn't seem worth the trouble to me (the class specification for InputExample says the label can be set to None for test data, but that would also require a lot more changes to the code). These two functions are adapted from something similar I wrote for an API that generates predictions. The API is working, so the approach is sound. However, I haven't tested the specific code I provided here, so let me know if it throws any bugs and I can see about fixing them.

Magpi007 commented 4 years ago

Isn't the get_mismatched function pulling out the wrong predictions? Would it be possible to just adjust this function to get both right and wrong preds?

ThilinaRajapakse commented 4 years ago

It's certainly possible. Its original purpose was to give insight into examples that the model was getting wrong.
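
Something along these lines should work (a minimal sketch; it assumes you have the evaluation examples alongside the true and predicted label arrays, and the function name is just illustrative):

def get_matched_and_mismatched(examples, labels, preds):
    # split the evaluation examples into correctly and incorrectly predicted ones
    matched, mismatched = [], []
    for example, true_label, pred_label in zip(examples, labels, preds):
        if true_label == pred_label:
            matched.append((example, pred_label))
        else:
            mismatched.append((example, pred_label))
    return matched, mismatched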

Mahhos commented 4 years ago

I've got two questions.

  1. What is the format of all_data in the def tokenize(all_data): function? Is it a ".tsv" file in the same format as "train.tsv" and "dev.tsv"?
  2. Where should we put these functions, and how should we call them?
Mahhos commented 4 years ago

When I run the tokenize function, I get ValueError: Number of processes must be at least 1. However, when I print os.cpu_count(), it shows 2. Do you have any idea why?

djSharma7 commented 4 years ago

Can we get classification results on the basis of labels along with their polarities? For example: "The product is good, but the price is very high." Results: Product - Positive (polarity), Price - Negative (polarity).