allenai / kb

KnowBert -- Knowledge Enhanced Contextual Word Representations

Binary Text Classification #13

Closed AliOsm closed 4 years ago

AliOsm commented 4 years ago

How to finetune KnowBERT for binary text classification tasks?

For example, given a sequence of words predict its polarity.

Thanks!

matt-peters commented 4 years ago

The Words-In-Context model is an example of binary classification -- predict whether the usage of a word in two contexts is the same sense or not. See kb/evaluation/wic_dataset_reader.py and training_config/downstream/wic.jsonnet.

AliOsm commented 4 years ago

Thanks for your reply!

Can you explain the following lines in kb/evaluation/wic_dataset_reader.py:

# get the indices of the marked words
# index in the original tokens
idx1, idx2 = [int(ind) for ind in tokens[2].split('-')]
offsets_a = [1] + token_candidates['offsets_a'][:-1]
idx1_offset = offsets_a[idx1]
offsets_b = [token_candidates['offsets_a'][-1] + 1] + token_candidates['offsets_b'][:-1]
idx2_offset = offsets_b[idx2]

fields['index_a'] = LabelField(idx1_offset, skip_indexing=True)
fields['index_b'] = LabelField(idx2_offset, skip_indexing=True)

I didn't understand them. Are they related to the WiC task?

Thanks!

matt-peters commented 4 years ago

Yes, these are related to the WiC task -- it's code to find the indices of the marked word in each sentence (see Table 1 of https://arxiv.org/abs/1808.09121).
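
To make the index arithmetic concrete, here is a small hypothetical walk-through. It assumes (purely for illustration, not taken from the repository) that offsets_a holds, for each whitespace token of sentence A, the wordpiece position just past that token:

# Hypothetical example: sentence A = "he played the piano", marked word idx1 = 1 ("played").
# Wordpieces: [CLS] he played the pia ##no  ->  positions 0..5
raw_offsets_a = [2, 3, 4, 6]           # assumed: position just past each whitespace token
offsets_a = [1] + raw_offsets_a[:-1]   # [1, 2, 3, 4]: start position of each token after [CLS]
idx1 = 1
idx1_offset = offsets_a[idx1]          # 2, the wordpiece position where "played" starts

The same shift is applied to sentence B, except its start positions begin right after the last wordpiece of sentence A.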

AliOsm commented 4 years ago

So, is the following a valid data reader for loading single-sentence examples and finetuning the model on them?

import csv
from typing import Iterable

from allennlp.common.file_utils import cached_path
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
from allennlp.data.fields import LabelField
from allennlp.data.instance import Instance
from kb.bert_tokenizer_and_candidate_generator import TokenizerAndCandidateGenerator

@DatasetReader.register("x")
class XDatasetReader(DatasetReader):
    def __init__(self,
                 tokenizer_and_candidate_generator: TokenizerAndCandidateGenerator,
                 entity_markers: bool = False):
        super().__init__()
        self.label_to_index = {'1': 1, '0': 0}
        self.tokenizer = tokenizer_and_candidate_generator
        self.tokenizer.whitespace_tokenize = True
        self.entity_markers = entity_markers

    def text_to_instance(self, line) -> Instance:
        # Not used: instances are built directly in `_read`.
        raise NotImplementedError

    def _read(self, file_path: str) -> Iterable[Instance]:
        """Creates examples for the training and dev sets."""

        sentences = list()
        labels = list()

        with open(cached_path(file_path + '-x.csv'), 'r') as file:
            reader = csv.reader(file)

            for row in reader:
                sentences.append(row[1].strip())

        with open(cached_path(file_path + '-y.csv'), 'r') as file:
            reader = csv.reader(file)

            for row in reader:
                # keep labels as strings so they match the label_to_index keys
                labels.append(row[1].strip())

        assert len(labels) == len(sentences), f'The length of the labels and sentences must match. ' \
            f'Got {len(labels)} and {len(sentences)}.'

        for line, label in zip(sentences, labels):
            text_a = line

            token_candidates = self.tokenizer.tokenize_and_generate_candidates(text_a)
            fields = self.tokenizer.convert_tokens_candidates_to_fields(token_candidates)
            fields['label_ids'] = LabelField(self.label_to_index[label], skip_indexing=True)

            instance = Instance(fields)

            yield instance
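
As a quick sanity check, you could iterate the reader directly. This is only a sketch: it assumes generator is an already-configured TokenizerAndCandidateGenerator (built the same way as in wic.jsonnet) and that /path/to/train-x.csv and /path/to/train-y.csv exist:

# Hypothetical smoke test -- the generator construction and file paths are assumptions.
reader = XDatasetReader(tokenizer_and_candidate_generator=generator)

for instance in reader._read('/path/to/train'):  # reads train-x.csv and train-y.csv
    print(instance.fields.keys())                # should include 'label_ids' plus the tokenizer fields
    break
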
AliOsm commented 4 years ago

Also, I have one final question about the training_config/downstream/wic.jsonnet file: can I remove the num_steps_per_epoch option, or should I calculate it and fill it in?

Thanks!

matt-peters commented 4 years ago

You should calculate it from the batch size and the size of the training dataset, and fill it in.
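
For example (numbers assumed): with 8,000 training sentences and a batch size of 32, you would set

import math

num_training_examples = 8000  # assumed size of the training set
batch_size = 32               # assumed value from the jsonnet config
num_steps_per_epoch = math.ceil(num_training_examples / batch_size)  # 250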