How to finetune KnowBERT for binary text classification tasks? For example, given a sequence of words, predict its polarity.
Thanks!
The Words-In-Context model is an example of binary classification -- predict whether a word used in two contexts carries the same sense or not. See kb/evaluation/wic_dataset_reader.py and training_config/downstream/wic.jsonnet.
Thanks for your reply!
Can you explain the following lines in kb/evaluation/wic_dataset_reader.py:
# get the indices of the marked words
# index in the original tokens
idx1, idx2 = [int(ind) for ind in tokens[2].split('-')]
offsets_a = [1] + token_candidates['offsets_a'][:-1]
idx1_offset = offsets_a[idx1]
offsets_b = [token_candidates['offsets_a'][-1] + 1] + token_candidates['offsets_b'][:-1]
idx2_offset = offsets_b[idx2]
fields['index_a'] = LabelField(idx1_offset, skip_indexing=True)
fields['index_b'] = LabelField(idx2_offset, skip_indexing=True)
I didn't understand them -- are they related to the WiC task?
Thanks!
Yes, these are related to the WiC task -- it's the code that finds the index of the marked word in each sentence (see Table 1 of https://arxiv.org/abs/1808.09121).
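To make the offset arithmetic concrete, here is a small worked example. It assumes (from reading the reader code, not from any documented contract) that offsets_a holds, for each whitespace token of sentence A, the exclusive end index of its wordpieces in the combined sequence, with [CLS] at position 0; the numbers are made up:

# Suppose sentence A has 4 whitespace tokens whose wordpieces occupy
# positions [1:2], [2:3], [3:5], and [5:6] ([CLS] sits at position 0).
offsets_a = [2, 3, 5, 6]            # exclusive end of each token
starts_a = [1] + offsets_a[:-1]     # [1, 2, 3, 5]: start of each token
idx1 = 2                            # the marked word is the 3rd token
idx1_offset = starts_a[idx1]        # 3, the first wordpiece of that word

Sentence B is handled the same way, except its first token starts one past sentence A's final offset, which skips over the [SEP] token.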
So, would the following be a valid data reader for loading single sentences and finetuning the model on them?

import csv
from typing import Iterable

from allennlp.common.file_utils import cached_path
from allennlp.data.dataset_readers.dataset_reader import DatasetReader
from allennlp.data.fields import LabelField
from allennlp.data.instance import Instance
from kb.bert_tokenizer_and_candidate_generator import TokenizerAndCandidateGenerator


@DatasetReader.register("x")
class XDatasetReader(DatasetReader):
    def __init__(self,
                 tokenizer_and_candidate_generator: TokenizerAndCandidateGenerator,
                 entity_markers: bool = False):
        super().__init__()
        self.label_to_index = {'1': 1, '0': 0}
        self.tokenizer = tokenizer_and_candidate_generator
        self.tokenizer.whitespace_tokenize = True
        self.entity_markers = entity_markers

    def text_to_instance(self, line) -> Instance:
        raise NotImplementedError

    def _read(self, file_path: str) -> Iterable[Instance]:
        """Creates examples for the training and dev sets."""
        sentences = []
        labels = []

        with open(cached_path(file_path + '-x.csv'), 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                sentences.append(row[1].strip())

        with open(cached_path(file_path + '-y.csv'), 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                # keep labels as strings so they match the keys
                # of self.label_to_index
                labels.append(row[1].strip())

        assert len(labels) == len(sentences), \
            f'The length of the labels and sentences must match. ' \
            f'Got {len(labels)} and {len(sentences)}.'

        for text_a, label in zip(sentences, labels):
            token_candidates = self.tokenizer.tokenize_and_generate_candidates(text_a)
            fields = self.tokenizer.convert_tokens_candidates_to_fields(token_candidates)
            fields['label_ids'] = LabelField(self.label_to_index[label], skip_indexing=True)
            yield Instance(fields)
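For clarity, this is the on-disk layout the reader assumes: two CSV files sharing a prefix, with an example id in column 0 and the sentence or label in column 1. A tiny synthetic example (the file names and contents are made up):

import csv

# Writes 'train-x.csv' and 'train-y.csv' in the layout _read expects,
# so reader.read('train') would yield two labelled instances.
with open('train-x.csv', 'w', newline='') as f:
    csv.writer(f).writerows([(0, 'this movie was great'), (1, 'utterly boring')])
with open('train-y.csv', 'w', newline='') as f:
    csv.writer(f).writerows([(0, '1'), (1, '0')])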
Also, I have one final question about the training_config/downstream/wic.jsonnet file: can I remove the num_steps_per_epoch option, or should I calculate it and fill it in?
Thanks!
You should calculate it based on the batch size and the size of the training dataset and fill it in.
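For example (the numbers are illustrative -- use your own dataset size and the batch_size from the config):

import math

# One epoch takes ceil(num_examples / batch_size) gradient steps.
num_training_examples = 5428   # hypothetical: suppose the train split has 5428 examples
batch_size = 32
num_steps_per_epoch = math.ceil(num_training_examples / batch_size)   # 170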