cgmhaicenter / exBERT


Pretraining for sequence classification #7

Open TahaAslani opened 3 years ago

TahaAslani commented 3 years ago

Hi,

I am implementing fine-tuning of exBERT for sequence classification. I have already done the pre-training on my data. However, since the pre-training Python script that you have provided is only for NER, I was wondering how I should implement tokenization. Should I just load the model and tokenize my text like this?

from exBERT import BertTokenizer, BertForSequenceClassification

model = BertForSequenceClassification(path_to_config_file_of_the_OFF_THE_SHELF_MODEL,
                                      'config_and_vocab/exBERT_no_ex_vocab/bert_config_ex_s3.json',
                                      len(list_of_labels))
tokenizer = BertTokenizer(path_to_off_the_shelf_model_vocab)

and just use it as a regular Hugging Face model, or do I have to add certain lines for handling the new vocabulary (tokens that start with ##)?

Thanks for your help in advance!

sonicrux commented 2 years ago

This is the way I got classification to work -

# Get your imports
import torch
from exBERT import BertForSequenceClassification, BertConfig
from transformers import BertTokenizer

# Load in your config files 
bert_config_1 = BertConfig.from_json_file('path_to_off_the_shelf_config_file')
bert_config_2 = BertConfig.from_json_file('updated_config_file_with_new_vocab_size')
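# (Assumption based on the file names above: the first config describes the
# off-the-shelf BERT, the second one the exBERT extension module whose vocab
# size reflects the augmented vocabulary)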

# Initialize your classification object
num_labels = 2
model = BertForSequenceClassification(bert_config_1, bert_config_2, num_labels=num_labels)

# Load in your pretrained state dict
model.load_state_dict(torch.load('path_to_state_dict_from_pretraining'), strict=False)
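# (strict=False is needed because the pretraining checkpoint contains no weights for
# the newly added classification head, so that layer keeps its fresh initialization)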

# Initialize tokenizer 
tokenizer = BertTokenizer(vocab_file='path_to_augmented_vocab.txt')
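
# Quick sanity check (illustrative; 'your_new_domain_term' is a placeholder): a term
# present in the augmented vocab should come back as a single token instead of being
# split into '##' word pieces, which suggests no extra handling of the new vocabulary
# is needed at this point
print(tokenizer.tokenize('your_new_domain_term'))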

# Tokenize input
input_ids = []
attention_masks = []
for sentence in sentences:
        encoded_dict = tokenizer.encode_plus(
                            sentence,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                            max_length = 512,           # Pad & truncate all sentences.
                            pad_to_max_length = True,
                            return_attention_mask = True,   # Construct attn. masks.
                            return_tensors = 'pt',     # Return pytorch tensors.
                       )

        # Add the encoded sentence to the list.    
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

# At this point you'll convert your input_ids, attention_masks and labels to pytorch tensors
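# For example (assuming `labels` is a list of integer class ids; names are illustrative):
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)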

# Get model output 
# You should probably batch-ify this
(loss, logits) = model(input_ids, 
                                      token_type_ids=None, 
                                      attention_mask=attention_masks,
                                      labels=labels)
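
For reference, here is a minimal sketch of how the forward pass could be batched (the batch size and the (loss, logits) return shape are assumptions carried over from the snippet above):

from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(input_ids, attention_masks, labels)
loader = DataLoader(dataset, batch_size=16)

for batch_ids, batch_masks, batch_labels in loader:
    loss, logits = model(batch_ids,
                         token_type_ids=None,
                         attention_mask=batch_masks,
                         labels=batch_labels)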