IBM / fastfit

FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes
Apache License 2.0

should the base model be roberta-base? #6

Open nishkalavallabhi opened 6 months ago

nishkalavallabhi commented 6 months ago

Changing the base model from roberta-base to bert-base throws a forward-function error. Have you tested with base models other than RoBERTa, and what in your model is specific to RoBERTa?

elronbandel commented 6 months ago

Thanks for noting. I'll try it myself and see. It might fail because different models use different names for the hidden states or the classification head. At one point we did run it with other models such as BERT, but the code has changed a lot since then, so I will check.
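For reference, here is a minimal sketch (plain transformers, not FastFit-specific; the checkpoint classes are the standard Hugging Face ones) of one such naming difference: the encoder submodule is named after the architecture, so code that hard-codes one name breaks when the backbone is swapped.

from transformers import (
    BertForSequenceClassification,
    RobertaForSequenceClassification,
)

# The encoder attribute is named after the architecture, so any code that
# reaches into model.roberta (or model.bert) directly stops working when
# the backbone changes.
print(RobertaForSequenceClassification.base_model_prefix)  # "roberta"
print(BertForSequenceClassification.base_model_prefix)     # "bert"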

ShwetaliShimangaud commented 6 months ago

The issue is with the tokenizer that is used. Tokenizers for RoBERTa or the sentence transformer (used in the example code) return "input_ids" and "attention_mask" for the text, but BERT's tokenizer also returns "token_type_ids" alongside them. "token_type_ids" marks which segment of the input each token belongs to, i.e. it is 0 for tokens of the first sentence, 1 for the second sentence, and so on. For a classification task the input is not paired (unlike in a question-answering system), so "token_type_ids" is all zeros and carries no information; it can simply be dropped with the following change:

from fastfit import FastFit
from transformers import AutoTokenizer, pipeline

model = FastFit.from_pretrained("fastfit-bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, return_all_scores=False)

# Instead of calling the pipeline directly:
# pred = classifier(text)

# Do this: run the pipeline stages manually and pass only the inputs the
# forward pass expects, leaving out BERT's token_type_ids.
preprocessed_ip = classifier.preprocess(text)
model_op = classifier.forward({
    "input_ids": preprocessed_ip["input_ids"],
    "attention_mask": preprocessed_ip["attention_mask"],
})
pred = classifier.postprocess(model_op, top_k=2)
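To see why dropping the field is safe, a small check (plain transformers; the checkpoint names are the public Hugging Face ones) confirms that only BERT's tokenizer adds token_type_ids, and that for a single, unpaired input they are all zeros:

from transformers import AutoTokenizer

text = "hello world"
for name in ["roberta-base", "google-bert/bert-base-uncased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoded = tokenizer(text)
    # RoBERTa: ['input_ids', 'attention_mask']
    # BERT:    ['input_ids', 'token_type_ids', 'attention_mask']
    print(name, list(encoded.keys()))

# For an unpaired classification input, BERT's token_type_ids are all 0,
# so filtering them out before the forward pass loses no information.
bert_tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
print(bert_tok(text)["token_type_ids"])  # e.g. [0, 0, 0, 0]

An equivalent variant of the workaround above would be to pop "token_type_ids" from the preprocessed dict before calling classifier.forward, which keeps the rest of the pipeline call unchanged.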