AI4Bharat / Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For the latest IndicBERT v2, see: https://github.com/AI4Bharat/IndicBERT
https://indicnlp.ai4bharat.org
MIT License

Not able to fine-tune for text classification using the Hugging Face library #29

Closed · shaktisd closed this 3 years ago

shaktisd commented 3 years ago

Hi, I am trying to fine-tune Indic-BERT on the IITP Movie Reviews dataset. It is not working with either AutoTokenizer / AutoModelForSequenceClassification or AlbertTokenizer / AlbertForSequenceClassification. Below is the code I am running; I get the same exception if I switch from the Albert classes to the Auto classes.

I am getting this error: ValueError: Checkpoint was expecting a trackable object (an object derived from TrackableBase), got AlbertForSequenceClassification(

import pandas as pd
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
# Import generic wrappers
#from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Load the three IITP Movie Reviews splits and strip embedded newlines
train_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-train.csv', names=['label','text'])
train_df['text'] = train_df['text'].str.replace('\n','')

test_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-test.csv', names=['label','text'])
test_df['text'] = test_df['text'].str.replace('\n','')

valid_df = pd.read_csv(r'data\iitp-movie-reviews\hi\hi-valid.csv', names=['label','text'])
valid_df['text'] = valid_df['text'].str.replace('\n','')

display(train_df.head(2))
display(train_df['label'].unique())

# Encode the string labels as integer ids (fit on train, reuse on test/valid)
le = LabelEncoder()
x_train = train_df['text'].tolist()
y_train = list(le.fit_transform(train_df['label'].tolist()))

x_test = test_df['text'].tolist()
y_test = list(le.transform(test_df['label'].tolist()))

x_valid = valid_df['text'].tolist()
y_valid = list(le.transform(valid_df['label'].tolist()))

# Define the model repo
model_name = "ai4bharat/indic-bert" 

tokenizer = AlbertTokenizer.from_pretrained(model_name)

train_encodings = tokenizer(x_train, truncation=True, padding=True)
test_encodings = tokenizer(x_test, truncation=True, padding=True)
valid_encodings = tokenizer(x_valid, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(valid_encodings),
    y_valid
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

from transformers import TFTrainer, TFTrainingArguments

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

with training_args.strategy.scope():
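    # Note: AlbertForSequenceClassification is the PyTorch implementation of the
    # model, while TFTrainer checkpoints TensorFlow objects, so passing a PyTorch
    # module here is the likely source of the "trackable object" ValueError.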
    model = AlbertForSequenceClassification.from_pretrained(model_name)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

gowtham1997 commented 3 years ago

Seems to be a TensorFlow checkpoint-loading error.

Can you try running the same snippet without TF, i.e. using PyTorch (TFTrainingArguments -> TrainingArguments, TFTrainer -> Trainer, etc.), and see if you still get an error?
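
For anyone following along, here is a minimal sketch of that PyTorch conversion, reusing the tokenizer output and label lists from the snippet above. The ReviewDataset wrapper and the num_labels argument are illustrative additions, not part of the original code:

import torch
from transformers import (AlbertForSequenceClassification, Trainer,
                          TrainingArguments)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and integer labels for the PyTorch Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewDataset(train_encodings, y_train)
val_dataset = ReviewDataset(valid_encodings, y_valid)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# num_labels must match the number of classes produced by the LabelEncoder
model = AlbertForSequenceClassification.from_pretrained(
    model_name, num_labels=len(le.classes_))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

If this runs, the original failure points at the TF/PyTorch mismatch: AlbertForSequenceClassification is a PyTorch class, so staying with TFTrainer would instead require the TensorFlow class TFAlbertForSequenceClassification (loaded with from_pt=True if only PyTorch weights are published for the checkpoint).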