head-iie-vnr opened 5 days ago
When I used a batch_size of 2, it crashed with an out-of-memory error. Even after bringing the initial memory state down to 13.5 GB free (1.6 GB used), it still crashed on hitting the 16 GB upper limit.
When I reduced the batch_size to 1, it was manageable.
Special observation: each training iteration was taking ~20 seconds, and the same can be observed in the Graph Heart Beat: the lowest point is the start of a new step (iteration).
The training data contains 40 question-and-answer pairs. The original context text contains 900 words across 50 sentences.
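Incidentally, the per-step dips can also be logged from code. Below is a minimal sketch (assuming psutil is installed; MemoryLoggerCallback is a hypothetical name) that prints free system RAM at the end of every optimizer step via transformers' TrainerCallback hook:

import psutil
from transformers import TrainerCallback

class MemoryLoggerCallback(TrainerCallback):
    # Hypothetical helper: prints free system RAM after every optimizer step,
    # mirroring the "heartbeat" dips described above.
    def on_step_end(self, args, state, control, **kwargs):
        free_gb = psutil.virtual_memory().available / 1e9
        print(f"step {state.global_step}: {free_gb:.2f} GB free")

It can be attached with trainer.add_callback(MemoryLoggerCallback()) before calling trainer.train().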
Importing Required Libraries:
import json
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
from datasets import DatasetDict, Dataset
This imports the required classes from the transformers and datasets libraries.

Loading Custom Dataset:
def load_custom_dataset(file_path):
    with open(file_path, 'r') as f:
        dataset_dict = json.load(f)
    return dataset_dict
# Assuming your dataset file path is 'custom_qa_dataset.json'
custom_dataset = load_custom_dataset('custom_qa_dataset.json')
Converting to SQuAD Format:
def convert_to_squad_format(custom_dataset):
    contexts = []
    questions = []
    answers = []
    for data in custom_dataset["data"]:
        for paragraph in data["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                question = qa["question"]
                for answer in qa["answers"]:
                    contexts.append(context)
                    questions.append(question)
                    answers.append({
                        "text": answer["text"],
                        "answer_start": answer["answer_start"]
                    })
    return {
        "context": contexts,
        "question": questions,
        "answers": answers
    }
squad_format_dataset = convert_to_squad_format(custom_dataset)
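To illustrate the nesting this function expects, here is a hypothetical one-QA sample (the real file has 40 QA pairs over the 900-word context):

sample = {
    "data": [{
        "paragraphs": [{
            "context": "Mahatma Gandhi was born on 2 October 1869 in Porbandar.",
            "qas": [{
                "question": "When was Gandhi born?",
                "answers": [{"text": "2 October 1869", "answer_start": 27}],
            }],
        }],
    }],
}
print(convert_to_squad_format(sample))
# {'context': ['Mahatma Gandhi was born on 2 October 1869 in Porbandar.'],
#  'question': ['When was Gandhi born?'],
#  'answers': [{'text': '2 October 1869', 'answer_start': 27}]}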
Creating Hugging Face Dataset:
dataset = DatasetDict({"train": Dataset.from_dict(squad_format_dataset)})
This creates a Dataset from the structured data and puts it into a DatasetDict under the "train" key.

Loading the Tokenizer and Model:
model_name = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
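As a quick footprint check (a sketch; the parameter figure in the comment is approximate for longformer-base-4096), the model size alone hints at the memory pressure before any activations are allocated:

# longformer-base-4096 has roughly 150M parameters. In fp32 that is ~0.6 GB of
# weights; gradients plus AdamW optimizer states roughly quadruple that,
# before any 4096-token activations are counted.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")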
Tokenizing the Dataset:
def preprocess_function(examples):
    inputs = tokenizer(
        examples["question"],
        examples["context"],
        max_length=4096,
        truncation=True,
        padding="max_length",
        return_offsets_mapping=True,
    )
    offset_mapping = inputs.pop("offset_mapping")
    start_positions = []
    end_positions = []
    for i, answer in enumerate(examples["answers"]):
        start_char = answer["answer_start"]
        end_char = start_char + len(answer["text"])
        sequence_ids = inputs.sequence_ids(i)
        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # If the answer is out of the context, label it (0, 0)
        if not (offset_mapping[i][context_start][0] <= start_char and offset_mapping[i][context_end][1] >= end_char):
            start_positions.append(0)
            end_positions.append(0)
        else:
            start_idx = context_start
            while offset_mapping[i][start_idx][0] <= start_char:
                start_idx += 1
            start_positions.append(start_idx - 1)
            end_idx = context_end
            while offset_mapping[i][end_idx][1] >= end_char:
                end_idx -= 1
            end_positions.append(end_idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
tokenized_datasets = dataset.map(preprocess_function, batched=True, batch_size=2)
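Before wiring up the Trainer, a quick sanity check (a sketch) is to decode the labeled span of one example and confirm it matches the annotated answer text:

sample = tokenized_datasets["train"][0]
span_ids = sample["input_ids"][sample["start_positions"]:sample["end_positions"] + 1]
print(tokenizer.decode(span_ids))  # should print the annotated answer text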
preprocess_function tokenizes the questions and contexts, and dataset.map(preprocess_function, batched=True, batch_size=2) applies the preprocessing function to the dataset in batches of size 2.

Setting Up Training Arguments:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=1,  # Reduced batch size
    per_device_eval_batch_size=1,  # Reduced batch size
    num_train_epochs=3,
    weight_decay=0.01,
    gradient_accumulation_steps=8,  # Use gradient accumulation
    fp16=True,  # Enable mixed precision training
)
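Note that per_device_train_batch_size=1 combined with gradient_accumulation_steps=8 gives an effective batch size of 1 × 8 = 8: gradients from 8 single-example forward/backward passes are accumulated before each optimizer step, so only one example's activations are held in memory at a time.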
Initializing the Trainer:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["train"],
)
This initializes the Trainer with the model, training arguments, and the tokenized dataset for training and evaluation.

Training the Model:
trainer.train()
Saving the Fine-Tuned Model:
model.save_pretrained("./fine-tuned-longformer")
tokenizer.save_pretrained("./fine-tuned-longformer")
The code loads a custom QA dataset, converts it to a format compatible with Hugging Face's transformers
library, tokenizes the data, sets up training parameters, trains the Longformer model on the dataset, and finally saves the fine-tuned model and tokenizer.
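For reference, the predictions below can be reproduced with a minimal inference sketch (assuming the question-answering pipeline was used; context_text stands for the original 900-word context string):

from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="./fine-tuned-longformer",
    tokenizer="./fine-tuned-longformer",
)
# context_text is assumed to hold the original ~900-word context
result = qa(question="When was Gandhi born?", context=context_text)
print(result["answer"], result["score"])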
The results are not satisfactory:

Question: When was Gandhi born? Answer: was born (score: 0.0077830287627875805)
Question: Where was Gandhi born? Answer: was born
Trying a different model in Issue #14.