Vignana-Jyothi / kp-learnings

Curiosity & Learnings

[Gen AI] allenai Model usage. #13

Open head-iie-vnr opened 5 days ago

head-iie-vnr commented 5 days ago

When I used a batch_size of 2, training crashed with an out-of-memory error. Even after bringing the initial memory state down to 13.5 GB free (1.6 GB in use), it still crashed on hitting the 16 GB upper limit.

[Screenshot from 2024-06-30 06-56-20: memory usage with batch_size=2]

When I reduced the batch_size to 1, memory usage was manageable.

[Screenshot from 2024-06-30 07-04-36: memory usage with batch_size=1]

Special observation: each training iteration took about 20 seconds, and the same period shows up as a heartbeat pattern in the memory graph. The lowest point marks the start of a new step (iteration).

head-iie-vnr commented 5 days ago

The training data contains 40 question-and-answer pairs.

The original context text contains 900 words in 50 sentences.
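
For reference, the loader in the next comment assumes a SQuAD-style JSON layout along these lines. This is a hypothetical illustration pieced together from the field names used in the conversion code ("data", "paragraphs", "context", "qas", "answers", "answer_start"); the actual file contents differ.

    # Hypothetical sketch of the expected structure of custom_qa_dataset.json
    # (field names taken from convert_to_squad_format below; values are made up).
    example_dataset = {
        "data": [
            {
                "paragraphs": [
                    {
                        "context": "Mahatma Gandhi was born on 2 October 1869 in Porbandar ...",
                        "qas": [
                            {
                                "question": "When was Gandhi born?",
                                "answers": [
                                    {"text": "2 October 1869", "answer_start": 27}
                                ],
                            }
                        ],
                    }
                ]
            }
        ]
    }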

head-iie-vnr commented 5 days ago

Code: Step-by-Step Explanation

  1. Importing Required Libraries:

    import json
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
    from datasets import DatasetDict, Dataset
    • json: For loading the custom dataset from a JSON file.
    • transformers: For using the Hugging Face library to handle tokenization, model loading, and training.
    • datasets: For managing the dataset in a format compatible with Hugging Face's transformers.
  2. Loading Custom Dataset:

    def load_custom_dataset(file_path):
       with open(file_path, 'r') as f:
           dataset_dict = json.load(f)
       return dataset_dict
    
    # Assuming your dataset file path is 'custom_qa_dataset.json'
    custom_dataset = load_custom_dataset('custom_qa_dataset.json')
    • This function reads a JSON file containing the custom QA dataset and loads it into a Python dictionary.
  3. Converting to SQuAD Format:

    def convert_to_squad_format(custom_dataset):
       contexts = []
       questions = []
       answers = []
    
       for data in custom_dataset["data"]:
           for paragraph in data["paragraphs"]:
               context = paragraph["context"]
               for qa in paragraph["qas"]:
                   question = qa["question"]
                   for answer in qa["answers"]:
                       contexts.append(context)
                       questions.append(question)
                       answers.append({
                           "text": answer["text"],
                           "answer_start": answer["answer_start"]
                       })
    
       return {
           "context": contexts,
           "question": questions,
           "answers": answers
       }
    
    squad_format_dataset = convert_to_squad_format(custom_dataset)
    • This function converts the custom dataset into a format similar to the SQuAD dataset format.
    • It extracts contexts, questions, and answers into separate lists and returns a dictionary with these lists.
  4. Creating Hugging Face Dataset:

    dataset = DatasetDict({"train": Dataset.from_dict(squad_format_dataset)})
    • This creates a Hugging Face Dataset from the structured data and puts it into a DatasetDict under the "train" key.
  5. Loading the Tokenizer and Model:

    model_name = "allenai/longformer-base-4096"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    • Loads the tokenizer and model for the Longformer, which is capable of handling long contexts up to 4096 tokens.
  6. Tokenizing the Dataset:

    def preprocess_function(examples):
       inputs = tokenizer(
           examples["question"],
           examples["context"],
           max_length=4096,
           truncation=True,
           padding="max_length",
           return_offsets_mapping=True,
       )
       offset_mapping = inputs.pop("offset_mapping")
       start_positions = []
       end_positions = []
    
       for i, answer in enumerate(examples["answers"]):
           start_char = answer["answer_start"]
           end_char = start_char + len(answer["text"])
           sequence_ids = inputs.sequence_ids(i)
    
           # Find the start and end of the context
           idx = 0
           while sequence_ids[idx] != 1:
               idx += 1
           context_start = idx
           while sequence_ids[idx] == 1:
               idx += 1
           context_end = idx - 1
    
           # If the answer is out of the context, label it (0, 0)
           if not (offset_mapping[i][context_start][0] <= start_char and offset_mapping[i][context_end][1] >= end_char):
               start_positions.append(0)
               end_positions.append(0)
           else:
               # Otherwise, advance to the first context token whose span starts
               # after the answer, then step back one token for the start position.
               start_idx = context_start
               while start_idx <= context_end and offset_mapping[i][start_idx][0] <= start_char:
                   start_idx += 1
               start_positions.append(start_idx - 1)

               # Walk backwards to the first context token whose span ends before
               # the answer ends, then step forward one token for the end position.
               end_idx = context_end
               while end_idx >= context_start and offset_mapping[i][end_idx][1] >= end_char:
                   end_idx -= 1
               end_positions.append(end_idx + 1)
    
       inputs["start_positions"] = start_positions
       inputs["end_positions"] = end_positions
       return inputs
    
    tokenized_datasets = dataset.map(preprocess_function, batched=True, batch_size=2)
    • The preprocess_function tokenizes the questions and contexts.
    • It calculates the start and end positions of the answers in the tokenized context.
    • dataset.map(preprocess_function, batched=True, batch_size=2) applies the preprocessing function to the dataset in batches of 2 examples; this batch_size only controls how many examples are tokenized per call, not the training batch size.
  7. Setting Up Training Arguments:

    training_args = TrainingArguments(
       output_dir="./results",
       evaluation_strategy="epoch",
       learning_rate=2e-5,
       per_device_train_batch_size=1,  # Reduced batch size
       per_device_eval_batch_size=1,  # Reduced batch size
       num_train_epochs=3,
       weight_decay=0.01,
       gradient_accumulation_steps=8,  # Use gradient accumulation
       fp16=True,  # Enable mixed precision training
    )
    • Configures the training parameters: output directory, learning rate, per-device batch size, number of epochs, weight decay, gradient accumulation, and mixed-precision (fp16) training. With per_device_train_batch_size=1 and gradient_accumulation_steps=8, the effective batch size is 1 × 8 = 8 while peak memory stays close to that of a single example.
  8. Initializing the Trainer:

    trainer = Trainer(
       model=model,
       args=training_args,
       train_dataset=tokenized_datasets["train"],
       eval_dataset=tokenized_datasets["train"],
    )
    • Initializes the Trainer with the model, training arguments, and the tokenized dataset. Note that the same "train" split is passed as both train_dataset and eval_dataset here, so evaluation runs on the training data.
  9. Training the Model:

    trainer.train()
    • Trains the model using the specified training arguments and dataset.
  10. Saving the Fine-Tuned Model:

    model.save_pretrained("./fine-tuned-longformer")
    tokenizer.save_pretrained("./fine-tuned-longformer")
    • Saves the fine-tuned model and tokenizer to the specified directory.

Summary

The code loads a custom QA dataset, converts it to a format compatible with Hugging Face's transformers library, tokenizes the data, sets up training parameters, trains the Longformer model on the dataset, and finally saves the fine-tuned model and tokenizer.

head-iie-vnr commented 5 days ago

The results are not satisfactory

    Question: When was Gandhi born?   Answer: was born   Score: 0.0077830287627875805
    Question: Where was Gandhi born?  Answer: was born
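
For reference, answers in this form can be obtained by running a question-answering pipeline over the saved model. A minimal sketch, assuming the fine-tuned model was saved to ./fine-tuned-longformer as in the code above; the context string here is only illustrative, the real evaluation used the original 900-word text:

    from transformers import pipeline

    # Load the fine-tuned model and tokenizer saved by the training script.
    qa_pipeline = pipeline(
        "question-answering",
        model="./fine-tuned-longformer",
        tokenizer="./fine-tuned-longformer",
    )

    # Illustrative context only; substitute the full original context text.
    context = "Mahatma Gandhi was born on 2 October 1869 in Porbandar, Gujarat."

    for question in ["When was Gandhi born?", "Where was Gandhi born?"]:
        result = qa_pipeline(question=question, context=context)
        print(f"Question: {question} Answer: {result['answer']} Score: {result['score']}")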

head-iie-vnr commented 5 days ago

Trying a different model in Issue #14.