Vignana-Jyothi / kp-learnings

Curiosity & Learnings

[Gen AI] allenai Model usage. #13

Open head-iie-vnr opened 5 days ago

head-iie-vnr commented 5 days ago

When I used a batch_size of 2, training crashed with an out-of-memory error. Even after bringing the initial memory state down to 13.5 GB free (1.6 GB in use), it still crashed on hitting the 16 GB upper limit.

[Screenshot from 2024-06-30 06-56-20: memory usage with batch_size=2]

When I reduced the batch_size to 1, memory usage was manageable.

[Screenshot from 2024-06-30 07-04-36: memory usage with batch_size=1]

Special observation: each training iteration took about 20 seconds, and the same period shows up as a heartbeat pattern in the memory graph. The lowest point marks the start of a new step (iteration).

head-iie-vnr commented 5 days ago

The training data contains 40 question-and-answer pairs.

The original context text contains 900 words in 50 sentences.
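
For reference, the loader in the next comment assumes a SQuAD-style JSON layout along these lines. This is a hypothetical illustration pieced together from the field names used in the conversion code ("data", "paragraphs", "context", "qas", "answers", "answer_start"); the actual file contents differ.

    # Hypothetical sketch of the expected structure of custom_qa_dataset.json
    # (field names taken from convert_to_squad_format below; values are made up).
    example_dataset = {
        "data": [
            {
                "paragraphs": [
                    {
                        "context": "Mahatma Gandhi was born on 2 October 1869 in Porbandar ...",
                        "qas": [
                            {
                                "question": "When was Gandhi born?",
                                "answers": [
                                    {"text": "2 October 1869", "answer_start": 27}
                                ],
                            }
                        ],
                    }
                ]
            }
        ]
    }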

head-iie-vnr commented 5 days ago

Code: Step-by-Step Explanation

  1. Importing Required Libraries:

    import json
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering, Trainer, TrainingArguments
    from datasets import DatasetDict, Dataset
    • json: For loading the custom dataset from a JSON file.
    • transformers: For using the Hugging Face library to handle tokenization, model loading, and training.
    • datasets: For managing the dataset in a format compatible with Hugging Face's transformers.
  2. Loading Custom Dataset:

    def load_custom_dataset(file_path):
       with open(file_path, 'r') as f:
           dataset_dict = json.load(f)
       return dataset_dict
    
    # Assuming your dataset file path is 'custom_qa_dataset.json'
    custom_dataset = load_custom_dataset('custom_qa_dataset.json')
    • This function reads a JSON file containing the custom QA dataset and loads it into a Python dictionary.
  3. Converting to SQuAD Format:

    def convert_to_squad_format(custom_dataset):
       contexts = []
       questions = []
       answers = []
    
       for data in custom_dataset["data"]:
           for paragraph in data["paragraphs"]:
               context = paragraph["context"]
               for qa in paragraph["qas"]:
                   question = qa["question"]
                   for answer in qa["answers"]:
                       contexts.append(context)
                       questions.append(question)
                       answers.append({
                           "text": answer["text"],
                           "answer_start": answer["answer_start"]
                       })
    
       return {
           "context": contexts,
           "question": questions,
           "answers": answers
       }
    
    squad_format_dataset = convert_to_squad_format(custom_dataset)
    • This function converts the custom dataset into a format similar to the SQuAD dataset format.
    • It extracts contexts, questions, and answers into separate lists and returns a dictionary with these lists.
  4. Creating Hugging Face Dataset:

    dataset = DatasetDict({"train": Dataset.from_dict(squad_format_dataset)})
    • This creates a Hugging Face Dataset from the structured data and puts it into a DatasetDict under the "train" key.
  5. Loading the Tokenizer and Model:

    model_name = "allenai/longformer-base-4096"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)
    • Loads the tokenizer and model for the Longformer, which is capable of handling long contexts up to 4096 tokens.
  6. Tokenizing the Dataset:

    def preprocess_function(examples):
       inputs = tokenizer(
           examples["question"],
           examples["context"],
           max_length=4096,
           truncation=True,
           padding="max_length",
           return_offsets_mapping=True,
       )
       offset_mapping = inputs.pop("offset_mapping")
       start_positions = []
       end_positions = []
    
       for i, answer in enumerate(examples["answers"]):
           start_char = answer["answer_start"]
           end_char = start_char + len(answer["text"])
           sequence_ids = inputs.sequence_ids(i)
    
           # Find the start and end of the context
           idx = 0
           while sequence_ids[idx] != 1:
               idx += 1
           context_start = idx
           while sequence_ids[idx] == 1:
               idx += 1
           context_end = idx - 1
    
           # If the answer is out of the context, label it (0, 0)
           if not (offset_mapping[i][context_start][0] <= start_char and offset_mapping[i][context_end][1] >= end_char):
               start_positions.append(0)
               end_positions.append(0)
           else:
               # Otherwise, advance to the first context token whose span starts
               # after the answer, then step back one token for the start position.
               start_idx = context_start
               while start_idx <= context_end and offset_mapping[i][start_idx][0] <= start_char:
                   start_idx += 1
               start_positions.append(start_idx - 1)

               # Walk backwards to the first context token whose span ends before
               # the answer ends, then step forward one token for the end position.
               end_idx = context_end
               while end_idx >= context_start and offset_mapping[i][end_idx][1] >= end_char:
                   end_idx -= 1
               end_positions.append(end_idx + 1)
    
       inputs["start_positions"] = start_positions
       inputs["end_positions"] = end_positions
       return inputs
    
    tokenized_datasets = dataset.map(preprocess_function, batched=True, batch_size=2)
    • The preprocess_function tokenizes the questions and contexts.
    • It calculates the start and end positions of the answers in the tokenized context.
    • dataset.map(preprocess_function, batched=True, batch_size=2) applies the preprocessing function to the dataset in batches of 2 examples; this batch_size only controls how many examples are tokenized per call, not the training batch size.
  7. Setting Up Training Arguments:

    training_args = TrainingArguments(
       output_dir="./results",
       evaluation_strategy="epoch",
       learning_rate=2e-5,
       per_device_train_batch_size=1,  # Reduced batch size
       per_device_eval_batch_size=1,  # Reduced batch size
       num_train_epochs=3,
       weight_decay=0.01,
       gradient_accumulation_steps=8,  # Use gradient accumulation
       fp16=True,  # Enable mixed precision training
    )
    • Configures the training parameters: output directory, learning rate, per-device batch size, number of epochs, weight decay, gradient accumulation, and mixed-precision (fp16) training. With per_device_train_batch_size=1 and gradient_accumulation_steps=8, the effective batch size is 1 × 8 = 8 while peak memory stays close to that of a single example.
  8. Initializing the Trainer:

    trainer = Trainer(
       model=model,
       args=training_args,
       train_dataset=tokenized_datasets["train"],
       eval_dataset=tokenized_datasets["train"],
    )
    • Initializes the Trainer with the model, training arguments, and the tokenized dataset. Note that the same "train" split is passed as both train_dataset and eval_dataset here, so evaluation runs on the training data.
  9. Training the Model:

    trainer.train()
    • Trains the model using the specified training arguments and dataset.
  10. Saving the Fine-Tuned Model:

    model.save_pretrained("./fine-tuned-longformer")
    tokenizer.save_pretrained("./fine-tuned-longformer")
    • Saves the fine-tuned model and tokenizer to the specified directory.

Summary

The code loads a custom QA dataset, converts it to a format compatible with Hugging Face's transformers library, tokenizes the data, sets up training parameters, trains the Longformer model on the dataset, and finally saves the fine-tuned model and tokenizer.

head-iie-vnr commented 5 days ago

The results are not satisfactory

    Question: When was Gandhi born?   Answer: was born   Score: 0.0077830287627875805
    Question: Where was Gandhi born?  Answer: was born
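
For reference, answers in this form can be obtained by running a question-answering pipeline over the saved model. A minimal sketch, assuming the fine-tuned model was saved to ./fine-tuned-longformer as in the code above; the context string here is only illustrative, the real evaluation used the original 900-word text:

    from transformers import pipeline

    # Load the fine-tuned model and tokenizer saved by the training script.
    qa_pipeline = pipeline(
        "question-answering",
        model="./fine-tuned-longformer",
        tokenizer="./fine-tuned-longformer",
    )

    # Illustrative context only; substitute the full original context text.
    context = "Mahatma Gandhi was born on 2 October 1869 in Porbandar, Gujarat."

    for question in ["When was Gandhi born?", "Where was Gandhi born?"]:
        result = qa_pipeline(question=question, context=context)
        print(f"Question: {question} Answer: {result['answer']} Score: {result['score']}")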

head-iie-vnr commented 5 days ago

Trying a different model in Issue #14.