Pitsco / personal

MIT License

Favorite Project #1 - Research Paper Review Intern #9

Pitsco opened this issue 2 months ago

Pitsco commented 2 months ago

Research Paper

Hugging Face Yelp Dataset

# Install required libraries
!pip install transformers -U
!pip install "transformers[torch]"
!pip install "accelerate>=0.20.1" -U
!pip install datasets
!pip install evaluate

# Import necessary libraries
import torch
from datasets import load_dataset

# Load dataset
dataset = load_dataset("yelp_review_full")

# Access a sample from the dataset (optional)
dataset["train"][50000]

# Tokenization
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply tokenization to the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Create small training and evaluation datasets
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(25000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(25000))

# Initialize the model for sequence classification
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

# Configure training arguments
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", num_train_epochs=3)

# Import necessary modules for metric computation
import numpy as np
import evaluate

# Load the desired metric (e.g., accuracy)
metric = evaluate.load("accuracy")

# Define function to compute evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Initialize the Trainer with the model, training arguments, datasets, and metric computation function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)

# Start training
trainer.train()
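As a sanity check, the argmax-over-logits logic inside compute_metrics can be exercised on toy data without the evaluate library; a minimal sketch (the logits and labels below are invented for illustration, but the shapes match what the Trainer passes in):

```python
import numpy as np

# Toy logits for 4 examples over 5 classes (Yelp has 5 star ratings)
logits = np.array([
    [2.0, 0.1, 0.1, 0.1, 0.1],   # predicted class 0
    [0.1, 0.1, 3.0, 0.1, 0.1],   # predicted class 2
    [0.1, 0.1, 0.1, 0.1, 2.5],   # predicted class 4
    [0.1, 1.5, 0.1, 0.1, 0.1],   # predicted class 1
])
labels = np.array([0, 2, 3, 1])  # the third example is misclassified

# Same reduction compute_metrics performs before handing off to the metric
predictions = np.argmax(logits, axis=-1)
accuracy = float(np.mean(predictions == labels))
print(accuracy)  # 0.75
```

Three of the four toy predictions match their labels, so the accuracy metric should report 0.75 for this batch.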

RoBERTa:

Training Loss: This is the error calculated on the training dataset during each iteration (or epoch) of the training process. It represents how well the model is fitting the training data. The goal during training is to minimize this loss, which means the model is learning to make more accurate predictions on the training set.

Validation Loss: This is the error calculated on a separate validation dataset during each iteration (or epoch) of the training process. The validation dataset is not used for training; it's held out from the training process and used solely for evaluating the model's performance on unseen data. The validation loss provides an estimate of how well the model is generalizing to new, unseen data. If the validation loss starts to increase while the training loss is decreasing, it indicates that the model is overfitting to the training data and may not generalize well to new data.
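The overfitting signal described above (validation loss rising while training loss keeps falling) can also be checked programmatically. A minimal sketch in plain Python, assuming per-epoch loss lists have been collected from the trainer logs (the function name and window size are illustrative choices, not part of the original code):

```python
def is_overfitting(train_losses, val_losses, window=2):
    """Flag overfitting when, over the last `window` epoch-to-epoch steps,
    training loss strictly decreases while validation loss strictly increases."""
    if len(train_losses) < window + 1 or len(val_losses) < window + 1:
        return False  # not enough epochs to judge a trend
    recent_train = train_losses[-(window + 1):]
    recent_val = val_losses[-(window + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    val_rising = all(b > a for a, b in zip(recent_val, recent_val[1:]))
    return train_falling and val_rising

# Training loss keeps dropping, but validation loss turns upward -> overfitting
print(is_overfitting([0.9, 0.6, 0.4, 0.3], [0.8, 0.7, 0.75, 0.8]))  # True

# Both losses still falling -> model is still generalizing
print(is_overfitting([0.9, 0.6, 0.4, 0.3], [0.8, 0.7, 0.6, 0.5]))  # False
```

In practice, the transformers library's EarlyStoppingCallback serves the same purpose during training by halting when the monitored metric stops improving.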

Unsuccessful:

[image: output of an unsuccessful training run]

Successful:

[image: output of a successful training run]

Training Graphs (For Comparison):

[image: training graphs]

Additions To Project:

Creating a System:

Technology Utilization:

Analysis and Evaluation: