SimonBlanke / Gradient-Free-Optimizers

Simple and reliable optimization with local, global, population-based and sequential techniques in numerical discrete search spaces.
https://simonblanke.github.io/gradient-free-optimizers-documentation
MIT License

Can these methods be applied to llama model training? #49

Open AnonymityGithub opened 1 month ago

AnonymityGithub commented 1 month ago

It would be great if it could support BERT, LLaMA and other model training.

SimonBlanke commented 1 month ago

Hello @AnonymityGithub,

GFO is not designed to be used as an optimizer in deep-learning frameworks (like PyTorch or TensorFlow), because its optimization algorithms work without utilizing the gradient (the first derivative) of a function evaluation. In theory it is still possible to use GFO for this task, by ignoring the derivative and using only the loss itself to adapt the weights of the neural network. The problem with this is: for problems in which you are able to get the derivative, you should use it to get better results. Otherwise you leave out important information that would help your optimization algorithm find better solutions in less time. GFO is designed for problems where you do not have the first-order derivative but only the zeroth-order information (the function evaluation itself). One machine-learning example that does not provide a first-order derivative is hyperparameter optimization. For this kind of problem you can only use gradient-free optimization algorithms.
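
To make that concrete, here is a minimal hyperparameter-optimization sketch in the usual GFO style (the model and parameter ranges are only illustrative and not part of this issue):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
from gradient_free_optimizers import HillClimbingOptimizer

X, y = load_diabetes(return_X_y=True)

# The objective only returns a score (zeroth-order information); GFO maximizes it.
def objective(para):
    tree = DecisionTreeRegressor(
        max_depth=para["max_depth"],
        min_samples_split=para["min_samples_split"],
    )
    return cross_val_score(tree, X, y, cv=3).mean()

search_space = {
    "max_depth": np.arange(1, 25),
    "min_samples_split": np.arange(2, 40),
}

opt = HillClimbingOptimizer(search_space)
opt.search(objective, n_iter=50)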

However, I think it would be interesting to expand my work in the future (because I really like writing optimization-algorithms). Maybe I will implement a separate package for optimizers in deep-learning frameworks at some point.

I will close this issue, but feel free to ask follow-up questions if needed.

AnonymityGithub commented 1 month ago

Thanks for your reply. Using gradient-based methods to optimize the model is indeed the best solution. However, after the emergence of LLMs, gradient-based optimization methods require more memory, so gradient-free methods have emerged to optimize the model. So I am curious whether these methods can be used on deep learning models.
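
For context, the rough idea behind these zeroth-order methods (loosely in the spirit of MeZO and SPSA; this sketch is only illustrative and not taken from any of the papers cited later in this thread) is to estimate an update direction from forward passes alone, so no backward pass and no stored activations are needed:

import torch

def zo_sgd_step(params, loss_fn, eps=1e-3, lr=1e-4):
    z = [torch.randn_like(p) for p in params]      # random perturbation direction
    with torch.no_grad():
        for p, zi in zip(params, z):
            p.add_(eps * zi)                       # evaluate at theta + eps*z
        loss_plus = loss_fn()
        for p, zi in zip(params, z):
            p.sub_(2 * eps * zi)                   # evaluate at theta - eps*z
        loss_minus = loss_fn()
        for p, zi in zip(params, z):
            p.add_(eps * zi)                       # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)   # scalar estimate of the directional derivative
        for p, zi in zip(params, z):
            p.sub_(lr * g * zi)                    # SGD-style update along z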

SimonBlanke commented 1 month ago

However, after the emergence of LLMs, gradient-based optimization methods require more memory, so gradient-free methods have emerged to optimize the model ...

Intriguing! I was not aware of this. I will do some research within the next few days. If you have articles or other sources, you could post them here.

so I am curious whether these methods can be used on deep learning models

I would be very motivated to try implementing a short "proof of concept". Do you have an idea which deep-learning framework to use for this? If I want to prove that the optimizer works, I would need a model that is as simple as possible for development and testing.

AnonymityGithub commented 1 month ago

Do you have an idea which deep-learning framework to use for this? If I want to prove that the optimizer works, I would need a model that is as simple as possible for development and testing.

I think pytorch+huggingface is better. Huggingface provides most of the deep learning models.
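
For reference, a minimal sketch of how a model could be pulled from Hugging Face for such a proof of concept (this assumes the transformers package; the checkpoint chosen here is arbitrary):

from transformers import AutoModelForSequenceClassification

# One of the smaller standard checkpoints, still on the order of 66 million parameters.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
print(sum(p.numel() for p in model.parameters()))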

some papers:

  • MeZO: Fine-Tuning Language Models with Just Forward Passes
  • Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
  • ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization
  • Private Fine-tuning of Large Language Models with Zeroth-order Optimization

Thanks.

SimonBlanke commented 1 month ago

I did some reading on this topic and worked out a concept for how to apply GFO to a PyTorch model. As a first step I would like to apply GFO to a simple model (a small CNN or MLP).

misonsky commented 1 month ago

I did some reading on this topic and worked out a concept for how to apply GFO to a PyTorch model. As a first step I would like to apply GFO to a simple model (a small CNN or MLP).

Great! If successful, it could also be applied to more complex models.

SimonBlanke commented 1 month ago

So, I did some tinkering with PyTorch and managed to create a very simple example of how GFO could be used as a custom optimizer. Disclaimer: I don't use PyTorch on a regular basis, so this code is probably of low quality. It is just a showcase.

About the results: Currently this custom optimizer has very bad performance, mainly for two reasons:

  • It is very slow compared to the built-in PyTorch optimizers
  • The initial loss is extremely high

About the speed: This might not matter that much for more complex models, because the difference between 0.01 and 1 second of overhead per training step does not matter if the total training step takes several seconds for a large model. However, it is still something I should look into. Maybe there is a performance bottleneck somewhere in the GFO source code.

About the very high initial loss: I am not sure why this occurs. Maybe it is because GFO selects its initial parameters differently. But the difference between the custom GFO optimizer (initial loss ~1832693817933824) and the SGD optimizer from PyTorch (initial loss ~0.4876) is so large that I suspect a bug in my code. As I mentioned before, I don't use deep-learning frameworks on a regular basis, so I might be making some simple mistakes.

I would appreciate any help or hints about this showcase.

I ran this code with the following versions: python==3.12.4 numpy==1.26.4 torch==2.3.1 gradient-free-optimizers==1.4.1

import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from gradient_free_optimizers import (
    HillClimbingOptimizer,
    RandomSearchOptimizer,
    RepulsingHillClimbingOptimizer,
    PowellsMethod,
)

# Define a synthetic dataset
np.random.seed(42)
X = np.random.rand(1000, 20)
true_weights = np.random.rand(20, 1)
y = X @ true_weights + 0.1 * np.random.randn(1000, 1)

X = torch.Tensor(X)
y = torch.Tensor(y)

# Create a DataLoader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

num_epochs = 100

# Define a more complex neural network
class ComplexModel(nn.Module):
    def __init__(self):
        super(ComplexModel, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.network(x)

# Initialize the model
model = ComplexModel()

# Define a loss function
criterion = nn.MSELoss()

# Define the custom optimizer with GFO
class GFOOptimizer(torch.optim.Optimizer):
    def __init__(self, params, model, dataloader, criterion, lr=1e-3):
        self.model = model
        self.dataloader = dataloader
        self.criterion = criterion
        self.lr = lr

        self.nth_iter = 0

        # Flatten the initial model parameters
        self.params = []
        for param in self.model.parameters():
            self.params.extend(param.data.cpu().numpy().flatten())

        # Define the search space
        self.search_space = {
            f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32)
            for i in range(len(self.params))
        }

        # Initialize the GFO optimizer
        self.optimizer = PowellsMethod(self.search_space)

        self.optimizer.init_search(
            objective_function=self.objective_function,
            n_iter=num_epochs * len(dataloader),
            max_time=None,
            max_score=None,
            early_stopping=None,
            memory=True,
            memory_warm_start=None,
            verbosity=[],
        )

        defaults = dict(lr=lr)
        super(GFOOptimizer, self).__init__(params, defaults)

    def objective_function(self, opt_params):
        opt_params_l = list(opt_params.values())

        # Set model parameters
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data = torch.tensor(
                opt_params_l[start : start + param_length]
            ).view(param.shape)
            start += param_length

        # Compute the loss
        total_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in self.dataloader:
                outputs = self.model(batch_X)
                loss = self.criterion(outputs, batch_y)
                total_loss += loss.item()
        return total_loss / len(self.dataloader)

    def step(self, closure=None):
        if closure is not None:
            closure()

        # Use GFO to find the best parameters
        self.optimizer.search_step(self.nth_iter)
        best_params = self.optimizer.pos_best

        # Set the best parameters to the model
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data.copy_(
                torch.tensor(
                    best_params[start : start + param_length],
                    dtype=torch.float32,
                ).view(param.shape)
            )

            start += param_length

        self.params = best_params
        self.nth_iter += 1

# Initialize the custom optimizer
optimizer = GFOOptimizer(
    model.parameters(), model, dataloader, criterion, lr=0.01
)

# Training loop
for epoch in range(num_epochs):
    for batch_X, batch_y in dataloader:
        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass
        loss.backward()

        # Update the weights
        optimizer.step()

    # Print the loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

misonsky commented 1 month ago

I would appreciate any help or hints about this showcase.

OK. I am very happy to provide help; this may take some time.

SimonBlanke commented 1 month ago

As mentioned before, I have identified two reasons for the bad results when using GFO as a PyTorch optimizer so far. I am currently working on the slow performance of GFO and will continue focusing on this.

Of course the goal is to achieve good performance for very high-dimensional search spaces. This might even require some public API changes, for example when creating the search space:

n_dims = 10000
search_space = {
    f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32) for i in range(n_dims)
}

Creating a search-space this way will get really slow for >1000 dimensions. Each iteration also slows down further.
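
A small aside (an assumption on my part, and it only touches the construction of the dict itself, not the per-iteration cost inside GFO): the value array could be created once and shared across all dimensions, as long as GFO does not modify it in place:

import numpy as np

n_dims = 10_000
values = np.arange(-1.0, 1.0, 0.1, dtype=np.float32)      # created once
search_space = {f"x{i}": values for i in range(n_dims)}   # all keys share the same array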

I will do some extensive performance testing within the next few days and weeks. Making this work will require a lot of effort, but I think it will pay off. GFO has some very powerful algorithms and it would be awesome to see them in action as a PyTorch optimizer!

misonsky commented 1 month ago


Sorry, I've been too busy lately. I can provide a text classification baseline that I haven't adapted to GFO yet. The accuracy of the baseline is around 94%.

misonsky commented 1 month ago
import torch
import time
from torchtext.datasets import AG_NEWS

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader
from torch import nn

from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

train_iter = iter(AG_NEWS(split="train"))

tokenizer = get_tokenizer("basic_english")
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)
train_iter = iter(AG_NEWS(split="train"))
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
train_iter = iter(AG_NEWS(split="train"))
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

portalocker==2.8.2 torch==2.3.1 torchtext==0.18.0

misonsky commented 1 month ago

You can test the performance of GFO based on this baseline. In the future, I may also do some tests to evaluate the feasibility of GFO.

mosheduminer commented 4 weeks ago

Hey 👋, I saw this library on HN a few days ago, and then found this issue.

About the very high initial loss: I am not sure why this occurs.

I tried playing with this a bit, and the most important thing I noticed was the line best_params = self.optimizer.pos_best, where pos_best consists of integers between 0 and 19 (presumably positions/indexes of some sort) instead of floats between -1 and 1. I tried using self.optimizer.best_para instead and got an AttributeError: 'PowellsMethod' object has no attribute 'best_para'.
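
For reference, a minimal sketch of what mapping those positions back to values could look like, assuming each entry of pos_best indexes into the corresponding search-space array in key order (the later example in this thread uses the built-in converter self.optimizer.conv.position2value for this):

# Inside GFOOptimizer.step(), instead of using the raw positions directly:
pos_best = self.optimizer.pos_best
best_values = [
    float(self.search_space[f"x{i}"][int(pos)])
    for i, pos in enumerate(pos_best)
]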

SimonBlanke commented 4 weeks ago

Hello @mosheduminer,

welcome to this project.

pos_best consists of integers between 0 and 19

You are correct! The values in the search space are converted into integer positions at the optimizer level. So there are pos_best, values_best and params_best, which hold the current best variables. I used the name best_para for the overall best parameters of the entire search (a dict with the best values), which is only available after the optimization run ends. That is why the optimizer does not have this attribute at that step.

I was quite oblivious to this mistake. It should have been obvious to me, since I implemented this. Thank you for this contribution! :-)

I uploaded the PyTorch optimizer example to the dev branch (so users do not see this incomplete example): https://github.com/SimonBlanke/Gradient-Free-Optimizers/blob/dev/examples/_pytorch_optimizer.py

If you like you can add your contribution to the example.

edit: I just saw/remembered that the new values are not stored as attributes in the optimizer class: https://github.com/SimonBlanke/Gradient-Free-Optimizers/blob/master/gradient_free_optimizers/results_manager.py#L31

mosheduminer commented 4 weeks ago

If you like you can add your contribution to the example.

Sure! For the moment though, even after normalizing pos_best, the loss is still way too high (only double digits now instead of some huge number), while the expected behavior is that it should swiftly drop below 1.0. So there's more work needed here.

mosheduminer commented 3 weeks ago

@SimonBlanke after further experimenting (with the _pytorch_optimizer.py now), I think another possible issue is that in GFO, higher score (from objective_function) is better, but that score is (in this pytorch example) computed from the loss, where lower is better. So it's actually optimizing for worse loss, instead of better.

Using return 1 / (total_loss / len(self.dataloader)) (instead of return total_loss / len(self.dataloader)) in the objective function is actually somewhat better, but the loss then jumps around instead of mostly getting better.
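
For reference, the more common way to turn the minimization into a maximization is to return the negative mean loss rather than its reciprocal, which is also what the later example in this thread ends up doing. A drop-in change for objective_function in the example above:

# ... compute total_loss over the dataloader as before, then:
return -total_loss / len(self.dataloader)  # negate, since GFO maximizes the returned score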

BTW the example is missing the self.nth_iter (to pass to search_step) that the example in this thread has.

SimonBlanke commented 3 weeks ago

Hello @mosheduminer,

You are correct about the maximization in GFO. I hesitated to update the example because I thought you might want to open a PR. This way your contribution is on record (which is nice to have).

BTW the example is missing the self.nth_iter (to pass to search_step) that the example in this thread has.

Yeah, I accidentally used a dev branch of GFO that does not require nth_iter. I am working on improving the public API for search_step and the other methods on that level. GFO should handle nth_iter internally, and there are other areas that need some improvement, too. But I will open a separate issue for this.

I focused on the performance of different algorithms in the last few days and found similar results. Many hill-climbing based algorithms barely improve the loss. I also saw this jumping-behaviour in the loss.

I went through every algorithm and expected the particle swarm optimizer to perform the best. Because of the attraction behaviour towards better performing particles I thought this could "imitate" something like a gradient in this large search-space. It showed much less of a jumping behaviour, but also failed to converge.

From my perspective the main problem is the extremely high dimensionality of the problem, and for high-dimensional problems a gradient becomes even more valuable. Just compare a gradient-based approach to e.g. pattern search: a single backward pass yields an update direction across all dimensions at once, while pattern search needs on the order of two objective evaluations per dimension just to probe its local neighbourhood.

So even for a small neural network with ~1000 parameters we get a huge search space, which leads to basically no improvement of the loss when using GFO within a reasonable training time.
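
To put a number on this, counting the parameters of the ComplexModel used in the examples above, where each parameter is one search dimension and np.arange(-1.0, 1.0, 0.1) gives 20 grid values per dimension:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)                               # 5569 trainable parameters
print(f"search-space size ~ 20**{n_params}")  # far beyond what any direct search can cover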

At the moment I cannot see even one algorithm in Gradient-Free-Optimizers that would be a good fit for this kind of problem. Sequence-model-based optimizers unfortunately cannot be used because of the large search space.

Nevertheless I find this topic very interesting, but it should be approached from another angle. I see several alternatives for how to proceed.

mosheduminer commented 3 weeks ago

Perhaps it'd be worthwhile to use fewer input dimensions and see if everything works then, to make sure that everything is correct implementation-wise. In that case I'd happily make a PR to track work on that.

I suppose it'd be great if you incorporated GFO knowledge into an ML-specific project 🙂.

mosheduminer commented 3 weeks ago

@SimonBlanke The code below successfully optimizes the neural network to an MSE loss in the range of 0.07-0.1 over 1500 epochs, despite the fact that there are 20 dimensions. I started with 2 dimensions (which required about 100 epochs) and then increased the dimensions and epochs.

That said, an SGD optimizer would get to the same loss in less than 10 epochs (and each epoch is much faster with SGD), so I don't know about the value of this approach; I'm sure you and others with knowledge of DL would know more.

Should I make a PR to the examples directory?


import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
from gradient_free_optimizers import PowellsMethod

# Define a synthetic dataset
np.random.seed(42)
X = np.random.rand(1000, 20)
true_weights = np.random.rand(20, 1)
y = X @ true_weights + 0.1 * np.random.randn(1000, 1)

X = torch.Tensor(X)
y = torch.Tensor(y)

# Create a DataLoader
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

num_epochs = 1500

# Define a more complex neural network
class ComplexModel(nn.Module):
    def __init__(self):
        super(ComplexModel, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(20, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.network(x)

# Initialize the model
model = ComplexModel()

# Define a loss function
criterion = nn.MSELoss()

# Define the custom optimizer with GFO
class GFOOptimizer(torch.optim.Optimizer):
    def __init__(self, params, model, dataloader, criterion, lr=1e-3):
        self.model = model
        self.dataloader = dataloader
        self.criterion = criterion
        self.lr = lr

        self.nth_iter = 0

        # Flatten the initial model parameters
        self.initial_weights = {}
        self.params = []
        counter = 0
        for param in self.model.parameters():
            self.params.extend(param.data.cpu().numpy().flatten())

            for value in param.data.flatten():
                self.initial_weights[f"x{counter}"] = (
                    value.item()
                )  # Convert tensor value to Python scalar
                counter += 1

        # Define the search space
        self.search_space = {
            f"x{i}": np.arange(-1.0, 1.0, 0.1, dtype=np.float32)
            for i in range(len(self.params))
        }

        # Initialize the GFO optimizer
        self.optimizer = PowellsMethod(
            self.search_space, initialize={"warm_start": [self.initial_weights]}
        )

        self.optimizer.init_search(
            objective_function=self.objective_function,
            n_iter=num_epochs * len(dataloader),
            max_time=None,
            max_score=None,
            early_stopping=None,
            memory=True,
            memory_warm_start=None,
            verbosity=[],
        )

        defaults = dict(lr=lr)
        super().__init__(params, defaults)

    def objective_function(self, opt_params):
        opt_params_l = list(opt_params.values())

        # Set model parameters
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            param.data = torch.tensor(opt_params_l[start : start + param_length]).view(
                param.shape
            )
            start += param_length

        # Compute the loss
        total_loss = 0.0
        with torch.no_grad():
            for batch_X, batch_y in self.dataloader:
                outputs = self.model(batch_X)
                loss = self.criterion(outputs, batch_y)
                total_loss += loss.item()
        return -total_loss / len(self.dataloader)

    def step(self, closure=None):
        if closure is not None:
            closure()

        # Use GFO to find the best parameters
        self.optimizer.search_step(self.nth_iter)
        # best_params = self.optimizer.pos_new
        best_params = self.optimizer.conv.position2value(self.optimizer.pos_new)

        # print("self.optimizer.score_new", self.optimizer.score_new)

        # Set the best parameters to the model
        start = 0
        for param in self.model.parameters():
            param_length = param.numel()
            # """
            param.data.copy_(
                torch.tensor(
                    best_params[start : start + param_length],
                    dtype=torch.float32,
                ).view(param.shape)
            )
            # """
            start += param_length

        self.params = best_params
        self.nth_iter += 1

# Initialize the custom optimizer
optimizer = GFOOptimizer(model.parameters(), model, dataloader, criterion, lr=0.01)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Training loop
for epoch in range(num_epochs):
    for batch_X, batch_y in dataloader:
        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward pass
        # loss.backward()

        # Update the weights
        optimizer.step()

    # Print the loss for every epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")

print("Training completed!")

SimonBlanke commented 2 weeks ago

Hello @mosheduminer,

thank you for continuing this experiment. The result is in line with what I would expect from gradient-free optimization algorithms. It would be useful to open a PR. Maybe you could add a small comment to your example that describes the result. This way we have a good starting point if we want to revisit this feature proposal.