ARM-software / mango

Parallel Hyperparameter Tuning in Python
Apache License 2.0

ValueError when using mango #56

Closed: ahmedramly closed this issue 2 years ago

ahmedramly commented 2 years ago

Hi,

I am trying to use mango for the first time to fine-tune the hyperparameters of a simple CNN model. I chose to start with these simple hyperparameters:

HYPERPARAMETERS = {
    'groups': [64, 32, 16, 8],
    'in_channels': [64],
    'kernel_size': [5, 10, 15, 20],
    'lr': [0.01, 0.1, 0.001],
    'n_classes': [4],
}
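
As a side note, Mango's parameter space also accepts scipy.stats distributions alongside plain lists, so a continuous learning-rate range could be sketched like this (the `loguniform` bounds are illustrative, and scipy must be available):

```python
# Sketch: the same search space, but with 'lr' drawn from a continuous
# log-uniform distribution instead of a fixed list (bounds are illustrative).
from scipy.stats import loguniform

HYPERPARAMETERS = {
    'groups': [64, 32, 16, 8],
    'in_channels': [64],
    'kernel_size': [5, 10, 15, 20],
    'lr': loguniform(1e-3, 1e-1),  # sampled continuously by the tuner
    'n_classes': [4],
}

# Drawing a few samples shows values stay inside the requested range.
samples = HYPERPARAMETERS['lr'].rvs(size=5)
print(all(1e-3 <= s <= 1e-1 for s in samples))
```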

And I defined the objective function as follows:

import pytorch_lightning as pl
from torch.utils.data import DataLoader

def objective(hyperparams):
  trainset = EEGDataset(path, train_idx)#, transform=GraphFilter(filters, idx))
  testset = EEGDataset(path, test_idx)#, transform=GraphFilter(filters, idx))
  clear_output()
  train_loader = DataLoader(trainset, batch_size=32, shuffle=True)
  test_loader = DataLoader(testset, batch_size=32)

  model = BaselineModel(hyperparams)
  trainer = pl.Trainer(max_epochs=1)

  trainer.fit(model, train_loader, test_loader)
  value = trainer.callback_metrics['train_loss'].item()
  return [value]

And then I started the optimization as follows:


from mango import Tuner

print("Running hyperparameter search...")
config = dict()
config["optimizer"] = "Bayesian"
config["num_iteration"] = 5

tuner = Tuner(HYPERPARAMETERS, 
              objective=objective,
              conf_dict=config) 
results = tuner.minimize()
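
For context on the contract the `Tuner` expects, here is a toy, self-contained sketch (no mango or training involved; `toy_objective` and the stand-in loss are hypothetical) showing that the objective is called with a list of parameter dicts and must return one value per dict:

```python
# Sketch of the list-in/list-out contract: the objective receives a *batch*
# of parameter dicts per iteration and must return one value per dict.
# toy_objective and its loss formula are illustrative, not from the issue.
def toy_objective(params_batch):
    results = []
    for params in params_batch:
        # stand-in for CNN training: a simple function of the hyperparameters
        loss = (params['lr'] - 0.01) ** 2 + params['kernel_size'] / 100
        results.append(loss)
    return results  # len(results) must equal len(params_batch)

batch = [{'lr': 0.01, 'kernel_size': 5}, {'lr': 0.1, 'kernel_size': 10}]
values = toy_objective(batch)
print(len(values) == len(batch))  # the invariant violated in the error below
```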

I tried 5 trials to check that everything is working before I start the real optimization. It worked at the beginning, then it gave me this error:

ValueError: Found input variables with inconsistent numbers of samples: [7, 6]

I don’t understand where this error comes from. It would be very nice if you could help me with this issue.

sandeep-iitr commented 2 years ago

Hi, thanks for your question.

What this error means is that Mango tried to train the CNN model 7 times, but it succeeded only 6 times. By default, Mango runs 2 random initial evaluations plus the 5 iterations from your config["num_iteration"] setting. Training can fail for any number of reasons; you can wrap the training in a try/except in Python to see whether an exception is raised during CNN training.

The solution to this problem is very simple: the objective function signature can be modified to return only the successful evaluations. You can look at the example below, where we introduce random failures 30% of the time and return only the successful parameters and their respective values.

https://github.com/ARM-software/mango/blob/master/examples/Failure_Handling.ipynb

Basically, the objective function now returns two lists. In the above example they are:

hyper_evaluated, objective_evaluated

In cases where you are not able to train a particular CNN, you can instead assign a high loss value to those failures.
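
That high-loss fallback could be sketched like this (pure Python, with a simulated failure; `train_once`, the 30% failure rate, and the penalty value are illustrative):

```python
import random

PENALTY = 1e6  # illustrative high loss assigned to failed trainings

def train_once(params):
    # stand-in for CNN training that fails ~30% of the time
    if random.random() < 0.3:
        raise RuntimeError("simulated training failure")
    return params['lr'] * 2  # stand-in loss value

def objective(params_batch):
    values = []
    for params in params_batch:
        try:
            values.append(train_once(params))
        except Exception:
            values.append(PENALTY)  # keep input/output lists the same length
    return values

random.seed(0)
batch = [{'lr': 0.01}] * 10
values = objective(batch)
print(len(values) == len(batch))  # every trial yields a value, even failures
```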

ahmedramly commented 2 years ago

Thanks for your very quick and informative response, I really appreciate it. I added a try/except to the function and it works very well now.

def objective(hyperparams):
  trainset = EEGDataset(...)
  testset = EEGDataset(...)
  clear_output()
  train_loader = DataLoader(trainset, batch_size=32, shuffle=True)
  test_loader = DataLoader(testset, batch_size=32)

  hyper_eval = []
  obj_eval = []
  for param in hyperparams:
    try:
      model = BaselineModel(param)
      trainer = pl.Trainer(max_epochs=1, accelerator='auto')

      trainer.fit(model, train_loader, test_loader)
      value = trainer.callback_metrics['train_loss'].item()
      obj_eval.append(value)
      hyper_eval.append(param)
    except Exception as e:
      print(f'Failed evaluation: {e}')
      continue
  return hyper_eval, obj_eval

If you find something wrong with this implementation, please let me know. Thanks a lot

sandeep-iitr commented 2 years ago

Your implementation looks fine to me. If you face any more issues, please feel free to reopen this issue or create a new one.

ahmedramly commented 2 years ago

Thanks a lot, for sure.