ai4co / rl4co

A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
https://rl4.co
MIT License

Error while running EAS #218

Open ujjwaldasari10 opened 2 days ago

ujjwaldasari10 commented 2 days ago

Describe the bug

I am not able to train the AM model with EAS by following the example given here: https://rl4.co/examples/modeling/2-transductive-methods/#perform-search.

To Reproduce

import torch

# Move the model to the device before initializing the trainer
policy = policy.to(device)

trainer = RL4COTrainer(
    max_epochs=1,
    gradient_clip_val=None,
    strategy='ddp_notebook'
)
trainer.fit(eas_model)

RuntimeError                              Traceback (most recent call last)
Cell In[9], line 11
      4 policy = policy.to(device)
      6 trainer = RL4COTrainer(
      7     max_epochs=1,
      8     gradient_clip_val=None,
      9     strategy='ddp_notebook'
     10 )
---> 11 trainer.fit(eas_model)

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/rl4co/utils/trainer.py:146, in RL4COTrainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    141     log.warning(
    142         "Overriding gradient_clip_val to None for 'automatic_optimization=False' models"
    143     )
    144     self.gradient_clip_val = None
--> 146 super().fit(
    147     model=model,
    148     train_dataloaders=train_dataloaders,
    149     val_dataloaders=val_dataloaders,
    150     datamodule=datamodule,
    151     ckpt_path=ckpt_path,
    152 )

File ~/miniconda3/envs/rl4co/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
...
    206 if _IS_INTERACTIVE:
    207     message += " You will have to restart the Python kernel."
--> 208 raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.

Could you please help me figure out the issue?

fedebotu commented 2 days ago

Hi @ujjwaldasari10, that can happen if the model (or part of it) was already moved to the device before the trainer was created. Here are some ideas:

  1. First, can you try removing the call to policy.to(device)?
  2. I recommend training the model and collecting the checkpoint in a separate notebook / script, then loading it as done here
  3. There might be an issue with ddp_notebook. I recommend trying without it and setting devices=1 instead
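
A minimal sketch of how ideas 1 and 3 could fit together, assuming the trainer accepts the same keyword arguments used in the report plus devices. The helpers pick_trainer_kwargs and cuda_already_initialized are hypothetical names made up for illustration; they are not part of RL4CO or Lightning. The only library call used is torch.cuda.is_initialized(), and the trainer usage is left commented out because eas_model is defined elsewhere in the notebook.

```python
def cuda_already_initialized() -> bool:
    """Report whether this process has already touched CUDA (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Without torch there is no CUDA context to worry about.
        return False
    return torch.cuda.is_initialized()


def pick_trainer_kwargs(cuda_initialized: bool, want_multi_gpu: bool = False) -> dict:
    """Choose trainer arguments that avoid forking after CUDA init (hypothetical helper)."""
    kwargs = {"max_epochs": 1, "gradient_clip_val": None}
    if want_multi_gpu and not cuda_initialized:
        # Fork-based notebook DDP is only an option while CUDA is still untouched.
        kwargs["strategy"] = "ddp_notebook"
    else:
        # Single-device fallback, as suggested in idea 3.
        kwargs["devices"] = 1
    return kwargs


kwargs = pick_trainer_kwargs(cuda_already_initialized())
print(kwargs)

# Assumed usage in the notebook; note there is no policy.to(device) call beforehand:
# from rl4co.utils.trainer import RL4COTrainer
# trainer = RL4COTrainer(**kwargs)
# trainer.fit(eas_model)
```

If CUDA was already initialized earlier in the session, restarting the kernel is still needed before ddp_notebook can work, as the error message says.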