DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Training of PPO freezes after number of iterations #1886

Closed: Ahmed-Radwan094 closed this issue 3 months ago

Ahmed-Radwan094 commented 3 months ago

🐛 Bug

I built a custom Carla environment and implemented a script to train it with PPO. The training runs without any errors; however, after a number of iterations, typically 50-60k, the code freezes. I verified that the code is stuck and no longer calls the environment's step function.

Code example

import traceback

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
from stable_baselines3.common.logger import configure

# `config`, `n_envs`, `rl_algorithm`, `Carla`, and `CarlaDrivingPolicy`
# are defined elsewhere in the training script.

carla_env = None  # so the except block can safely check it
try:
    carla_env = Carla(config)

    # create a custom feature extractor in stable baselines
    policy_kwargs = dict(
        features_extractor_class=CarlaDrivingPolicy,
        features_extractor_kwargs=dict(config=config),
    )

    model = PPO("MultiInputPolicy", carla_env, policy_kwargs=policy_kwargs,
                **config['RL']['PPO_algo_params'], verbose=1)

    # set up the model logger
    logger_path = config['RL']['logger_path']
    logger_object = configure(logger_path, ["stdout", "csv", "tensorboard"])
    model.set_logger(logger_object)

    # define the model path
    model_path = config['RL']['model_path']
    # set up a checkpoint callback to save the model at a certain frequency
    checkpoint_callback = CheckpointCallback(
        save_freq=config['RL']['checkpoint_save_freq'] // n_envs,
        save_path=model_path,
        name_prefix=rl_algorithm + "_carla"
    )

    # train the agent
    model.learn(**config['RL']['train_params'], log_interval=10, callback=checkpoint_callback)
    # close the Carla environment
    print("Training complete")
    carla_env.close()
# on any exception, close the environment and print the traceback
except Exception:
    if carla_env:
        carla_env.close()
    traceback.print_exc()

Relevant log output / Error message

No response

System Info

Checklist

qgallouedec commented 3 months ago

Hey, have you checked your env with the env checker? Can you share the logs? What do you mean by freeze?

Ahmed-Radwan094 commented 3 months ago

Hey, thank you for the quick reply. I checked the env with the env checker and only received one warning, about the action type casting:

/home/ahmed/miniconda3/envs/baselines_env/lib/python3.8/site-packages/gymnasium/spaces/box.py:130: UserWarning: WARN: Box bound precision lowered by casting to float32
  gym.logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
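For context, a minimal self-contained sketch of what running the env checker looks like; DummyDrivingEnv below is a stand-in for the custom Carla environment, not code from this issue:

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.env_checker import check_env


class DummyDrivingEnv(gym.Env):
    """Placeholder environment standing in for the custom Carla env."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        # obs, reward, terminated, truncated, info
        return obs, 0.0, False, False, {}


# check_env validates the observation/action spaces and the reset/step API;
# with warn=True it prints warnings like the Box precision notice quoted above.
check_env(DummyDrivingEnv(), warn=True)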

Ahmed-Radwan094 commented 3 months ago

By freeze, I mean the learn function has been stuck in a single step for more than an hour. There are no errors and the process is not killed, and I verified that Carla is alive and can be pinged.

qgallouedec commented 3 months ago

Have you tried to use your debugger and pause the process to see which line is involved?

Ahmed-Radwan094 commented 3 months ago

No, I didn't. The problem is that this happens after a large number of iterations, around 50-60k (random between runs), and I am not sure if it would be feasible to debug. I have verified that the step function is called, then suddenly it stops being called and no new commands are received. Is there a way to log information about the current function being called in model.learn(...)?
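One way to get that kind of visibility is a lightweight custom callback; the sketch below (HeartbeatCallback is an illustrative name, not an SB3 class) prints the last timestep reached at a fixed wall-clock interval, so a silent hang can be localised to the final heartbeat:

import time

from stable_baselines3.common.callbacks import BaseCallback


class HeartbeatCallback(BaseCallback):
    """Print the current timestep at a fixed wall-clock interval."""

    def __init__(self, every_seconds: float = 60.0, verbose: int = 0):
        super().__init__(verbose)
        self.every_seconds = every_seconds
        self._last_print = time.monotonic()

    def _on_step(self) -> bool:
        now = time.monotonic()
        if now - self._last_print >= self.every_seconds:
            print(f"[heartbeat] num_timesteps={self.num_timesteps}", flush=True)
            self._last_print = now
        return True  # returning False would abort training

It can be passed alongside the checkpoint callback, e.g. model.learn(..., callback=[checkpoint_callback, HeartbeatCallback(30)]); if the heartbeats stop while the process is still alive, the hang is somewhere between two environment steps.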

qgallouedec commented 3 months ago

I don't know which debugger you use, but you can usually pause manually whenever you want. As this is a custom environment, your best bet is to reduce your code as much as possible to converge on an MRE (minimal reproducible example).
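When attaching an interactive debugger is impractical, a rough alternative (a standard-library sketch, not something suggested in the thread) is to have Python periodically dump the stack of every thread, which shows which line learn() or the environment is blocked on:

import faulthandler
import sys

# Dump a traceback of all threads to stderr every 10 minutes until cancelled.
# If training freezes, the last dumps show exactly where execution is stuck.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)

# ... model.learn(...) runs here ...

faulthandler.cancel_dump_traceback_later()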

Ahmed-Radwan094 commented 3 months ago

I will try that and update the ticket. Thank you for the support.

Ahmed-Radwan094 commented 3 months ago

There was an exception in the environment itself, and it is now fixed.