DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License
1.89k stars · 494 forks

[Question] The trained agent resets every 1000 episodes. #451

Closed · SKYLEO98 closed 1 month ago

SKYLEO98 commented 1 month ago

❓ Question

I trained my custom env with the TQC algorithm. However, the trained agent keeps resetting every 1000 episodes even though I did not set a number of episodes for a reset. Is this related to a hyperparameter or a hidden feature that I could tune?


SKYLEO98 commented 1 month ago

Hello, I could not find how to modify the default 1000 episode length when we use the "enjoy" command to test a trained agent.

```
python3 enjoy.py --algo tqc --env gym_hexapod_zoo-v0 -f logs/ --exp-id 5 --load-best -n 5000000
```

Hyperparameters:

```yaml
gym_hexapod_zoo-v0:
  n_timesteps: !!float 2e6
  policy: 'MlpPolicy'
  learning_rate: !!float 3e-4
  buffer_size: 100000
  batch_size: 256
  ent_coef: 'auto'
  train_freq: 1
  gradient_steps: 1
  learning_starts: 10000
```

The algorithm is TQC.

The case study is to train a hexapod robot to follow a path, for instance an eight-shaped path. The path is generated as follows:

```python
self.time = np.linspace(0, 2 * np.pi, 1000)

def _generate_eight_shape_path(self, time):
    # Define the scale factors for the x and y coordinates
    scale_x = 4  # Scale factor for x-coordinate
    scale_y = 6  # Scale factor for y-coordinate

    # Define x and y coordinates for the 8-shaped path with scaled dimensions
    x = scale_x * np.sin(time)
    y = scale_y * np.sin(time) * np.cos(time)
    return x, y
```

Basically, I ignore the time cost, since the robot can reach the consecutive goals while self.time has length 1000. After training, the trained agent performs the path-tracking task to control the hexapod to follow that path. One episode is determined by the length of self.time multiplied by the maximum number of steps allowed to reach a sub-goal in world Cartesian coordinates (10000 steps per episode). A time flag counts up to shift to the next goal until it reaches 1000, at which point it is reset. However, the enjoy command always resets once it reaches a 1000 episode length. I cannot figure out this bug, since the actual run does not even complete one episode.
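For reference, the path generator above can be checked standalone (a minimal sketch using only NumPy; the class context and method name are adapted from the snippet above):

```python
import numpy as np

def generate_eight_shape_path(time, scale_x=4, scale_y=6):
    # Lissajous-style figure eight: x traces one sine period while
    # y = scale_y * sin(t) * cos(t) = (scale_y / 2) * sin(2t) traces two
    x = scale_x * np.sin(time)
    y = scale_y * np.sin(time) * np.cos(time)
    return x, y

time = np.linspace(0, 2 * np.pi, 1000)
x, y = generate_eight_shape_path(time)

# The path is closed: start and end points coincide (sin(0) = sin(2*pi) = 0)
assert np.isclose(x[0], x[-1]) and np.isclose(y[0], y[-1])
# The x extent matches scale_x; y peaks at scale_y / 2 because of the sin*cos product
assert np.isclose(x.max(), 4, atol=1e-3) and np.isclose(y.max(), 3, atol=1e-3)
```

Note that the 1000 waypoints here are unrelated to the 1000-step reset seen in enjoy; the coincidence of the two numbers is what makes the symptom confusing.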

How can I modify this parameter, or track how it counted 1000 episodes?

Many thanks in advance.

araffin commented 1 month ago

You mean 1000 steps? You probably defined `max_episode_steps` when registering your env.

SKYLEO98 commented 1 month ago

That's exactly the issue. I had almost forgotten about this configuration. Thanks a lot for the correction.

```python
from gymnasium.envs.registration import register

register(
    id="gym_hexapod_zoo-v0",
    entry_point="gym_hexapod_zoo.envs:gym_hexapod_zoo",
    max_episode_steps=1000,
)
```