Open zcysun opened 3 months ago
Hi thanks for your interest! The EVAL_INTERVAL means how many episodes we take in one evaluation, but in each evaluation, we need to do several runs to reduce the variances of the evaluation, which is 3. After training, you can do testing, for instance, we can run the trained algorithm 50 times, which is controlled by test_seeds.
Thank you for your answer!
I seem to understand that during training, we evaluate once at intervals of EVAL_INTERVAL eposides and also run EVAL_EPISODES eposides for each evaluation to reduce the variance, which is where the data in the paper comes from;
But the point I am still confused about is, you said I can do testing after training, what is the purpose of testing in general, or what is it for?
I see that the evaluation function in MAA2C is called in the training function, but the evaluate function in run_maa2c doesn't seem to be called, what does it do?
Good luck with your research!
Thank you for your answer!
I seem to understand that during training, we evaluate once at intervals of EVAL_INTERVAL eposides and also run EVAL_EPISODES eposides for each evaluation to reduce the variance, which is where the data in the paper comes from;
But the point I am still confused about is, you said I can do testing after training, what is the purpose of testing in general, or what is it for?
I see that the evaluation function in MAA2C is called in the training function, but the evaluate function in run_maa2c doesn't seem to be called, what does it do?
Good luck with your research!
For the evaluation of the training, we can see the training progress. For the testing after training, we can test the final/best model.
Could you show me the codes?
Hello! I'm still debugging the source code you gave me for the MAA2C part, with a few simple modifications. First is run_ma2c.py, omitted part is the parameter setting part, modification part is mainly in the main function, training directly after the evaluation (in the source code you manually choose to evaluate or training).
def train(args):
...
while ma2c.n_episodes < MAX_EPISODES:
ma2c.explore()
if ma2c.n_episodes >= EPISODES_BEFORE_TRAIN:
ma2c.train()
if ma2c.episode_done and ((ma2c.n_episodes + 1) % EVAL_INTERVAL == 0):
rewards, _, _, _ = ma2c.evaluation(env_eval, dirs['train_videos'], EVAL_EPISODES)
# 求均值和标准差
rewards_mu, rewards_std = agg_double_list(rewards)
print("Episode %d, Average Reward %.2f" % (ma2c.n_episodes + 1, rewards_mu))
episodes.append(ma2c.n_episodes + 1)
eval_rewards.append(rewards_mu)
# save the model
if rewards_mu > best_eval_reward:
ma2c.save(dirs['models'], 100000)
ma2c.save(dirs['models'], ma2c.n_episodes + 1)
best_eval_reward = rewards_mu
else:
ma2c.save(dirs['models'], ma2c.n_episodes + 1)
# save training data
# eval_rewards即为_rewars函数输出的均值
np.save(train_logs + '/{}'.format('eval_rewards'), np.array(eval_rewards))
# episode_rewards:每轮训练后得到的奖励值
np.save(train_logs + '/{}'.format('episode_rewards'), np.array(ma2c.episode_rewards))
np.save(train_logs + '/{}'.format('epoch_steps'), np.array(ma2c.epoch_steps))
np.save(train_logs + '/{}'.format('average_speed'), np.array(ma2c.average_speed))
# save the model
ma2c.save(dirs['models'], MAX_EPISODES + 2)
# 输出结果图
plot_reward(train_logs)
return output_dir
def evaluate(args):
...
# load the model if exist
ma2c.load(model_dir, train_mode=False)
rewards, (vehicle_speed, vehicle_position), steps, avg_speeds = ma2c.evaluation(env, video_dir, len(seeds), is_train=False)
rewards_mu, rewards_std = agg_double_list(rewards)
success_rate = sum(np.array(steps) == 100) / len(steps)
avg_speeds_mu, avg_speeds_std = agg_double_list(avg_speeds)
print("Evaluation Reward and std %.2f, %.2f " % (rewards_mu, rewards_std))
print("Collision Rate %.2f" % (1 - success_rate))
print("Average Speed and std %.2f , %.2f " % (avg_speeds_mu, avg_speeds_std))
np.save(eval_logs + '/{}'.format('eval_rewards'), np.array(rewards, dtype=object))
np.save(eval_logs + '/{}'.format('eval_steps'), np.array(steps))
np.save(eval_logs + '/{}'.format('eval_avg_speeds'), np.array(avg_speeds))
np.save(eval_logs + '/{}'.format('vehicle_speed'), np.array(vehicle_speed,dtype=object))
np.save(eval_logs + '/{}'.format('vehicle_position'), np.array(vehicle_position,dtype=object))
# 数据可视化
plot_avg_speeds(eval_logs)
if __name__ == "__main__":
# 训练后直接评估
args = parse_args()
# train
output_dir = train(args)
# 更新目录参数
args.model_dir = output_dir
# evaluate
evaluate(args)
MAA2C.py:
...
def train(self):
if self.n_episodes <= self.episodes_before_train:
pass
batch = self.memory.sample(self.batch_size)
states_var = to_tensor_var(batch.states, self.use_cuda).view(-1, self.n_agents, self.state_dim)
action_masks_var = to_tensor_var(batch.action_masks, self.use_cuda).view(-1, self.n_agents, self.action_dim)
actions_var = to_tensor_var(batch.actions, self.use_cuda).view(-1, self.n_agents, self.action_dim)
rewards_var = to_tensor_var(batch.rewards, self.use_cuda).view(-1, self.n_agents, 1)
whole_states_var = states_var.view(-1, self.n_agents * self.state_dim)
for agent_id in range(self.n_agents):
if not self.shared_network:
# update actor network
self.actor_optimizers.zero_grad()
action_log_probs = self.actors(states_var[:, agent_id, :], action_masks_var[:, agent_id, :])
entropy_loss = th.mean(entropy(th.exp(action_log_probs) + 1e-8))
action_log_probs = th.sum(action_log_probs * actions_var[:, agent_id, :], 1)
if self.training_strategy == "concurrent":
values = self.critics(states_var[:, agent_id, :])
elif self.training_strategy == "centralized":
values = self.critics(whole_states_var)
advantages = rewards_var[:, agent_id, :] - values.detach()
pg_loss = -th.mean(action_log_probs * advantages)
actor_loss = pg_loss - entropy_loss * self.entropy_reg
actor_loss.backward()
if self.max_grad_norm is not None:
nn.utils.clip_grad_norm_(self.actors.parameters(), self.max_grad_norm)
self.actor_optimizers.step()
# update critic network
self.critic_optimizers.zero_grad()
target_values = rewards_var[:, agent_id, :]
if self.critic_loss == "huber":
critic_loss = nn.functional.smooth_l1_loss(values, target_values)
else:
critic_loss = nn.MSELoss()(values, target_values)
critic_loss.backward()
if self.max_grad_norm is not None:
nn.utils.clip_grad_norm_(self.critics.parameters(), self.max_grad_norm)
self.critic_optimizers.step()
else:
# update actor-critic network
self.policy_optimizers.zero_grad()
action_log_probs = self.policy(states_var[:, agent_id, :], action_masks_var[:, agent_id, :])
entropy_loss = th.mean(entropy(th.exp(action_log_probs) + 1e-8))
action_log_probs = th.sum(action_log_probs * actions_var[:, agent_id, :], 1)
values = self.policy(states_var[:, agent_id, :], out_type='v')
target_values = rewards_var[:, agent_id, :]
if self.critic_loss == "huber":
critic_loss = nn.functional.smooth_l1_loss(values, target_values)
else:
critic_loss = nn.MSELoss()(values, target_values)
advantages = rewards_var[:, agent_id, :] - values.detach()
pg_loss = -th.mean(action_log_probs * advantages)
loss = pg_loss - entropy_loss * self.entropy_reg + critic_loss
loss.backward()
if self.max_grad_norm is not None:
nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
self.policy_optimizers.step()
def evaluation(self, env, output_dir, eval_episodes=1, is_train=True):
rewards = []
infos = []
avg_speeds = []
steps = []
vehicle_speed = []
vehicle_position = []
video_recorder = None
seeds = [int(s) for s in self.test_seeds.split(',')]
for i in range(eval_episodes):
avg_speed = 0
step = 0
rewards_i = []
infos_i = []
done = False
if is_train:
if self.traffic_density == 1:
state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 1)
elif self.traffic_density == 2:
state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 2)
elif self.traffic_density == 3:
state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 4)
else:
state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i])
n_agents = len(env.controlled_vehicles)
# rendered_frame = env.render(mode="rgb_array")
# video_filename = os.path.join(output_dir,
# "testing_episode{}".format(self.n_episodes + 1) + '_{}'.format(i) +
# '.mp4')
# # Init video recording
# if video_filename is not None:
# print("Recording video to {} ({}x{}x{}@{}fps)".format(video_filename, *rendered_frame.shape,
# 5))
# video_recorder = VideoRecorder(video_filename,
# frame_size=rendered_frame.shape, fps=5)
# video_recorder.add_frame(rendered_frame)
# else:
# video_recorder = None
while not done:
step += 1
action = self.action(state, n_agents, action_mask)
state, reward, done, info = env.step(action)
action_mask = info["action_mask"]
avg_speed += info["average_speed"]
# rendered_frame = env.render(mode="rgb_array")
if video_recorder is not None:
video_recorder.add_frame(rendered_frame)
rewards_i.append(reward)
infos_i.append(info)
vehicle_speed.append(info["vehicle_speed"])
vehicle_position.append(info["vehicle_position"])
rewards.append(rewards_i)
infos.append(infos_i)
steps.append(step)
avg_speeds.append(avg_speed / step)
if video_recorder is not None:
video_recorder.release()
env.close()
return rewards, (vehicle_speed, vehicle_position), steps, avg_speeds
...
If you need a file for more details, I can also upload the relevant file.
Just like the above code, you can see that ma2c.train is called in the train function, but you don't see ma2c.evaluate called in the following evaluate(). so I'm rather puzzled as to what is the purpose of the evaluate() function in run_ma2c.
Best wishes!
the evaluation function is called in the following lines:
I'm sorry for the misunderstanding! What I meant was that the evaluate(args) function in run_ma2c.py is not called, the evaluation function in MAA2C is.
You may check these lines:
Thank you for pointing that out!
So, the evaluate() function is actually not that important and is mainly used to test the final/best model. The primary data is still obtained through the Train function, correct?
Best wishes!
Thank you for pointing that out!
So, the evaluate() function is actually not that important and is mainly used to test the final/best model. The primary data is still obtained through the Train function, correct?
Best wishes!
Yes, you are right
Thank you for your patience and I wish you a wonderful day! Happy Dragon Boat Festival!
Hello!
I noticed that the maximum eposides can be controlled by MAX_EPISODES during training, and EVAL_INTERVAL determines the evaluation intervals; however, the evaluation process seems to determine the number of evaluation eposides by test_seeds:
seeds = [int(s) for s in test_seeds.split(',')] rewards, (vehicle_speed, vehicle_position), steps, avg_speeds = ma2c.evaluation(env, video_dir, len(seeds), is_train=False)
(from run_ma2c.evaluate)And in MAA2C, the evaluation eposides are fixed to 1:
def evaluation(self, env, output_dir, eval_episodes=1, is_train=True):
(from MAA2C.evaluation)So my question is:
Thank you for your answer, and good luck with your research!