DongChen06 / MARL_CAVs

MARL for Autonomous Vehicles

Training data and evaluation data #43

Open zcysun opened 3 months ago

zcysun commented 3 months ago

Hello!

I noticed that the maximum number of training episodes is controlled by MAX_EPISODES, and EVAL_INTERVAL determines the evaluation interval during training; however, the evaluation process seems to determine the number of evaluation episodes from test_seeds:

seeds = [int(s) for s in test_seeds.split(',')]
rewards, (vehicle_speed, vehicle_position), steps, avg_speeds = ma2c.evaluation(env, video_dir, len(seeds), is_train=False)

(from run_ma2c.evaluate)

And in MAA2C, the number of evaluation episodes defaults to 1:

def evaluation(self, env, output_dir, eval_episodes=1, is_train=True):

(from MAA2C.evaluation)

So my question is:

  1. For the graphs in the thesis, does the data come from the periodic evaluations run at intervals during training, or from the separate evaluation process?
  2. What is the data generated by the evaluation process used for?
  3. Is it necessary to set the number of evaluation episodes equal to the number of training episodes (e.g., 20,000) in order to observe the data generated by the evaluation process?
  4. Is the evaluation process only applied to the model at the last checkpoint after training?

Thank you for your answer, and good luck with your research!

DongChen06 commented 3 months ago

Hi, thanks for your interest! EVAL_INTERVAL controls how often we run an evaluation, i.e., one evaluation every EVAL_INTERVAL training episodes; within each evaluation we do several runs (3) to reduce the variance of the evaluation. After training, you can do testing: for instance, we can run the trained algorithm 50 times, which is controlled by test_seeds.
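
A minimal illustrative sketch of how the three settings interact (run_one_training_episode and evaluate below are hypothetical stand-ins for ma2c.explore()/ma2c.train() and ma2c.evaluation(); this is not the repo's actual code):

MAX_EPISODES = 20000     # total number of training episodes
EVAL_INTERVAL = 200      # run an evaluation every EVAL_INTERVAL training episodes (example value)
EVAL_EPISODES = 3        # runs per in-training evaluation, to reduce variance
test_seeds = "0,1,2"     # post-training testing: one run per seed (e.g. 50 seeds)

def run_one_training_episode():
    pass  # stand-in for ma2c.explore() followed by ma2c.train()

def evaluate(n_runs, is_train=True):
    pass  # stand-in for ma2c.evaluation(env, output_dir, n_runs, is_train)

for episode in range(MAX_EPISODES):
    run_one_training_episode()
    if (episode + 1) % EVAL_INTERVAL == 0:
        evaluate(n_runs=EVAL_EPISODES)       # these periodic results form the training curves

# after training: test the saved model once per seed in test_seeds
seeds = [int(s) for s in test_seeds.split(',')]
evaluate(n_runs=len(seeds), is_train=False)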

zcysun commented 3 months ago

Thank you for your answer!

I think I understand now: during training, we evaluate once every EVAL_INTERVAL episodes, and each evaluation runs EVAL_EPISODES episodes to reduce the variance; this is where the data in the paper comes from.

But the point I am still confused about is: you said I can do testing after training, so what is the purpose of testing in general?

I see that the evaluation function in MAA2C is called in the training function, but the evaluate function in run_ma2c doesn't seem to be called anywhere. What is it for?

Good luck with your research!

DongChen06 commented 3 months ago

For the evaluation during training, we can see the training progress. For the testing after training, we can test the final/best model.

Could you show me the code?
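
In other words, the periodic evaluations give you a learning curve, while the post-training test gives one final set of metrics for the saved model. A minimal sketch of reading back the training curve, assuming the eval_rewards.npy file written by np.save in the training loop quoted in the next comment:

import numpy as np
import matplotlib.pyplot as plt

# Learning curve from the periodic in-training evaluations
# (eval_rewards.npy is the array saved with np.save in the training loop).
eval_rewards = np.load('eval_rewards.npy')
plt.plot(eval_rewards)
plt.xlabel('evaluation index (one point every EVAL_INTERVAL episodes)')
plt.ylabel('mean evaluation reward')
plt.title('Training progress')
plt.show()

# Final testing is a separate step: load the saved best/last checkpoint and run
# it once per seed in test_seeds, as evaluate(args) in run_ma2c.py does.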

zcysun commented 3 months ago

Hello! I'm still working through the MAA2C part of the source code you provided, with a few simple modifications. First, run_ma2c.py (the parameter-setting part is omitted). The main change is in the main function: evaluation now runs directly after training (in the original source code you manually choose between evaluating and training).

def train(args): 
    ...
    while ma2c.n_episodes < MAX_EPISODES:
        ma2c.explore()
        if ma2c.n_episodes >= EPISODES_BEFORE_TRAIN:
            ma2c.train()

        if ma2c.episode_done and ((ma2c.n_episodes + 1) % EVAL_INTERVAL == 0):
            rewards, _, _, _ = ma2c.evaluation(env_eval, dirs['train_videos'], EVAL_EPISODES)

            # compute the mean and standard deviation of the evaluation rewards
            rewards_mu, rewards_std = agg_double_list(rewards)
            print("Episode %d, Average Reward %.2f" % (ma2c.n_episodes + 1, rewards_mu))
            episodes.append(ma2c.n_episodes + 1)
            eval_rewards.append(rewards_mu)
            # save the model
            if rewards_mu > best_eval_reward:
                ma2c.save(dirs['models'], 100000)
                ma2c.save(dirs['models'], ma2c.n_episodes + 1)
                best_eval_reward = rewards_mu
            else:
                ma2c.save(dirs['models'], ma2c.n_episodes + 1)

        # save training data
        # eval_rewards: the mean reward returned by each evaluation
        np.save(train_logs + '/{}'.format('eval_rewards'), np.array(eval_rewards))
        # episode_rewards: the reward obtained after each training episode
        np.save(train_logs + '/{}'.format('episode_rewards'), np.array(ma2c.episode_rewards))
        np.save(train_logs + '/{}'.format('epoch_steps'), np.array(ma2c.epoch_steps))
        np.save(train_logs + '/{}'.format('average_speed'), np.array(ma2c.average_speed))

    # save the model
    ma2c.save(dirs['models'], MAX_EPISODES + 2)
    # plot the results
    plot_reward(train_logs)

    return output_dir

def evaluate(args):
    ...
    # load the model if exist
    ma2c.load(model_dir, train_mode=False)
    rewards, (vehicle_speed, vehicle_position), steps, avg_speeds = ma2c.evaluation(env, video_dir, len(seeds), is_train=False)
    rewards_mu, rewards_std = agg_double_list(rewards)
    success_rate = sum(np.array(steps) == 100) / len(steps)
    avg_speeds_mu, avg_speeds_std = agg_double_list(avg_speeds)

    print("Evaluation Reward and std %.2f, %.2f " % (rewards_mu, rewards_std))
    print("Collision Rate %.2f" % (1 - success_rate))
    print("Average Speed and std %.2f , %.2f " % (avg_speeds_mu, avg_speeds_std))

    np.save(eval_logs + '/{}'.format('eval_rewards'), np.array(rewards, dtype=object))
    np.save(eval_logs + '/{}'.format('eval_steps'), np.array(steps))
    np.save(eval_logs + '/{}'.format('eval_avg_speeds'), np.array(avg_speeds))
    np.save(eval_logs + '/{}'.format('vehicle_speed'), np.array(vehicle_speed,dtype=object))
    np.save(eval_logs + '/{}'.format('vehicle_position'), np.array(vehicle_position,dtype=object))

    # visualize the data
    plot_avg_speeds(eval_logs)

if __name__ == "__main__":
    # evaluate directly after training
    args = parse_args()
    # train
    output_dir = train(args)
    # update the model directory argument
    args.model_dir = output_dir
    # evaluate
    evaluate(args)

MAA2C.py:

...
def train(self):
        if self.n_episodes <= self.episodes_before_train:
            pass
        batch = self.memory.sample(self.batch_size)
        states_var = to_tensor_var(batch.states, self.use_cuda).view(-1, self.n_agents, self.state_dim)
        action_masks_var = to_tensor_var(batch.action_masks, self.use_cuda).view(-1, self.n_agents, self.action_dim)
        actions_var = to_tensor_var(batch.actions, self.use_cuda).view(-1, self.n_agents, self.action_dim)
        rewards_var = to_tensor_var(batch.rewards, self.use_cuda).view(-1, self.n_agents, 1)
        whole_states_var = states_var.view(-1, self.n_agents * self.state_dim)

        for agent_id in range(self.n_agents):
            if not self.shared_network:
                # update actor network
                self.actor_optimizers.zero_grad()
                action_log_probs = self.actors(states_var[:, agent_id, :], action_masks_var[:, agent_id, :])
                entropy_loss = th.mean(entropy(th.exp(action_log_probs) + 1e-8))
                action_log_probs = th.sum(action_log_probs * actions_var[:, agent_id, :], 1)

                if self.training_strategy == "concurrent":
                    values = self.critics(states_var[:, agent_id, :])
                elif self.training_strategy == "centralized":
                    values = self.critics(whole_states_var)

                advantages = rewards_var[:, agent_id, :] - values.detach()
                pg_loss = -th.mean(action_log_probs * advantages)
                actor_loss = pg_loss - entropy_loss * self.entropy_reg
                actor_loss.backward()
                if self.max_grad_norm is not None:
                    nn.utils.clip_grad_norm_(self.actors.parameters(), self.max_grad_norm)
                self.actor_optimizers.step()

                # update critic network
                self.critic_optimizers.zero_grad()
                target_values = rewards_var[:, agent_id, :]
                if self.critic_loss == "huber":
                    critic_loss = nn.functional.smooth_l1_loss(values, target_values)
                else:
                    critic_loss = nn.MSELoss()(values, target_values)
                critic_loss.backward()
                if self.max_grad_norm is not None:
                    nn.utils.clip_grad_norm_(self.critics.parameters(), self.max_grad_norm)
                self.critic_optimizers.step()
            else:
                # update actor-critic network
                self.policy_optimizers.zero_grad()
                action_log_probs = self.policy(states_var[:, agent_id, :], action_masks_var[:, agent_id, :])
                entropy_loss = th.mean(entropy(th.exp(action_log_probs) + 1e-8))
                action_log_probs = th.sum(action_log_probs * actions_var[:, agent_id, :], 1)
                values = self.policy(states_var[:, agent_id, :], out_type='v')

                target_values = rewards_var[:, agent_id, :]
                if self.critic_loss == "huber":
                    critic_loss = nn.functional.smooth_l1_loss(values, target_values)
                else:
                    critic_loss = nn.MSELoss()(values, target_values)

                advantages = rewards_var[:, agent_id, :] - values.detach()
                pg_loss = -th.mean(action_log_probs * advantages)
                loss = pg_loss - entropy_loss * self.entropy_reg + critic_loss
                loss.backward()

                if self.max_grad_norm is not None:
                    nn.utils.clip_grad_norm_(self.policy.parameters(), self.max_grad_norm)
                self.policy_optimizers.step()

def evaluation(self, env, output_dir, eval_episodes=1, is_train=True):
        rewards = []
        infos = []
        avg_speeds = []
        steps = []
        vehicle_speed = []
        vehicle_position = []
        video_recorder = None
        seeds = [int(s) for s in self.test_seeds.split(',')]

        for i in range(eval_episodes):
            avg_speed = 0
            step = 0
            rewards_i = []
            infos_i = []
            done = False
            if is_train:
                if self.traffic_density == 1:
                    state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 1)
                elif self.traffic_density == 2:
                    state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 2)
                elif self.traffic_density == 3:
                    state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i], num_CAV=i + 4)
            else:
                state, action_mask = env.reset(is_training=False, testing_seeds=seeds[i])

            n_agents = len(env.controlled_vehicles)
            # rendered_frame = env.render(mode="rgb_array")
            # video_filename = os.path.join(output_dir,
            #                               "testing_episode{}".format(self.n_episodes + 1) + '_{}'.format(i) +
            #                               '.mp4')
            # # Init video recording
            # if video_filename is not None:
            #     print("Recording video to {} ({}x{}x{}@{}fps)".format(video_filename, *rendered_frame.shape,
            #                                                           5))
            #     video_recorder = VideoRecorder(video_filename,
            #                                    frame_size=rendered_frame.shape, fps=5)
            #     video_recorder.add_frame(rendered_frame)
            # else:
            #     video_recorder = None

            while not done:
                step += 1
                action = self.action(state, n_agents, action_mask)
                state, reward, done, info = env.step(action)
                action_mask = info["action_mask"]
                avg_speed += info["average_speed"]
                # rendered_frame = env.render(mode="rgb_array")
                if video_recorder is not None:
                    video_recorder.add_frame(rendered_frame)

                rewards_i.append(reward)
                infos_i.append(info)

            vehicle_speed.append(info["vehicle_speed"])
            vehicle_position.append(info["vehicle_position"])
            rewards.append(rewards_i)
            infos.append(infos_i)
            steps.append(step)
            avg_speeds.append(avg_speed / step)

        if video_recorder is not None:
            video_recorder.release()
        env.close()
        return rewards, (vehicle_speed, vehicle_position), steps, avg_speeds

...

If you need more details, I can also upload the relevant files.

As in the code above, you can see that ma2c.train is called in the train function, but the evaluate() function doesn't seem to be called anywhere, so I'm rather puzzled about the purpose of the evaluate() function in run_ma2c.

Best wishes!

DongChen06 commented 3 months ago

The evaluation function is called in the following lines: [screenshot]

zcysun commented 3 months ago

I'm sorry for the confusion! What I meant was that the evaluate(args) function in run_ma2c.py is not called; the evaluation function in MAA2C is.

DongChen06 commented 3 months ago

You may check these lines: [screenshot]
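
Based on the earlier remark that the original run_ma2c.py lets you manually choose between training and evaluating, the entry point presumably dispatches roughly as in the sketch below (the --option flag name and its values are assumptions, not taken from the repo; train, evaluate, and parse_args are the functions from run_ma2c.py quoted above):

# Hypothetical sketch of the original run_ma2c.py entry point; the exact flag
# name and values may differ in the repo.
if __name__ == "__main__":
    args = parse_args()
    if args.option == 'train':
        train(args)
    elif args.option == 'evaluate':
        evaluate(args)    # evaluate(args) is invoked here when testing a trained model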

zcysun commented 3 months ago

Thank you for pointing that out!

So, the evaluate() function is not that important and is mainly used to test the final/best model, while the primary data is still obtained through the train() function, correct?

Best wishes!

DongChen06 commented 3 months ago

Yes, you are right.

zcysun commented 3 months ago

Thank you for your patience and I wish you a wonderful day! Happy Dragon Boat Festival!