
Test ENV #8

Open jieWANGforwork opened 1 month ago

jieWANGforwork commented 1 month ago

Hi, thanks for your great contribution. May I ask several questions about how you evaluate the policies to get the results in Table 4?

  1. Do you use a test env that is different from the train env to evaluate the rewards? I cannot find any clear illustration of this part. If so, how do you establish the test env, and where can I find the code?
  2. You said 'For the evaluation process, after each epoch, the policy undergoes evaluation using 100 episodes (i.e., interaction trajectories)'. Where do the 100 episodes come from? Are they drawn randomly from the replay buffer, or do they start from a random user and generate new sequences for evaluation? If neither, could you provide the details?

Thank you so much; I really look forward to your answers.

Best

yuyq18 commented 1 month ago

Hi! Thanks for your attention to our work.

Here are my answers:

  1. To avoid data leakage, we build the training and test environments separately from the training set and the test set of the same dataset, both following the same construction method. The only difference is that the test environments are wrapped in a DummyVectorEnv class to test 100 episodes simultaneously. The corresponding code can be found in examples/policy/run_x.py (x can be any policy, e.g., A2C) and in the prepare_test_envs function in examples/policy/policy_utils.py:
    # code in examples/policy/run_x.py
    # %% 2. Prepare user model and environment
    ensemble_models = prepare_user_model(args)
    env, dataset, kwargs_um = get_true_env(args)
    train_envs = prepare_train_envs(args, ensemble_models, env, dataset, kwargs_um)
    test_envs_dict = prepare_test_envs(args, env, kwargs_um)

    # code in examples/policy/policy_utils.py
    # build args.test_num (e.g., 100) parallel copies of the test environment;
    # env_task_class refers to the env class of the specific dataset, e.g., Coat
    test_envs = DummyVectorEnv([lambda: env_task_class(**kwargs_um) for _ in range(args.test_num)])

As for the environments for different datasets, you can find the corresponding code in core/envs and core/util/data.py.
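
For intuition, here is a minimal sketch of what such a dataset-specific environment might look like. The class name RecTestEnv, the reward-matrix lookup, and the fixed-length termination rule are illustrative assumptions for this sketch, not the actual code in core/envs:

    import numpy as np
    import gym

    class RecTestEnv(gym.Env):
        """Hypothetical sketch of a dataset-specific env, not the actual core/envs code."""

        def __init__(self, reward_matrix, max_turn=30):
            self.reward_matrix = reward_matrix  # assumed precomputed user-by-item reward table
            self.max_turn = max_turn            # assumed fixed episode length
            num_users, num_items = reward_matrix.shape
            self.observation_space = gym.spaces.Discrete(num_users)
            self.action_space = gym.spaces.Discrete(num_items)

        def reset(self):
            # each episode starts from a randomly drawn user (old-style gym API)
            self.cur_user = np.random.randint(self.reward_matrix.shape[0])
            self.turn = 0
            return self.cur_user

        def step(self, action):
            # simulated feedback for recommending item `action` to the current user
            reward = self.reward_matrix[self.cur_user, action]
            self.turn += 1
            done = self.turn >= self.max_turn
            return self.cur_user, reward, done, {}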

  2. As mentioned above, we evaluate the model in 100 test environments in parallel. In each episode, the user is randomly selected and begins a new interaction sequence, so the evaluation episodes are generated afresh rather than sampled from the replay buffer (see the sketch after the snippet below). You can find the code in core/envs/BaseEnv.py:

    # the initial user and item are generated randomly
    self.cur_user = self.__user_generator()
    self.action = self.__item_generator()
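
To make the evaluation flow concrete, here is a minimal sketch of how the 100 fresh episodes could be collected with Tianshou's Collector, assuming policy and test_envs are prepared as above (the exact statistics returned by collect differ across Tianshou versions):

    from tianshou.data import Collector

    # each of the 100 vectorized test envs resets to a randomly drawn user,
    # so the collected trajectories are newly generated interaction sequences,
    # not samples from the replay buffer
    test_collector = Collector(policy, test_envs)
    result = test_collector.collect(n_episode=100)  # one episode per test env
    print(result["rews"].mean())                    # mean reward over the 100 episodes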

Hope my answers can help you!