RealVNF / distributed-drl-coordination

Distributed Online Service Coordination Using Deep Reinforcement Learning

Questions about parameter settings and code #8

Closed daiwenlong23 closed 1 week ago

daiwenlong23 commented 1 week ago

Hello Stefan! We are happy to have found your interesting work and are conducting further experiments based on your project. I ran into some problems during the implementation, and although I read your paper carefully, I couldn't find the answers:

  1. In the action space, the number i refers to the node's i-th neighbor, but how is it determined which node is the i-th neighbor? Are the neighbors sorted directly by node ID, or by resource capacity or distance?
  2. In the agent's setup YAML file, n_steps is set to 20. Does n_steps have any special meaning, or is it just used for ease of calculation and storage? My understanding is that within these 20 steps some traffic may or may not have been processed yet; in the next 20 steps, is the traffic reset and re-injected into the network through the ingress?
  3. I don't really understand "mus", "dones", and "masks" in the state space definition. Are these necessary, or can I replace them with other variables?
stefanbschneider commented 1 week ago

Hi @daiwenlong23 , I'm happy to hear you are looking into our work. It has been some years since I worked on this, but I will try to answer all your questions.

In the action space, the number i refers to the node's i-th neighbor, but how is it determined which node is the i-th neighbor? Are the neighbors sorted directly by node ID, or by resource capacity or distance?

We ordered the actions based on the node IDs. I think another order would also work. What's important is that (a) the order is stable and (b) the same stable order is used for ordering the node-based observations (e.g., node utilization). That way, the agents should learn to connect the observed node utilization with the corresponding action.
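
For illustration, a minimal sketch (hypothetical graph and function names, not the repo's actual code) of such a stable, ID-sorted neighbor order that keeps actions and node-based observations aligned:

```python
import networkx as nx

# Hypothetical example graph; node IDs are integers.
graph = nx.Graph([(0, 1), (0, 3), (0, 2), (2, 3)])

def ordered_neighbors(graph, node):
    """Return the node's neighbors in a stable order (sorted by node ID)."""
    return sorted(graph.neighbors(node))

# Action i then refers to ordered_neighbors(graph, current_node)[i], and the
# node-based observations (e.g. utilization) are listed in the same order.
neighbors = ordered_neighbors(graph, 0)    # [1, 2, 3]
utilization = {1: 0.2, 2: 0.7, 3: 0.1}     # hypothetical values
obs = [utilization[n] for n in neighbors]  # observation aligned with actions
```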

In the agent's setup YAML file, n_steps is set to 20. Does n_steps have any special meaning, or is it just used for ease of calculation and storage? My understanding is that within these 20 steps some traffic may or may not have been processed yet; in the next 20 steps, is the traffic reset and re-injected into the network through the ingress?

n_steps is just a hyper-parameter for training:

```yaml
# The number of steps to run for each environment per update
# (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
n_steps: 20
```

So it only affects how many environment steps are collected before they are batched together for training. It does not affect the simulated traffic, only the training. You can play around with different values to see if/how they influence training speed and stability.
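
For illustration only (the number of parallel environments below is assumed, not taken from the repo's config), the effective batch size per update works out as:

```python
# Hypothetical values for illustration only.
n_steps = 20  # steps collected per environment before each policy update
n_envs = 4    # number of parallel environment copies (assumed)

batch_size = n_steps * n_envs  # 80 transitions per update
print(batch_size)
```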

I don't really understand "mus", "dones", and "masks" in the state space definition. Are these necessary, or can I replace them with other variables?

I'm not sure what exactly you are referring to. Could you share a link to the code?

daiwenlong23 commented 1 week ago

Oh! Thank you very much for your reply; the first two questions are clear to me now. Regarding the third question: in my experience buffer, experiences are stored as (obs, actions, rewards, mus, dones, masks). The first three represent the state, the action, and the corresponding reward, but I don't quite understand the purpose of the last three. Also, may I ask how the rewards are calculated and stored in the episode_reward.csv file in your code? If I set training_duration=20000, my episode_reward.csv file saves about 50 [episode, reward] entries, but I couldn't find the function that specifically calculates and saves the reward. Thank you for your time; I look forward to your response.

daiwenlong23 commented 1 week ago

I'm not sure what exactly you are referring to. Could you share a link to the code?

Like this: obs, actions, rewards, mus, dones, masks = buffer.get()

stefanbschneider commented 1 week ago

Hm, I'm not sure if I can help much regarding these questions.

obs, actions, rewards, mus, dones, masks = buffer.get()

This seems to be from some other repository. I think there are different implementations of how experiences are saved in the buffer. obs, actions, and rewards are always there. dones indicates whether the episode is completed. But I don't know what mus and masks mean here.
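
For reference, a minimal sketch of a generic rollout buffer that stores only the fields common to most implementations (obs, actions, rewards, dones); mus and masks would be additional fields specific to the buffer implementation in question:

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

@dataclass
class RolloutBuffer:
    """Minimal sketch of a rollout buffer; real implementations (e.g. the one
    behind buffer.get() above) may store extra fields such as mus and masks."""
    obs: List[Any] = field(default_factory=list)
    actions: List[Any] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    dones: List[bool] = field(default_factory=list)

    def add(self, ob, action, reward, done):
        self.obs.append(ob)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)  # True if the episode ended at this step

    def get(self) -> Tuple[list, list, list, list]:
        return self.obs, self.actions, self.rewards, self.dones
```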

Also, can I ask how the rewards are calculated and stored in the episode_reward.csv file in your code?

I don't remember, but searching through the code, it seems like we write these rewards here: https://github.com/RealVNF/distributed-drl-coordination/blob/f4ddf2eeefcde0bd4445b1e2bd1133c83611bfdf/src/spr_rl/agent/acktr_agent.py#L117

As I understand, we first train the agent and then test the trained agent for one episode. In this testing episode, the reward is written for each step. What did you set as testing duration in the config?

But I'm not entirely sure to be honest.
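
For illustration only, the per-step logging described above might look roughly like this (hypothetical function and variable names, not the repo's actual code):

```python
import csv

# Hypothetical sketch: run one test episode with a trained agent and write the
# reward of every step to a CSV file.
def test_and_log(env, model, csv_path="episode_reward.csv"):
    obs = env.reset()
    done = False
    step = 0
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["step", "reward"])
        while not done:
            action, _ = model.predict(obs)              # stable-baselines-style API
            obs, reward, done, info = env.step(action)  # classic gym step signature
            writer.writerow([step, reward])
            step += 1
```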

daiwenlong23 commented 1 week ago

Hm, I'm not sure if I can help much regarding these questions.

obs, actions, rewards, mus, dones, masks = buffer.get()

This seems to be from some other repository. I think there are different implementations of how experiences are saved in the buffer. obs, actions, and rewards are always there. dones indicates whether the episode is completed. But I don't know what mus and masks mean here.

I see, thank you.

Also, can I ask how the rewards are calculated and stored in the episode_reward.csv file in your code?

I don't remember, but searching through the code, it seems like we write these rewards here:

https://github.com/RealVNF/distributed-drl-coordination/blob/f4ddf2eeefcde0bd4445b1e2bd1133c83611bfdf/src/spr_rl/agent/acktr_agent.py#L117

As I understand, we first train the agent and then test the trained agent for one episode. In this testing episode, the reward is written for each step. What did you set as testing duration in the config?

Thank you very much for your patience. I will think carefully about the reward function calculation and will follow up with you if you'd like. I have another crucial question: in your paper, you mention that if the current node selects a neighbor that does not exist, it gets penalized. In your code, what is the next state the environment transitions to when a non-existing neighbor is selected as the action? In other words, to which node does the flow move? My guess is that it stays at the original node, but I'm not sure if that's right.

stefanbschneider commented 1 week ago

Yes, it stays at the same node if it selects a non-existing neighbor.
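
For illustration, a minimal sketch of that behavior inside an environment step (assumed function names and penalty value, not the repo's actual code):

```python
# Hypothetical sketch of the described behavior: selecting a non-existing
# neighbor keeps the flow at the current node and yields a penalty.
INVALID_ACTION_PENALTY = -1.0  # assumed penalty value, not from the repo

def apply_movement_action(graph, current_node, action):
    neighbors = sorted(graph.neighbors(current_node))  # stable, ID-sorted order
    if action < len(neighbors):
        return neighbors[action], 0.0                  # move to selected neighbor
    return current_node, INVALID_ACTION_PENALTY        # stay put, get penalized
```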

Happy to hear that I could help you. I'm closing this issue for now if it's ok. Feel free to reopen if you have follow-up questions.

If you publish anything related, I would be thankful for a citation/reference of my paper :) https://github.com/RealVNF/distributed-drl-coordination?tab=readme-ov-file#citation

daiwenlong23 commented 1 week ago

Yes, it stays at the same node if it selects a non-existing neighbor.

OK, I understand.

Happy to hear that I could help you. I'm closing this issue for now if it's ok. Feel free to reopen if you have follow-up questions.

OK, thank you very much for your help.

If you publish anything related, I would be thankful for a citation/reference of my paper :) https://github.com/RealVNF/distributed-drl-coordination?tab=readme-ov-file#citation

Of course, I will cite your excellent paper in my research work. Thanks again.