baturaysaglam / RIS-MISO-Deep-Reinforcement-Learning

Joint Transmit Beamforming and Phase Shifts Design with Deep Reinforcement Learning
MIT License

On the problem of drawing the reward vs. steps diagram #3

Closed Hawkingfans closed 2 years ago

Hawkingfans commented 2 years ago

I'm having some problems running a simulation with your code. For example, I want to reproduce Figure 7 in the paper. The size of the trained instant rewards is (num episodes, num steps per eps), and I understand that each row should represent the rewards for that episode.

But when I loaded your file from GitHub, I found that the size is (1, num steps per eps). I would like to ask whether you took only the first episode for the plot? Otherwise, I found that the reward hardly increased until the last episode.

Sorry to bother you; I'd be very grateful if you could help me out!

baturaysaglam commented 2 years ago

yes, you are right. the size of the trained instant rewards is (# of episodes, # of steps per episode). I've just updated the repository but missed correcting the size of the instant rewards in the reproduction module. when you load the file, the size is indeed (1, # of steps per episode), which is averaged over the trained episodes. sorry for not pointing this out.

in conclusion, if you want to train the model from scratch, the rewards are stored as (# of episodes, # of steps per episode). if you want to reproduce the results reported in the paper, you get (1, # of steps per episode), which is the average over the trained episodes.
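
for illustration, a minimal sketch of how one might load and inspect the saved rewards; the file name `instant_rewards.npy` is an assumption, not necessarily what the repo writes:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file name -- point this at wherever the repository stores the instant rewards.
rewards = np.load("instant_rewards.npy")

# Trained from scratch: shape == (num_episodes, num_steps_per_episode)
# Reproduction module:  shape == (1, num_steps_per_episode), already averaged over episodes
print(rewards.shape)

# Collapsing over axis 0 gives a single learning curve in either case.
curve = rewards.mean(axis=0)

plt.plot(curve)
plt.xlabel("Time step")
plt.ylabel("Instant reward")
plt.show()
```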

what do you mean by "Otherwise, I found that the reward hardly increased until the last episode."?

Hawkingfans commented 2 years ago

What does "averaged over the trained episodes" mean, and how is it calculated?

Because I printed out the instant rewards and found that in the first episode the reward increases as the steps increase, but by the last episode it stays at about 2 and does not grow.

[screenshot: printed instant rewards]

Therefore, it is not clear to me how to take the reward values to reproduce the figure in the paper.

Thank you very much for your kind help.

baturaysaglam commented 2 years ago

sorry, I was completely confused in my previous answer. the averaging works as follows. let's say you train the model for 20 episodes with 10 random seeds. this means that for each random seed, the agent is initialized from scratch and trained for 20 episodes. let's also suppose that each episode takes 50,000 time steps. therefore, you would have 20 x 10 = 200 instant-reward arrays of size 50,000. usually, in deep RL, you evaluate your model on a distinct evaluation environment, and the rewards achieved in that environment constitute the learning curves. however, the authors missed that and evaluated their method incorrectly. what I did is average the 20 x 10 = 200 instant-reward arrays of size 50,000 into a single reward array of size 50,000.
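
for illustration, a minimal NumPy sketch of that averaging (the arrays are filled with random numbers here, just to show the shapes):

```python
import numpy as np

# Toy illustration of the averaging: 10 seeds x 20 episodes x 50,000 steps of
# instant rewards (random numbers here instead of actual training logs).
num_seeds, num_episodes, num_steps = 10, 20, 50_000
instant_rewards = np.random.rand(num_seeds, num_episodes, num_steps)

# Average the 10 x 20 = 200 reward arrays into one 50,000-long learning curve.
learning_curve = instant_rewards.mean(axis=(0, 1))
print(learning_curve.shape)  # (50000,)
```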

getting to your question, I can say that DDPG is now old-fashioned; it can suddenly diverge and not learn at all. this is a deep RL-related problem, so you don't need to worry. just try different random seeds and use the ones for which the agent converges.
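
a minimal sketch of how one could fix the seeds when trying different runs; the helper below is hypothetical, and the exact seed plumbing in the repo may differ:

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    # Seed every RNG the training loop may touch so that a "good" seed is repeatable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


# Try a handful of seeds and keep the runs for which the agent converges.
for seed in (0, 1, 2, 3, 4):
    set_seed(seed)
    # ... run DDPG training for this seed ...
```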

Hawkingfans commented 2 years ago

Got it, thanks for your training advice. However, I still have some doubts: I observed that the reward hardly increases in the later episodes, so even if I take the average, the curve should still be difficult to reproduce. Figure 7 in the paper gradually increases and then flattens.

baturaysaglam commented 2 years ago

I don't think some of the results in the paper are reliable, and Figure 7 is one of them. DDPG couldn't solve the environment for some seeds, as I said before, so what you are getting is correct. A more powerful algorithm can solve it, though.

Check out this repo of mine. Just switch the parameters mismatch, channel_est_error, and cascaded_channels to False, set policy to SAC, and objective_function to golden_standard. Train the model from scratch for a single episode of 20,000 time steps. Then, you would obtain the same setting as in this repo. SAC is a state-of-the-art deep RL algorithm that can substantially outperform DDPG. You would obtain better results, expected to converge for almost all seeds. This shows that DDPG is unable to solve the environment robustly.
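
For reference, a sketch of those settings; the parameter names are taken from the sentence above, but the exact way the other repo consumes them (CLI arguments, a config file, a dict) is an assumption:

```python
# Hypothetical configuration sketch -- parameter names come from the comment above,
# key names for the episode/step counts are assumptions.
config = {
    "mismatch": False,
    "channel_est_error": False,
    "cascaded_channels": False,
    "policy": "SAC",
    "objective_function": "golden_standard",
    "num_episodes": 1,                 # a single episode, as suggested
    "num_steps_per_episode": 20_000,   # hypothetical key name
}
```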

Hawkingfans commented 2 years ago

I see; then I'll study how SAC works. Thank you so much for your kind help!