BorealisAI / mtmfrl

Multi Type Mean Field Reinforcement Learning

Implementation of Boltzmann exploration #5

Closed: guestreturn closed this issue 2 years ago

guestreturn commented 2 years ago

Hello! Your work on MTMFQ has solved the heterogeneous agents problem, and I have learned a lot from it. However, I do not quite understand the implementation of Boltzmann exploration in the code. https://github.com/BorealisAI/mtmfrl/blob/d2d44796ce76a0eb25303505e272d30c201c75bc/multigather/mfrl/examples/battle_model/algo/base.py#L146 According to the blog and the interactive version, the Boltzmann exploration approach chooses an action with weighted probabilities instead of always taking the optimal action. The original MFRL also selects the optimal action directly rather than sampling from a distribution. So, in my opinion, the Boltzmann exploration approach is not really used in the mean-field methods.

Do you think my understanding of this is correct?

Sriram94 commented 2 years ago

Hello, thank you for your question. There is a subtle difference between the terms "Boltzmann exploration" and "Boltzmann policy". In MFRL and in our paper, we use the "Boltzmann policy": the form of the update corresponds to the Boltzmann update given in Eq. 23 of our paper (or Eq. 12 in MFRL), and the intention is not to induce any exploration through it. Empirically, exploration does not need to be induced in the mean-field updates, since there are many agents in the environment and that provides sufficient stochasticity on its own. Moreover, in our paper and in MFRL, the updates guarantee the Greedy in the Limit with Infinite Exploration (GLIE) assumption, which assures that the optimal action is chosen from the given set of available actions.
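To make the distinction concrete, here is a minimal NumPy sketch of my own (an illustration only, not the exact code in `base.py`):

```python
import numpy as np

def boltzmann_policy(q_values, temperature):
    """Softmax distribution over Q-values (the "Boltzmann policy")."""
    logits = q_values / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 2.0, 0.5])
pi = boltzmann_policy(q, temperature=0.1)

# In the mean-field update (Eq. 23 here / Eq. 12 in MFRL) the softmax weights
# enter the value estimate used in the target ...
v = np.dot(pi, q)

# ... while the executed action is still the greedy one, so the temperature
# induces no exploration by itself:
greedy_action = int(np.argmax(q))

# "Boltzmann exploration" would instead sample the executed action:
explore_action = np.random.choice(len(q), p=pi)
```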

Hope that answered your question.

guestreturn commented 2 years ago

Thanks for your reply! However, I still do not understand why the temperature parameter should be introduced in this case. Since the mean-field methods always select the optimal action, the selected action is independent of the temperature parameter (the argmax of a softmax does not change with the temperature). So I think the temperature parameter in the following lines of code is meaningless. https://github.com/BorealisAI/mtmfrl/blob/d2d44796ce76a0eb25303505e272d30c201c75bc/multibattle/mfrl/examples/battle_model/algo/base.py#L55 https://github.com/BorealisAI/mtmfrl/blob/d2d44796ce76a0eb25303505e272d30c201c75bc/multibattle/mfrl/examples/battle_model/algo/base.py#L145-L146
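For example, a quick numerical check (my own snippet, not code from this repository) shows that the greedy action does not depend on the temperature:

```python
import numpy as np

def softmax(q, temperature):
    z = q / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

q = np.array([0.3, 1.2, -0.4, 0.9])
# the argmax of the softmax is the same for every temperature
print({t: int(np.argmax(softmax(q, t))) for t in (0.01, 0.1, 1.0, 10.0)})
```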

In addition, the author of MFRL mentioned in the original paper:

To balance the trade-off between exploration and exploitation under low temperature settings, we use a policy with Boltzmann exploration and a decayed exploring temperature. The temperature for Boltzmann exploration of MF-Q is multiplied by a decay factor exponentially throughout the training process.

From an implementation perspective, however, the temperature parameter in mean-field methods is redundant.

On the other hand, the "greedy" Boltzmann policy in mean-field methods leads to large differences in experimental results under different seeds. Under some bad seeds, the initial policy makes it difficult for the agents to obtain rewards, which slows down learning. Each experiment should therefore be repeated over a number of independent training runs, which is common practice in reinforcement learning. For the mean-field methods, I wonder whether each experiment is run several times independently with different random seeds to avoid the influence of outliers.

Could you answer my above questions?

Sriram94 commented 2 years ago

Ok, I think there are multiple questions in your comment:

1) Why use the temperature parameter at all?

The temperature is important from a theoretical perspective. If you look at Appendix D of the MFRL paper (https://arxiv.org/pdf/1802.05438.pdf), the proof of convergence requires a condition on the temperature (we need to guarantee a low temperature).

From an empirical perspective, we could easily obtain Boltzmann exploration (choosing actions with weighted probabilities) through the Boltzmann policy if exploration is required. When the number of agents in the neighbourhood is small, Boltzmann exploration is generally considered useful. For example, in the Ising model experiment of MFRL, the authors did use Boltzmann exploration, since the mean field only considers 4 other agents and with so few agents induced exploration is helpful; see lines 55 -- 67 in https://github.com/mlii/mfrl/blob/d9a2dbca6f50687a2d2c2f0d613dc57b4cc4f9a0/main_MFQ_Ising.py#L55. However, in the Battle game experiments, since the number of agents considered for the mean field is large, MFRL does not induce exploration (there is already sufficient stochasticity in the environment).

2) Why use the temperature parameter in the codebase?

Since all environments in our experiments use large numbers of agents (above 200), the temperature variable is not required. However, notice that in our codebase the number of agents is a command line parameter. So if you would like to try our settings with small numbers of agents (about 15 -- 30), then full Boltzmann exploration is recommended (that is why we have retained the temperature variable in the codebase), and the action should be chosen with weighted probabilities, as sketched below. The explanation is related to my answer to 1) above.
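As a rough sketch of what I mean (this is not code from our repository, just one way the retained temperature could be used when the number of agents is small), the action selection would switch from the argmax to sampling, with the temperature decayed over training in the spirit of the Ising model experiment:

```python
import numpy as np

def select_action(q_values, temperature, explore=True, rng=np.random):
    """Sample from the softmax when exploring; otherwise act greedily."""
    logits = q_values / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    if explore:
        return int(rng.choice(len(q_values), p=probs))  # Boltzmann exploration
    return int(np.argmax(q_values))                     # greedy (Boltzmann policy)

# hypothetical exponential temperature decay over training episodes
temperature, decay, min_temperature = 1.0, 0.99, 0.01
for episode in range(2000):
    # ... run one episode, calling select_action(q, temperature) for each agent ...
    temperature = max(temperature * decay, min_temperature)
```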

3) Didn't MFRL mention Boltzmann exploration in their paper?

Yes, but this is in the context of the Ising model; refer to 1) above. So your statement that the temperature parameter is redundant in mean-field methods is not completely correct (it is useful when you have small numbers of agents).

4) For mean-field methods, is each experiment run several times independently with different random seeds to avoid the influence of outliers?

I am not sure I fully understand this question, so please restate it if this misses the point. Yes, it is necessary to repeat the experiments many times with different random seeds, as is standard in RL. Here the variances are expected to be large across independent runs due to the nature of the experiments (many agents executing actions atomically).

I hope that answered your questions.

guestreturn commented 2 years ago

Thank you for your prompt reply! Some of my questions have been resolved, but I still have a few more.

1) It seems that the temperature parameter is only useful in the Ising model and is not involved in the update process in the Battle game experiments. In the Battle game experiments, no matter what the temperature parameter is, it will not affect the learning of the whole algorithm. Is that right?

2) The author of MFRL said in Appendix C.3 of the original paper:

IL and MF-Q have almost the same hyper-parameters settings. The learning rate is $\alpha = 10^{-4}$, and with a dynamic exploration rate linearly decays from $\gamma = 1.0$ to $\gamma = 0.05$ during the 2000 rounds training.

It seems that MF-Q used the $\varepsilon$-greedy exploration approach in the Battle game experiments, but I did not find an implementation of $\varepsilon$-greedy exploration in the code.

3) I found that there is only one learning curve for each algorithm in the papers on mean-field methods, so I am worried about whether a learning curve from a single run can really reflect the performance of the algorithm.

Sriram94 commented 2 years ago
  1. That is correct "in principle". For the experiments designed in MFRL and in our paper, this will not matter. However, you are welcome to try our settings with fewer agents, where it is recommended to choose actions with weighted probabilities through Boltzmann exploration (as mentioned in my previous comment). In this case, the temperature parameter will affect the learning of the algorithms.

  2. Sorry I am unable to answer this. Can you open an issue in their repository?

  3. All of the results in our paper are averages over multiple training seeds. As mentioned in the first paragraph of Section 5 of our paper, we repeat all experiments 50 times and use the averages for our inferences. MFRL uses the same approach, but I do not think they mention how many times they repeated their experiments in their paper (you could open an issue in their repository to ask them).
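Schematically (a hypothetical sketch, not our actual evaluation scripts), the reported curves are produced along these lines:

```python
import numpy as np

def run_experiment(seed, n_episodes=2000):
    """Placeholder for one full training run; returns a reward curve."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=n_episodes).cumsum()   # stand-in for the real rewards

curves = np.stack([run_experiment(seed) for seed in range(50)])  # 50 independent seeds
mean_curve = curves.mean(axis=0)   # the curve reported in the paper
std_curve = curves.std(axis=0)     # spread across independent runs
```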

guestreturn commented 2 years ago

With your help, I am now clear about the details of the mean-field methods. Thank you very much!

The authors of MFRL have not responded to issues in their repository in recent years, so I am not sure I would get a timely response from them. It is strange that the temperature parameter is involved in the proof but has no effect in the experiments. The original intention of the mean-field methods is to handle large numbers of agents, so verifying the necessity of the temperature parameter in a few-agent setting is "putting the cart before the horse". MFRL gives me the feeling that the implementation described in the original paper does not match that in the code.

Anyway, thanks for your answer! POMFQ and MTMFQ are excellent work with solid theory and I benefited a lot from them.

Sriram94 commented 2 years ago

Thank you for your comments. Glad my answers were helpful.

In general, your observations in the previous comment seem valid. I have provided some of my perspectives in the earlier answers. However, I think it is best for the authors of MFRL to respond to your questions/comments on their paper. It is unfortunate that they have not responded to issues in their repository in recent years (maybe you can try emailing them?).

Sriram94 commented 2 years ago

I am going to close this issue. Please feel free to reach out if you need anything else.