GT-STAR-Lab / MARBLER

Multi-Robot RL Benchmark and Learning Environment for the Robotarium | IEEE MRS 2023
https://shubhlohiya.github.io/MARBLER/
MIT License

Reproducibility: VDN Results #16

Open gsavarela opened 4 months ago

gsavarela commented 4 months ago

Hi,

Thanks for your contribution to the communities of reinforcement learning and robotics.

Unfortunately, I am having problems reproducing the VDN results for the tasks Arctic Transport, Material Transport, and Predator Capture Prey in Table II of the article. Oddly enough, Warehouse seems okay. Could you confirm that your method of aggregating runs follows Papoudakis et al. 2021?

Maximum returns: For each algorithm, we identify the evaluation timestep during training in which the algorithm achieves the highest average evaluation returns across five random seeds. We report the average returns and the 95% confidence interval across five seeds from this evaluation timestep.
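For reference, a minimal sketch of that aggregation protocol (the array layout, names, and dummy data are purely illustrative, not taken from my actual scripts):

```python
import numpy as np

def max_returns_with_ci(returns: np.ndarray):
    """returns[s, t]: average evaluation return of seed s at evaluation point t."""
    mean_per_timestep = returns.mean(axis=0)      # average across seeds
    best_t = int(mean_per_timestep.argmax())      # evaluation point with highest average
    best = returns[:, best_t]                     # per-seed returns at that point
    ci95 = 1.96 * best.std(ddof=1) / np.sqrt(len(best))  # 95% normal CI across seeds
    return best.mean(), ci95

# Example with dummy data: 5 seeds, 40 evaluation points
rng = np.random.default_rng(0)
dummy = rng.normal(size=(5, 40)).cumsum(axis=1)
print("%.3f +/- %.3f" % max_returns_with_ci(dummy))
```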

Moreover, the configuration files for each experiment are consistent with #13. The undiscounted returns and their respective 95% (normal) confidence intervals for each task are as follows:

Arctic Transport: -28.315 +/- 0.89
Material Transport: 21.895 +/- 0.74
Predator Capture Prey: 125.094 +/- 2.45
Warehouse: 28.572 +/- 0.44

While the ones in the paper are:

Arctic Transport: -6.98 +/- 1.75
Material Transport: 5.15 +/- 1.3
Predator Capture Prey: 33.25 +/- 0.46
Warehouse: 28.7 +/- 1.49

Additionally, I am attaching the plots obtained for each task; the trend for the algorithm is consistent with the published curves (Figure 3):

[Attached plots: ArcticTransport, MaterialTransport, PredatorCapturePrey, Warehouse]

What am I missing? Should I normalize by the number of agents? EPyMARL is built for cooperative MARL, so perhaps the per-agent reward signals are being aggregated into a joint reward? Could you please clarify?

Regards, Guilherme Varela

RezaTorbati commented 6 days ago

Hello,

I apologize for responding to you so late! I somehow missed this issue until just now.

It seems that you may be using different scenario hyperparameters than we did. For example, in Predator Capture Prey we used 6 prey with a sensing reward of 1 and a capture reward of 5. If all 6 prey were sensed and then captured in a single time step, the reward for an episode would be 36 (5*6 + 1*6). That is impossible, but it gives an upper bound on the maximum episode reward for PCP, so a return of over 120 implies that you are either using different rewards or more prey, or that there is a bug somewhere in how you are saving the results.
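To make the arithmetic explicit (the variable names below are just for illustration, not taken from the config file):

```python
# Quick check of the bound quoted above
num_prey = 6
sense_reward = 1
capture_reward = 5

# Each prey can be sensed at most once and captured at most once per episode,
# so the episode reward is bounded by:
max_episode_reward = num_prey * (capture_reward + sense_reward)
print(max_episode_reward)  # 36
```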

Section III of the supplementary material contains all of the configuration parameters we used for each scenario. To see how your scenario is configured, go to robotarium_gym/scenarios/<scenario name>/config.yaml.
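If it helps, a quick way to dump a scenario's active configuration for comparison with the supplementary material (the scenario directory name below is an assumption; substitute whichever scenario you are checking):

```python
import yaml

# The scenario directory name is a guess at the layout; substitute the one you
# are checking under robotarium_gym/scenarios/.
scenario = "PredatorCapturePrey"
config_path = f"robotarium_gym/scenarios/{scenario}/config.yaml"

with open(config_path) as f:
    config = yaml.safe_load(f)

# Dump every key/value pair so it can be compared against Section III of the
# supplementary material (reward values, number of prey, episode length, ...)
for key, value in sorted(config.items()):
    print(f"{key}: {value}")
```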

If checking this does not solve your problem: the average returns reported for each scenario were gathered by literally running the "best" agent for each scenario from each algorithm 100 times and then averaging the returns. The agents we did this with are saved in robotarium_gym/scenarios/<scenario name>/models. While I don't remember saving the seeds for those specific runs, evaluating these models should give you a score very close to what we got, provided you are using the same scenario configuration.
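For reference, that evaluation amounts to something like the sketch below; make_env and load_policy are placeholders, not actual functions from this repo, and the done handling may differ (it can be per-agent in practice):

```python
import numpy as np

# Placeholders: substitute the actual environment constructor and the loader for
# the checkpoints under robotarium_gym/scenarios/<scenario name>/models.
env = make_env("<scenario name>")        # hypothetical helper
policy = load_policy("<checkpoint>")     # hypothetical helper

episode_returns = []
for _ in range(100):                     # 100 evaluation episodes, as described above
    obs = env.reset()
    done, ep_return = False, 0.0
    while not done:
        actions = policy.act(obs)        # greedy actions from the trained agents
        obs, reward, done, info = env.step(actions)
        # reward is a shared float in most scenarios and a per-agent list in Warehouse
        ep_return += sum(reward) if isinstance(reward, (list, tuple)) else reward
    episode_returns.append(ep_return)

print(f"{np.mean(episode_returns):.3f} +/- "
      f"{1.96 * np.std(episode_returns, ddof=1) / np.sqrt(len(episode_returns)):.3f}")
```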

Again, I'm sorry for responding so late! Please let me know if this does not solve your problem.

RezaTorbati commented 6 days ago

Now that I'm re-reading your comment, it does seem that you should normalize by the number of agents in all scenarios except Warehouse. The reward that gets passed after each step is either a single float shared by all agents or a list with a different value for each agent: every scenario except Warehouse returns the shared float, while Warehouse returns the per-agent list.

When generating the return for each episode, we summed the rewards across all timesteps. To get the reward for a single timestep, we either took the returned float directly or summed the list into a single value, depending on what was returned.
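In code, that aggregation (plus the per-agent normalization mentioned above) would look roughly like the sketch below; the function and variable names are illustrative:

```python
def timestep_reward(reward):
    """Collapse whatever env.step returned into one scalar for the timestep.

    Warehouse returns a list with a different value per agent, so we sum it;
    the other scenarios return a single float shared by all agents.
    """
    return sum(reward) if isinstance(reward, (list, tuple)) else reward


def episode_return(rewards, n_agents, normalize=False):
    """Sum per-timestep rewards over an episode.

    Set normalize=True (divide by n_agents) for the shared-float scenarios if
    your logging counted that float once per agent, which would inflate the
    return by roughly a factor of n_agents.
    """
    total = sum(timestep_reward(r) for r in rewards)
    return total / n_agents if normalize else total
```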

Sorry for the confusion, I hope this answers your question!