Farama-Foundation / Arcade-Learning-Environment

The Arcade Learning Environment (ALE) -- a platform for AI research.
https://ale.farama.org/
GNU General Public License v2.0

non-scalar rewards (e.g. Space Shuttle) #191

Closed · nczempin closed this issue 7 years ago

nczempin commented 7 years ago

I have the crazy idea of adding perhaps the most complex Atari 2600 game ever, Space Shuttle, to the ALE.

It requires a considerable amount of patience and, perhaps not too surprisingly, a considerable amount of brainpower.

Aside from the game's creative use of console buttons (as opposed to, or rather in addition to, joystick input), I anticipate at least the following issue:

According to the Space Shuttle manual, it is desirable both to achieve multiple dockings and to have as much fuel left as possible. So there are multiple values that could be interpreted as the reward, and the implied reward is, to a first approximation, some kind of linear combination of them.

Now I could go the easy route and simply choose one of these linear combinations and expose the result (e.g. dockings * 10000 + fuel_left), but it would be essentially arbitrary. The true combination could also be nonlinear.

So I would prefer to be able to expose the whole vector of rewards, most likely in addition to the existing scalar one, so as not to break any existing code.

Once that exists, there would be the option of exposing other variables, such as number of lives left, or perhaps ammunition for some games, that would let agents use their own interpretations of reward.

The existing interface should still remain (as mentioned), and perhaps it would be useful to provide one "default" mapping from the reward vector to a scalar. However, that would (or rather should) probably involve some serious discussion before anything "official" is set in stone.
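
To make this concrete, here is a rough sketch of the kind of thing I have in mind. None of it is real ALE API; getRewardVector(), the component names, and the numbers are all made up for illustration:

```cpp
// Hypothetical sketch only; nothing here exists in today's ALE.
#include <string>
#include <utility>
#include <vector>

// One named reward component, e.g. {"dockings", 2} or {"fuel_left", 3114}.
using RewardComponent = std::pair<std::string, double>;
using RewardVector = std::vector<RewardComponent>;

// Imagined addition alongside the existing scalar act()/reward path.
// Games that expose nothing extra would return an empty vector, so
// existing pixel-only clients are unaffected.
RewardVector getRewardVector();  // would live on ALEInterface

// A client could then pick its own scalarization, e.g. for Space Shuttle:
double myReward(const RewardVector& v) {
  double dockings = 0, fuel = 0;
  for (const auto& [name, value] : v) {
    if (name == "dockings") dockings = value;
    else if (name == "fuel_left") fuel = value;
  }
  return dockings * 10000 + fuel;  // the arbitrary combination from above
}
```

The point being that the arbitrary part of the choice moves out of the ALE and into the client.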

mgbellemare commented 7 years ago

From what you describe, Space Shuttle sounds like a serious endeavour, maybe better suited to a game-specific fork (with the right actions baked in)? It would certainly require a revamp of the actions interface.

Reading the manual, my immediate thought is that in the absence of a score I would make the reward function based on the rank attributed to you: +1 for "Payload Specialist", up to +4 for "Commander". Yes, that's harsh, but that's how the other reward functions (where appropriate) were crafted! You could imagine using something like intrinsic motivation to get the agent to play the game without reward.
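
Concretely, it would be crafted in the usual per-game style: read the rank from RAM each step and emit the difference. Just a sketch; the class and RAM address here are invented, and the real rank location would have to be found by inspecting the game's memory:

```cpp
// Hypothetical sketch of a rank-based reward for Space Shuttle.
// RANK_ADDRESS is made up; finding the real byte is the actual work.
#include <algorithm>

constexpr int RANK_ADDRESS = 0x80;  // invented RAM location

class SpaceShuttleReward {
  int m_last_rank = 0;  // 0 = none, 1..4 = Payload Specialist .. Commander

 public:
  // Called once per emulator frame with the current 128-byte RAM.
  int step(const unsigned char* ram) {
    int rank = std::min<int>(ram[RANK_ADDRESS], 4);
    int reward = std::max(rank - m_last_rank, 0);  // +1 per rank gained
    m_last_rank = rank;
    return reward;
  }
};
```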

nczempin commented 7 years ago

Yes, I'd thought about just using the "ranks". It's probably less arbitrary than any other scheme.

Or perhaps with one additional signal: Achieving one docking but with less fuel than required for that first rank.

After all, the vast majority of humans playing this game will not reach rank 1; the "write in to get patches" scheme by Activision was meant to encourage the try-hards...

I agree that just supporting the actions would take considerable effort for one particular game, without a huge benefit overall. That's one of the reasons I called it "crazy" :-)

I'm not a big fan of forks, though; eventually it should be possible to support all the Atari 2600 games. A fork would make sense if this were some pet project of someone who wanted to get something out the door quickly.

It's not that high on my todo list right now.

nczempin commented 7 years ago

"You could imagine using something like intrinsic motivation to get the agent to play the game without reward."

That looks interesting, @mgbellemare; I'm immediately thinking about running this on Adventure. Is the code available somewhere?

nczempin commented 7 years ago

I think that providing a vector of rewards, or of generic information about a game that clients can use to construct their own reward functions, would be useful independently of Space Shuttle.

In providing Adventure I made an arbitrary choice to use only the results "eaten by dragon" and "brought the chalice to the golden castle". While that is (IMHO) an acceptable scheme for end-to-end learning, some people may want to use other bits of information; a sort of pre-provided feature set.

This can be done without affecting those who really just want to learn generically from the pixels, by offering optional interfaces to game-specific state.

So while for many the end goal is not to use any domain knowledge, perhaps some want to work on particular games. It is not fair to expect the users to extract this from the RAM state; the people who know how to find the features are the same ones who know how to find the score and the terminal condition. So it should be an optional service of the ALE.
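
To sketch what such an optional service could look like (getFeature() and the feature names here are invented for illustration; this is not the existing interface):

```cpp
// Hypothetical optional feature interface; pixel-only users would
// simply never call it. getFeature() does not exist in the real ALE.
#include <optional>
#include <string>

class ALEInterface {
 public:
  // Returns the named feature if the loaded game exposes it,
  // std::nullopt otherwise.
  std::optional<double> getFeature(const std::string& name) const;
};

// Stub so the sketch stands alone; a real game module would fill it in.
std::optional<double> ALEInterface::getFeature(const std::string&) const {
  return std::nullopt;
}

// An Adventure user who wants a different scheme could then build
// their own reward from the exposed bits:
double adventureReward(const ALEInterface& ale) {
  double r = 0;
  if (auto eaten = ale.getFeature("eaten_by_dragon"); eaten && *eaten > 0)
    r -= 1;  // penalize being eaten by the dragon
  if (auto done = ale.getFeature("chalice_in_gold_castle"); done && *done > 0)
    r += 1;  // reward bringing the chalice home
  return r;
}
```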

nczempin commented 7 years ago

Oh, and we're already doing this, by providing minimal actions and lives in some cases. I'd just like this to be more uniform, dynamic, and extensible.

mgbellemare commented 7 years ago

I like the idea, but I'll have to pass on this one.

I would argue both the minimal action set and the lives information, which were added afterwards, have hurt more than helped the ALE project. With this extra information, it becomes just a little harder to distinguish truly generally competent agents from those that rely on that specific information. For example, there are agents that perform really poorly on Breakout if they aren't provided with the lives information, but very well otherwise.

While I agree it's tempting to add it "just for when it's needed", history shows any extra information ends up getting folded in as part of the default eventually, making state-of-the-art agent design more complicated.

nczempin commented 7 years ago

Good points. I guess the underlying concern might be that it is not always clear what the rewards should be (see Adventure), and while you clearly put a lot of thought into them, others may want to use different ones (I guess I may just be repeating myself here); your approach seems to assume there should be only one way to use the ALE.

I see where you're coming from, though. My idea for resolving the potential "how can we extend the ALE without making it virtually impossible to compare research results?" problem would be to provide certain default settings; researchers deviating from them would need to communicate their changes.

However, I'm pretty sure that you'll have this problem anyway, for example when supporting additional settings. Even now it's very possible that results are less comparable because people may have used different ROMs from each other (I don't think they differ wildly, but we can't be sure).

Frame skipping and how deterministic the settings are seem to raise similar issues, and I think the way forward is a unified way of handling all of these variations.
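
For instance, it would already help a lot if everyone reported the handful of knobs that matter alongside their results. Something like this, using settings the current C++ interface already exposes (the ROM filename is just illustrative):

```cpp
// Pin down the sources of variation people tend to leave unreported.
#include <ale_interface.hpp>

int main() {
  ALEInterface ale;
  ale.setInt("random_seed", 123);                   // fixes stochasticity
  ale.setInt("frame_skip", 4);                      // act every 4th frame
  ale.setFloat("repeat_action_probability", 0.25);  // sticky actions
  ale.loadROM("space_shuttle.bin");                 // which ROM version?
  return 0;
}
```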

Another aspect may be this: is the only goal to provide an environment for agents to get from pixels to actions to rewards? IMHO one goal could be to give agents the same conditions as humans. While some games are simple enough that humans can play them without any help, some actually require humans to e.g. read the manuals. And while I would find it fascinating to see whether the approach from that FreeCiv paper could be useful in the ALE, in the meantime these intermediate rewards would be a useful proxy for what can be extracted from manuals (and could be used in the context of "curriculum learning" as well).

I hope that by writing this wall of text I'm not giving the impression that this feature is enormously important to me. I'm perfectly fine with your decision; if it were all that important to me I could always just implement it and then we could have this discussion again ;-)
