PWhiddy / PokemonRedExperiments

Playing Pokemon Red with Reinforcement Learning

changing novelty reward to encourage return to pokemon center to heal #53

Open owqejfb opened 1 year ago

owqejfb commented 1 year ago

Watched the video last night and wow, loved the challenge and the topic.

One thing that stood out to me is how the agent just pushes forward until it gets wiped out, then repeats. I assume its current strategy is to push forward until it wipes, over and over, until the team is strong enough to get through the next area.

I was wondering if you had thought of any changes to the novelty reward that would make it play more like a human: seek novelty when the team is at high HP, but seek familiarity (a Pokemon Center) when the team is at low HP.

I had an idea for a formula that weights the novelty reward by the team's current HP percentage (maybe combined with something else to prevent a bias against getting a team full of Snorlax/Chansey). When you are healthy, you seek novelty, but as your HP gets lower, the novelty reward flips: you reject new frames and seek old ones, trying to heal your Pokemon back up before being wiped out and losing money.
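One way to sketch that flip (just an illustration of the idea, not tested against the repo's reward code) is to scale the per-new-frame reward by a factor that runs from +1 at full HP down to -1 near a wipe:

def novelty_scale(hp_fraction: float) -> float:
    # +1 at full HP -> seek new frames; -1 near zero HP -> new frames are penalized
    return 2.0 * hp_fraction - 1.0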

Looking forward to working on this project when I get the chance, but again, curious whether you have thought of a way to get the agent to go back to the Pokemon Center before being wiped out.

RWayne93 commented 1 year ago

@pernutbrian I could be wrong, but based on reading this and the existing healing function, I figured the reward for exploration could be scaled by the party's HP, for example:

def exploration_weight(self):
    # Scale exploration by party HP: 1.0 at full HP, down to 0.5 at zero HP.
    hp_fraction = self.read_hp_fraction()
    return 0.5 + 0.5 * hp_fraction

Then we use this weight in the exploration reward:

def get_knn_reward(self):
    # Scale both per-frame reward rates by the HP-based weight.
    exploration_factor = self.exploration_weight()

    pre_rew = 0.004 * exploration_factor
    post_rew = 0.01 * exploration_factor

    # Reward is proportional to how many distinct frames the KNN index holds.
    cur_size = self.knn_index.get_current_count()
    base = (self.base_explore if self.levels_satisfied else cur_size) * pre_rew
    post = (cur_size if self.levels_satisfied else 0) * post_rew

    return base + post

Or maybe do something like this for when the party is critically low or we are down to one Pokemon, etc.:

def exploration_weight(self):
    hp_fraction = self.read_hp_fraction()
    if hp_fraction < 0.1:  # If HP is less than 10%
        return 0.1
    return 0.5 + 0.5 * hp_fraction

Not sure how this will affect the overall performance of the agent; it's just something I could whip up real quick.

Editing because I guess I didn't address the novelty that you mentioned. If I remember correctly, the video says the agent's memory only covers 3 frames (I could be wrong on this), which he said was due to memory constraints while training. With that in mind, maybe something like hashing frames into a set, building off the create_recent_memory method in red_gym_env.py, and then producing a novelty reward from that and adding it to the game_state_rewards.
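A rough sketch of that hashing idea (the class name and the reward value below are made up for illustration, not part of the repo):

import hashlib

import numpy as np


class HashNoveltyTracker:
    """Remembers which (downscaled) screens have been seen by hashing their bytes."""

    def __init__(self, reward_per_new_frame=0.005):
        self.seen_hashes = set()
        self.reward_per_new_frame = reward_per_new_frame

    def update(self, frame: np.ndarray) -> float:
        # Hash the raw pixel bytes; only unseen frames earn a reward.
        frame_hash = hashlib.sha1(frame.tobytes()).hexdigest()
        if frame_hash in self.seen_hashes:
            return 0.0
        self.seen_hashes.add(frame_hash)
        return self.reward_per_new_frame

The drawback compared to the KNN check is that a single changed pixel makes a frame count as brand new, so the frames would probably need an aggressive downscale first.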

If this is incorrect, I apologize; machine learning isn't my strong suit. Lately I have been working on optimizations for projects using compiled languages (mainly Rust) and creating Python bindings for them.

owqejfb commented 1 year ago

This is generally where I think the reward should go, because currently the agent's brute-force strategy means that as long as it battles enough and gets overlevelled, it will succeed (assuming the issue with getting past caves is solved).

Ideally, I think there should be a negative reward for exploration when you are close to being wiped out, since that's a situation human players tend to avoid.
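A small variant of RWayne93's low-HP version above could express that, with the weight actually going negative; the 0.15 threshold and the -0.5 penalty are arbitrary guesses, not tuned values:

def exploration_weight(self):
    hp_fraction = self.read_hp_fraction()
    if hp_fraction < 0.15:
        # Near a wipe: new frames actively cost reward, so familiar screens
        # like the Pokemon Center start to look relatively better.
        return -0.5
    return 0.5 + 0.5 * hp_fraction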

Hopefully one of these days I'll have time to fork, try the reward change, and come back with a solution instead of issues!

dukemagus commented 1 year ago

Remember that part of the idea is to see how the RL agent behaves given the meta objective and (in theory) with the minimum hand-holding possible, not to make a full chart of every objective and meta objective in the game (at that point it would be more of a TAS than an observation of emergent behavior).

From what I understand, the experiment's meta objective is "beat the game", or at least "advance as far as possible", with the fewest rules, so we can see how the AI tries, fails, and iterates with each new attempt.

If you give it points for going to the Pokémon Center, it'll leave and re-enter the same building forever to rack up points. If you add complex rules, you prevent the AI from adopting healing as a reinforced behavior on its own, i.e. healing because it lets it go further and earn more points without getting stuck. Also remember the algorithm itself is greedy, with no pre-programmed concept of long-term investment and zero spatial awareness. There are no "what am I doing" or "which coordinates am I at right now" methods hard-coded into it; it will always try to earn the most points at each button press and avoid losing points as much as possible.

Another problem is that buildings, and especially Pokémon Centers, are mostly identical, so it's harder for a program designed to discover new screens to get credit for visiting different Pokémon Centers; they are too similar to earn points.

IF, and only IF, @PWhiddy is comfortable with people expanding the general idea, people could pool ideas of what they want to study and how to examine that behavior with the minimum amount of instructions, and whoever is more comfortable or has more resources to run the simulation could manage forks with different intents.

Lawbayly commented 1 year ago

The exploration reward really has its issues and I think there must be a better way to do it. Its issues really show up in the cave, where the AI needs to backtrack or can't figure out that there is a wall. Part of me wants to punish collisions with walls (maybe based on the bump sound), or have the score decay when it doesn't make progress for a period of time, but I suspect I might just lower the exploration reward.
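A rough sketch of the decay idea, assuming a hypothetical steps_since_new_frame counter that gets reset whenever a new frame is found and incremented in the step loop otherwise:

def get_decayed_explore_reward(self, base_explore_reward):
    # Halve the exploration reward for every 500 steps spent without finding
    # a new frame, so pacing the same screens stops paying off over time.
    decay = 0.5 ** (self.steps_since_new_frame // 500)
    return base_explore_reward * decay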

owqejfb commented 1 year ago

@dukemagus going to reply to your points in a list for ease since I’m on mobile

  1. My idea wasn’t to make a chart of objectives; it’s to scale the novelty reward by HP to address an issue I see. He did that kind of reward tuning throughout the video.

  2. My idea for the change to novelty (which is very loose) won’t make the agent spam the Pokémon Center, because once it has full HP, the preference for familiarity over novelty is gone and it’s back to novelty over familiarity.

  3. You bringing up that it’s greedy is a good point; I can possibly see the agent not being able to reach a new area again once everything around the Pokémon Center has been visited. It could also put itself in a loop where it purposely gets to low HP just to get the boost at the center.

  4. Why would someone make a public repo and not want people to expand on it? I mentioned this idea of making the novelty reward a dynamic variable based on HP because I was curious whether he had tried something like it, so I’d know for when I have time to work on a fork. I wasn’t asking for permission to change anything, or for him to change his current implementation.

owqejfb commented 1 year ago

@Lawbayly one thing I’ve thought of to get out of the cave is trying to run the project on FireRed instead of Red. Peter said that an issue was the cave sections being so similar that it can’t tell it’s in a new area; I imagine there must be a bit more graphical difference, since FireRed is a GBA game with much richer graphics than the 8-bit original.

This is a pretty big change to the project, but I wonder if the issue at hand is “the algo doesn’t work yet” or “the graphics suck too much”. For all we know, the current implementation could beat the E4 but is limited by the graphics.

Lawbayly commented 1 year ago

I think the problem with switching to FireRed is the processing involved in ingesting the frames; I don't think it's an accident he picked the original game and left it in black and white.

He is literally pulling the frames in and checking whether they are radically different from the previous frames (I haven't found the exact code that controls how different counts as "radical" yet).
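For reference, the general shape of that check is something like the sketch below. This is not the repo's exact code; the dimension and distance threshold are placeholder numbers:

import hnswlib
import numpy as np

vec_dim = 4320                    # length of a flattened, downscaled frame (placeholder)
similar_frame_dist = 2_000_000.0  # how far a frame must be to count as "new" (placeholder)

knn_index = hnswlib.Index(space="l2", dim=vec_dim)
knn_index.init_index(max_elements=20000)

def frame_is_novel(frame_vec: np.ndarray) -> bool:
    # Add the frame to the index only if it is far enough from everything seen so far.
    if knn_index.get_current_count() == 0:
        knn_index.add_items(frame_vec, np.array([0]))
        return True
    _, distances = knn_index.knn_query(frame_vec, k=1)
    if distances[0][0] > similar_frame_dist:
        knn_index.add_items(frame_vec, np.array([knn_index.get_current_count()]))
        return True
    return False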

The issue is the memory isn't perfect either way. I've modified my version to give it 63 Poke Balls and have added the levels of the Pokemon in the boxes as well, to try to encourage it to catch more Pokemon. I'm also thinking of giving it Escape Ropes to see if it can use those to get around when it gets stuck, instead of just keeping going until it faints.

For the FireRed idea... pyboy doesn't have GBA support as far as I'm aware, so you would need to use something else or build your own Python GBA emulator.

As for beating the Elite Four... I'm doubtful. For one, it will get stuck because it won't realise it needs to teach Cut to a Pokemon at a particular point, and its party will likely consist of whatever it happens to catch, so a Blastoise, a Pidgeot, a Magikarp, a Graveler, a Golbat, and a Raticate.

RWayne93 commented 1 year ago

@Lawbayly I was also thinking about the "Cut" issue recently, for getting to the 3rd gym. With it all being under one RL network, I don't know how it will learn that it needs to use Cut to progress further.

owqejfb commented 1 year ago

@RWayne93 I think currently it can't. I saw in another issue that @PWhiddy realized the Start button has been disabled, so it won't be able to open the menu and teach Cut to a Pokemon. The biggest issue I've been thinking of is how to get back to Viridian City for the 8th gym.

RWayne93 commented 1 year ago

The issue there would be somehow getting it to learn Fly for easy backtracking.

dukemagus commented 1 year ago

Flying and backtracking efficiently is a future concern, IMO, given that the RL agent has no memory and can't read, infer, or store information from chat bubbles. That includes returning to a previous area: right now it doesn't understand "where" it is. There's no hardcoded or emergent info saying "this place is Pallet Town", for example. It won't even know when it has to come back, and there's even the risk it would randomly fly to any city it can at any moment.

From as far as the model went during the video, we still don't have a complete grasp of its "break points". And if you want an approach that gets as much autonomy out of the model as possible, you need to watch those break points happen and "patch in" new parameters that let it progress beyond them with the least amount of instruction.

Otherwise, you could just make it a memory reader and ignore visual data (there are other projects like that), or copy all the data tables from Bulbapedia and make it know every rule of the game before even starting. But at that point it's more of a TAS than an emergent-behavior experiment.

erick6697 commented 1 year ago

First of all, sorry for my bad English, it's not my main language. I've been thinking about this for a couple of hours and I've come up with some possible solutions and ideas.

The first thing I want to mention is the penalty for losing battles. There's a penalty for losing battles, but I think there should also be a reward for winning battles. Right now the agent's "logic" is:

Every time it gets into a battle, the only thing it cares about is NOT losing; it doesn't really care about winning.

The action of losing/winning a battle isn't really a bad/good thing in itself; the real penalty/reward comes from the consequence of it. Which is:

I apologize in advance for bad terminology, I don't really understand the technical side of the project (I'm taking some programming and AI courses to fully understand it), but I do get the logical part and I wanted to share my ideas as I think they could contribute.
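A very rough sketch of the win-reward idea from the first point, counting knocked-out opposing Pokemon as a proxy for winning battles. The addresses IN_BATTLE_FLAG_ADDR and ENEMY_HP_ADDR are placeholders, not the real RAM locations; read_m and read_hp are assumed to behave like the env's existing memory helpers; and prev_enemy_hp / enemy_faint_count would be initialized in reset():

def update_battle_win_reward(self):
    # IN_BATTLE_FLAG_ADDR and ENEMY_HP_ADDR are placeholder addresses.
    in_battle = self.read_m(IN_BATTLE_FLAG_ADDR) != 0
    enemy_hp = self.read_hp(ENEMY_HP_ADDR)
    if in_battle and self.prev_enemy_hp > 0 and enemy_hp == 0:
        # The opposing Pokemon just fainted; count it toward "winning".
        self.enemy_faint_count += 1
    self.prev_enemy_hp = enemy_hp if in_battle else 0
    return 0.1 * self.enemy_faint_count  # reward scale chosen arbitrarily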