BeTomorrow / ReImproveJS

A framework using TensorFlow.js for Deep Reinforcement Learning
MIT License

How to execute actions (output) #11

Open samid737 opened 5 years ago

samid737 commented 5 years ago

Hi,

I am still new to reinforcement learning. I remember ReinforceJS had an act() function, which takes the state variables as input and outputs an action. How can we act upon the output of the RL agent in ReImproveJS?

thanks in advance!

Pravez commented 5 years ago

In reinforcement learning, the "only" way you have to act on the agent is through the reward system. You have to balance positive and negative rewards well in order to make your agent learn. For instance, if you want your agent to go to the right, you will likely give a positive reward when its x position increases, and a negative reward when it decreases.

ReImproveJS does "everything" for you, meaning that in your step() logic you just have to call either academy.addRewardToAgent(agent, -1.0) or academy.addRewardToAgent(agent, 1.0), depending on whether your agent did what it was supposed to do.
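
For the "go to the right" example, your step logic would contain something like this (just a sketch of the idea; previousX is a variable from your own code, not from ReImproveJS):

// Sketch: reward or penalize the agent at every step, depending on what it just did.
if (robot.x > previousX) {
    academy.addRewardToAgent(agent, 1.0);    // it moved to the right: reward it
} else {
    academy.addRewardToAgent(agent, -1.0);   // it did not: penalize it
}
previousX = robot.x;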

Let me know if this makes it clearer :)

samid737 commented 5 years ago

Hello Pravez, thanks for replying! I am still a bit confused, to be honest. To clarify my goal, here is a test case:

http://jsfiddle.net/ydaqhpwL/3/

There is a controllable player and a robot (in red, the RL agent). The player can be moved using the up/down/left/right arrow keys.

The learning objective is to follow the player. The reward is calculated from the distance between the two (in the calculateReward function).
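
Roughly, the idea is the following (a simplified sketch, not the exact fiddle code):

// Simplified idea of calculateReward: being close to the player is good, being far is bad.
function calculateReward(player, robot) {
    var dist = Phaser.Math.Distance.Between(player.x, player.y, robot.x, robot.y);
    return -dist / 100;   // closer = higher (less negative) reward; the scaling is arbitrary here
}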

I was thinking the current setup could work, but after 10-15 s the game freezes for a bit and afterwards the robot drifts away. I previously used the actionsBuffer (line 137) to make the robot (red) move, but I'm not sure that makes sense anymore.

I did get some sensible results with ReinforceJS in a test case with a comparable objective, simulation world and reward scheme:

https://codepen.io/Samid737/pen/opmvaR

Here, in line 85, one of the four possible actions is chosen by the agent.

P.S: I would gladly help out with creating examples (using Phaser JS or other frameworks) if that is on the roadmap. Thanks again in advance.

Pravez commented 5 years ago

Indeed, the example with ReinforceJS works really nicely!

Thank you for your example, I think it will suit perfectly as an example for ReImproveJS!

First of all, you have some "errors" in your code that might change some of the data the agent is learning on.

line 127

var s = [player.y,player.y,robot.x, robot.y];

I think you wanted

var s = [player.x,player.y,robot.x, robot.y];

line 185

 var dist = Phaser.Math.Distance.Between(player.x,robot.x,player.y,robot.y);

According to the Phaser.js documentation, it should probably be more like

 var dist = Phaser.Math.Distance.Between(player.x,player.y,robot.x,robot.y);

Also, the step() function from the academy already returns the action the agent took, so you just have to call .get(agent) on the result of step() to get the action. There is no need to dig directly into the agent's data.

ReImproveJS, because it is built on top of TensorFlow.js, learns in a different way than ReinforceJS. It continuously makes WebGL calls (and therefore uses your GPU), and needs some more "time" to do backpropagation (that's why the step() function is async). I designed the learning phase so as to reduce the impact of this time a bit.
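
In your update logic, that gives something like this (a sketch following the README; teacher, agent and the robot movement come from your own setup):

// Sketch: ask the academy for the agent's action, then act on it.
// step() is async, so await it (or chain a .then()).
const result = await academy.step([{teacherName: teacher, agentsInput: [player.x, player.y, robot.x, robot.y]}]);
if (result !== undefined) {
    const action = result.get(agent);          // index of the chosen action, e.g. 0..3
    switch (action) {                          // map it onto your game: move the robot
        case 0: robot.y -= 1; break;           // up
        case 1: robot.y += 1; break;           // down
        case 2: robot.x -= 1; break;           // left
        case 3: robot.x += 1; break;           // right
    }
}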

Your agent has X learning sessions of Y steps each. At the end of those Y steps, it trains on the whole session it recorded. After the X sessions are over, it only does inference. For your use case, you should use many very short sessions. Also, your agent (I think) cares a lot about its future rewards, so your gamma should be quite high. Another thing: your agent trains on randomly selected parts of its memory. If its memory is too big and is mostly filled with bad decisions, it will statistically pick more bad decisions to learn from and will stay stuck on them indefinitely. One way to improve this is to drastically reduce the memory size, so that over time the bad decisions get replaced with good ones, and so on.

And, a little detail (I think my documentation is wrong here, so my bad): it's lessonLength, not lessonsLength.

Here is the configuration I would have used :+1:

const teacherConfig = {
    lessonsQuantity: 10000,                // Number of learning sessions
    lessonLength: 20,                      // Steps per session: many very short sessions
    lessonsWithRandom: 0,                  // We do not care about full random sessions
    epsilon: 0.5,                          // Maybe a higher random rate at the beginning?
    epsilonDecay: 0.995,                   // Exploration rate decay over time
    epsilonMin: 0.05,                      // Floor for the exploration rate
    gamma: 0.9                             // High discount: the agent cares about future rewards
};

const agentConfig = {
    model: model,                          // Our model corresponding to the agent
    agentConfig: {
        memorySize: 1000,                      // The size of the agent's memory (Q-Learning)
        batchSize: 128,                        // How many tensors will be given to the network when fit
        temporalWindow: temporalWindow         // The temporal window giving previous inputs & actions
    }
};
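
Both configs are then plugged into the academy as shown in the README (sketch):

const academy = new Academy();
const teacher = academy.addTeacher(teacherConfig);
const agent = academy.addAgent(agentConfig);
academy.assignTeacherToAgent(agent, teacher);   // link them so this teacher trains this agent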

Here is the link to the modified jsfiddle : http://jsfiddle.net/ydaqhpwL/7/

That said, it does not work either :disappointed:. I will try to investigate further soon and keep you informed.

Thank you very much anyway for your interest and work :+1:

samid737 commented 5 years ago

Thanks Pravez, I did indeed overlook the x's and y's. Yes, RL is extremely interesting, especially when seeing actual results/use cases; Phaser really helps with this imo. I do need to dive into the theory... It seems like the async operations don't play very well with update(), but I can't really tell what's causing the hiccups. Maybe @photonstorm has a clue ;p?

RGBKnights commented 5 years ago

I have been interested in this project for some time now as I plan on using it within my Screeps environment. I finally had time to sit down with it and worked out an example. For those interested, I created a gist with a more or less complete example: https://gist.github.com/RGBKnights/756b5f51465cc22d0ca39205979ad2a1

Pravez commented 5 years ago

Thank you for your time! I'll add this example to the README. I will certainly create a new one based on yours when I have enough time.