Grzego / async-rl

Variation of "Asynchronous Methods for Deep Reinforcement Learning" with multiple processes generating experience for agent (Keras + Theano + OpenAI Gym)[1-step Q-learning, n-step Q-learning, A3C]

Model performance issue #1

Closed · pavitrakumar78 closed this issue 7 years ago

pavitrakumar78 commented 7 years ago

Hi,

I am trying to train a Breakout model using your batch-a3c code. I am using a p2.xlarge instance (NVIDIA K80); it has been training for about 20-22 hours and has completed roughly 33m frames. From your graph, assuming each unit on the x-axis is 10m training steps, by about 25-30m steps it should have reached an average score of at least 30-50. But my model unfortunately barely gets a 10 - it's mostly still zeros. I am not sure why this happens.

I am using the same 80m total steps, and all the default params are unchanged.

These are the statistics at about ~34m frames:

Frames: 34300000; Policy_loss: -0.701535; Value_loss: 0.002712; Entropy: 0.345996; V-value_avg: 0.439

Also, I should mention that I am using 16 acting agents, so I believe the performance should be even better, correct? Other doubts that I have:

- Would it help to increase the queue length as the number of agents increases? Wouldn't this diversify the experiences the agent can learn from?

- During my training, I paused and resumed training once (at 6m steps). Since most of the code depends only on the network weights and the number of train steps (obtained from the input args), this shouldn't have had any effect on the training, right?

- The total training steps is the count kept in the learn_proc() method, right? Each actor process's train steps progress much slower than the learn_proc() thread - is this normal?

- There is no stopping condition, so I am assuming the training runs for 80m steps, as indicated by the message printed by LearningAgent's learn() method.

I will let the training continue for another 10m steps and see if there are any performance changes.

Thank you!

Grzego commented 7 years ago

Hi,

I found a small bug in the code that could cause some agents to get stuck (always choosing the same action) and therefore stop them from generating useful experience. So my guess is that in your case, after a while some, if not all, agents just got stuck. I pushed a fix already. I tested it for 6m iterations with 16 generating agents and was able to reach an average score of 10 points, and 30 points after 10m iterations.
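
For intuition, here is a minimal sketch of the kind of behaviour involved (not the actual fix, which is in the pushed commit): during training the actor should sample actions from the policy distribution, whereas a greedy argmax can collapse onto a single action and stop producing useful experience.

import numpy as np

def choose_action(policy_probs, explore=True):
    # policy_probs: softmax output of the policy head, shape (num_actions,)
    probs = np.asarray(policy_probs, dtype=np.float64)
    probs /= probs.sum()  # guard against numerical drift before sampling
    if explore:
        # Sampling from the distribution keeps the generated experience diverse.
        return int(np.random.choice(len(probs), p=probs))
    # Greedy argmax is fine for evaluation, but an actor that always does
    # this can get stuck on one action and starve the learner.
    return int(np.argmax(probs))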

I don't know the specs of a p2.xlarge instance, but ~33m frames in 20-22h is quite slow. On my machine (i5-6600K, GTX 1080) training for 80m frames took about 23h total, so that's strange.

Answering your questions.

I hope this helps. If you have more questions feel free to ask. :)

pavitrakumar78 commented 7 years ago

Hi,

Thank you for your reply!

I guess the version of the K80 that Amazon uses has significantly lower clock speeds than a GTX 1080 - that could probably explain why it's slow. But 80m frames in less than a day is blazing fast! Unfortunately I only have a GTX 660 on my desktop, so the K80 is the only way to go for me at the moment.

Thanks for the fix! I will try it again with 16 agents and will let you know how it performs.

Both machines I work with (PC and EC2 instance) have only a 4-core config. I will also try running with 4 agents and report the score later on.

Your answers really helped me! :)

Your Breakout model (the 91m-frame one) and all the other tests were trained using the default params in your code, right? (i.e. 4 processes, learning rate, batch size, etc.) - at least for A3C.

Grzego commented 7 years ago

In my case GPU usage was about 30-35% and CPU was always at 100%, so I think it should work on your desktop quite well.

Yes, I used the default parameters for the Breakout model and tests. :)

pavitrakumar78 commented 7 years ago

Hi,

Sorry for the late reply!

I have tried running the updated train.py for a3c and I am still getting the same results! :(

Just to confirm, my library versions are: Keras 1.1.2, Theano 0.8.2, numpy 1.11.2, scipy 0.18.1.

I ran the exact same file (train.py) with the default parameters (4 agents). I added some logging functions to it.

Here is the log info from the 0th actor thread (inside generate_experience_proc method): training_log_actor-0 (copy).txt

Here is the full log info from the learner thread: full_log.txt

I ran this on my PC (NVIDIA GTX 660), and it completed about 19m frames in 13-14 hours. Even after 19m frames the average is 1.XX! :( Shouldn't it be at least 10?

Why am I getting vastly lower performance with the same code on my computer?

Grzego commented 7 years ago

It looks like something is wrong elsewhere. I suspect this Theano warning might have something to do with it: "Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5."

Please do these four things:

After upgrading Theano to bleeding edge, run a simple convolutional network on MNIST to confirm it's learning.

My library versions: Keras 1.1.1, Theano 0.9.0.dev3, numpy 1.11.2, scipy 0.18.1.

I hope we will figure it out soon. :)

pavitrakumar78 commented 7 years ago

Hi,

Thank you for the suggestions!

- My action set is the default, i.e. env.action_space, so I will set it as:

from gym.spaces import Discrete

if args.game == 'Breakout-v0':
    action_space = Discrete(4)

(env.action_space for Breakout returns Discrete(6))
- Just upgraded it.

I will run the theano checks.
Edit: Just ran the theano checks.
Theano version: 0.9.0dev4
Using the CNN code from [here](http://deeplearning.net/tutorial/lenet.html), running on MNIST:

http://pastebin.com/x7wrau79

I will train for 15-20m steps with the changes I've made and let you know the results!

About the Theano warning: I have been testing another implementation of the original DQN using Lasagne (which uses only the Theano backend), and I had no problems learning Breakout and other games with it.

Grzego commented 7 years ago

Don't change action_space because it may break something in gym. You can try using env.get_action_meanings() to see what actions are available to the agent (in my case: ['NOOP', 'FIRE', 'RIGHT', 'LEFT']) and post it here.
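
For anyone following along, a quick way to check this (assuming a standard gym Atari install; on some versions you may need env.unwrapped.get_action_meanings()):

import gym

env = gym.make('Breakout-v0')
print(env.action_space)           # Discrete(4) or Discrete(6), depending on the gym/ALE build
print(env.get_action_meanings())  # e.g. ['NOOP', 'FIRE', 'RIGHT', 'LEFT', ...]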

The model architecture and file seem to be OK.

To make sure it's not a problem with Keras, try:

from keras.models import *
from keras.layers import *
from keras.datasets import mnist

x = Input(shape=(28, 28))
h = Reshape((1, 28, 28))(x)
h = Convolution2D(8, 3, 3, activation='relu')(h)
h = Convolution2D(8, 3, 3, activation='relu')(h)
h = Convolution2D(8, 3, 3, activation='relu')(h)
h = Flatten()(h)
y = Dense(10, activation='softmax')(h)

model = Model(x, y)
model.compile('rmsprop', 'sparse_categorical_crossentropy', metrics=['accuracy'])

(train_x, train_y), _ = mnist.load_data()
model.fit(train_x, train_y, nb_epoch=1)

Result:

Epoch 1/1
60000/60000 [==============================] - 4s - loss: 0.2332 - acc: 0.9416

I guess you should have about ~95% accuracy as well.

pavitrakumar78 commented 7 years ago

The action meanings in gym have always been like this for me (at least for Breakout):

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

So, in most of my experiments, I had to correct it to 4 or 3. I installed gym using pip install 'gym[all]' and the gym version is 0.7.1. How come your gym only returns 4 actions?

Also, there was a problem with the Keras code.

Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from keras.models import *
Using Theano backend.
Using gpu device 0: GeForce GTX 660 (CNMeM is disabled, cuDNN 5105)
>>> from keras.layers import *
>>> from keras.datasets import mnist
>>> x = Input(shape=(28, 28))
>>> h = Reshape((1, 28, 28))(x)
>>> h = Convolution2D(8, 3, 3, activation='relu')(h)
>>> h = Convolution2D(8, 3, 3, activation='relu')(h)
>>> h = Convolution2D(8, 3, 3, activation='relu')(h)
>>> h = Flatten()(h)
>>> y = Dense(10, activation='softmax')(h)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 491, in __call__
    self.build(input_shapes[0])
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 727, in build
    name='{}_W'.format(self.name))
  File "/usr/local/lib/python2.7/dist-packages/keras/initializations.py", line 60, in glorot_uniform
    return uniform(shape, s, name=name)
  File "/usr/local/lib/python2.7/dist-packages/keras/initializations.py", line 33, in uniform
    return K.random_uniform_variable(shape, -scale, scale, name=name)
  File "/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py", line 141, in random_uniform_variable
    return variable(np.random.uniform(low=low, high=high, size=shape),
  File "mtrand.pyx", line 1565, in mtrand.RandomState.uniform (numpy/random/mtrand/mtrand.c:17319)
OverflowError: Range exceeds valid bounds

But after referencing this, I added:

from keras import backend as K
K.set_image_dim_ordering('th')

and I was able to execute the code.

Result:

60000/60000 [==============================] - 9s - loss: 0.3006 - acc: 0.9402

Grzego commented 7 years ago

Maybe it's because of different ROMs and/or versions of gym (I'm using 0.5.6). Or maybe because right now I'm on Windows and I had to compile ALE and use lots of workarounds to make atari_py work at all. :)

So I think we found the main issue. You can either use:

from keras import backend as K
K.set_image_dim_ordering('th')

at the beginning of generate_experience_proc and learning_proc, or set it permanently in .keras/keras.json with "image_dim_ordering": "th", if I'm correct.
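
If you go the config-file route, a Keras 1.x ~/.keras/keras.json would look roughly like this (the other keys are the usual defaults):

{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}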

Give it a try and let me know if that solves the issue. :)

pavitrakumar78 commented 7 years ago

Yes. I will give it a try and let you know!

The 4 actions that you get seem a bit weird though. If you are using atari-py, it should return 6 actions as per the code here, but if you have compiled the original ALE here, it has 4 actions as you got. Maybe I need to update my .so files :) Edit: It is already a known issue. Thanks for helping to solve the problem! :)

Grzego commented 7 years ago

That's interesting. I wasn't aware of the difference between atari-py and ALE. I sort of compiled ALE and copied it into atari-py to make it work on Windows. :)

pavitrakumar78 commented 7 years ago

Hi,

I just let it run for another 14m frames, but again I have the same problem! :( The average is still 1.XX, so the image ordering does not seem to be the problem here.

Model after 14m frames: model-Breakout-v0-14000000.h5.zip
Score avg of actor 0: training_log_actor-0.txt
Code I used for training: train_updated_new -code.txt

Unfortunately I could not get the terminal output about loss/entropy/etc. - I closed the terminal before I could copy it. So I resumed from the 14m-frame checkpoint and let it run for a few minutes. Here is the output:

Frames: 14005000; Policy-Loss:  -0.427857; Avg:  -0.019142 --- Value-Loss:   0.000391; Avg:   0.199642 --- Entropy: 0.346546; Avg: 0.346539 --- V-value; Min:  0
Frames: 14010000; Policy-Loss:  -0.450229; Avg:  -0.013545 --- Value-Loss:   0.000488; Avg:   0.174907 --- Entropy: 0.346533; Avg: 0.346537 --- V-value; Min:  0
Frames: 14015000; Policy-Loss:  -0.478949; Avg:  -0.012407 --- Value-Loss:   0.000597; Avg:   0.176477 --- Entropy: 0.346534; Avg: 0.346531 --- V-value; Min:  0.210; Max:  0.218; Avg:  0.213
  7310> Best:    3; Avg:   1.00; Max:    3
  7318> Best:    3; Avg:   1.05; Max:    3
Frames: 14020000; Policy-Loss:  -1.874325; Avg:   0.470858 --- Value-Loss:   0.105481; Avg:   0.346848 --- Entropy: 0.346542; Avg: 0.346538 --- V-value; Min:  0.208; Max:  0.215; Avg:  0.211
  7315> Best:    4; Avg:   1.24; Max:    4
  7312> Best:    5; Avg:   1.35; Max:    5
Frames: 14025000; Policy-Loss:  -0.457242; Avg:   0.031370 --- Value-Loss:   0.000610; Avg:   0.175511 --- Entropy: 0.346525; Avg: 0.346534 --- V-value; Min:  0.211; Max:  0.216; Avg:  0.214

At this point, it would be really helpful if someone else could run it on their machine and check it!

Another question: have you only tested your code on Python 3.x? I am testing it only on 2.7, so maybe I should include from __future__ import division? Edit: I will run again with that change. Even though I don't think it makes any difference, there's no harm in testing it out.

Edit 2: Good news! I am seeing some learning after about 1.25m frames. The average score is now 2.XX (which I never saw in any of my previous runs), so I guess it was a Python version compatibility issue! That is, without the __future__ division import, the lr formula would have defaulted to 0.000001, since the other argument of the max always evaluates to 0 (1/2 evaluates to 0 on Python 2.7). Anyway, I will let it run for at least 10m frames and let you know the results.
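
To illustrate the pitfall (the exact annealing formula in train.py may differ; the constants below are just placeholders): on Python 2, dividing two ints truncates, so the decayed term is 0 for the whole run and max() always returns the floor value.

from __future__ import division  # needs to appear at the top of the module

total_steps = 80000000   # hypothetical values for illustration
minimal_lr  = 0.000001
initial_lr  = 0.0007
steps       = 1000000

decayed = (total_steps - steps) / total_steps * initial_lr
lr = max(decayed, minimal_lr)
# Without the __future__ import on Python 2.7, (total_steps - steps) / total_steps
# is integer division and equals 0, so decayed == 0.0 and lr stays at minimal_lr.
# With true division (or on Python 3), decayed shrinks linearly as intended.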

pavitrakumar78 commented 7 years ago

Hi,

Thought I would make a separate post for this.

I have trained for 10m frames and the average is now at 14-16! :)

Full log: full_log.txt (code: train_updated_new_v2.py)
Actor-0 score averages: training_log_actor-0.txt
10m train steps model: model-Breakout-v0-10000000.h5.zip

Running a test of 10 games after 10m frames:

Game #       1; Reward   18; 
Game #       2; Reward   15; 
Game #       3; Reward   20; 
Game #       4; Reward   18; 
Game #       5; Reward   18; 
Game #       6; Reward   15; 
Game #       7; Reward   30; 
Game #       8; Reward   25; 
Game #       9; Reward   10; 
Game #      10; Reward   17; 

Note: the above results were obtained using the code with a modified move space of 4 (which is in the pull request). I have created a few pull requests to make the code compatible with both Python 2.x and 3.x, plus a few other changes to account for Breakout's move space. Review them if you can! :)

Thank you for helping to resolve this issue! :)

Grzego commented 7 years ago

I'm glad it's finally working for you! :)

wulabs commented 7 years ago

Thank you guys. The __future__ division import helped me on Python 2.7 as well.

I can also confirm that action_space=4 vs 6 makes no difference; both give good results.

Tested on Keras v2, GPU with CUDA 8 / cuDNN 5110, on Ubuntu 16.04 and OSX 10.12.3.