Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Terrible training results for continuous state space + 32x32 grayscale input #198

Closed MarcoMeter closed 6 years ago

MarcoMeter commented 6 years ago

Hello everybody,

I've got a very trivial environment for trying out image input in addition to the continuous state space.

The training works really well if the image is not added to the state space. While training with the image input, it seems that the agent gets stuck in corners. Most of the time (like 99%) no rewards are collected within an episode. It feels like the agent doesn't really explore the environment.

Any thoughts?

awjuliani commented 6 years ago

Hi @MarcoMeter,

The combination of continuous control and camera input is actually much more difficult to train than camera and discrete, or continuous and a state vector. It may be the case that our PPO implementation is not robust enough to train the situation you have (even though it is relatively simple). My recommendation would be to try using a large buffer and batch size, and a small learning rate.

Very interested to hear if you are able to get it working, though.

MarcoMeter commented 6 years ago

Thanks for your reply @awjuliani

I got better results training a more complex environment: Ball Labyrinth. The agent manages to get to lesson 3, but it's pretty unstable; sometimes it won't even accomplish the first lesson.

MarcoMeter commented 6 years ago

I ran a few sessions.

I did 8 sessions with these parameters:

batch size = 128
buffer size = 2048
hidden layers = 2
hidden units = 64
learning rate = 1e-4

The first three sessions converged after 80k, 100k and 190k steps. The other five sessions didn't converge (stopped after 200k steps).

Then I tried fewer hidden nodes, a smaller learning rate and a bigger buffer size. Occasionally a good behavior eventually converges, but if I rerun the training a few times using the same parameters, most runs don't end up well. So the training is very unstable.

edit: Using 3 or 9 agents doesn't really make a difference.

MarcoMeter commented 6 years ago

I started to examine the camera's rendered textures in Brain.cs:395. Most of the (32x32) textures don't show the red agent, because the agent is too small. I increased the texture size to 48x48 to make sure that the agent is featured within that representation. This resolution is still too small to represent the agent and its target reliably, because sometimes the agent is covered by just 1, 2 or 4 pixels. The target is a little bigger, but the same inconsistency is present. Of course it's only logical that detail is lost at such a low scale, but this might be the major driver of the unstable training results.

Edit:

Scaling the observation up to 64x64 and using these hyperparameters:

beta = 1e-2
batch size = 192
buffer size = 3072
hidden layers = 2
hidden units = 64
learning rate = 1e-4

the training results are pretty good. Runs usually don't need more than 100k steps to converge.

Edit2:

I'm trying to get proper results without including the direction vector in the input space. So far the training does not go well.

awjuliani commented 6 years ago

Thanks for sharing the results, @MarcoMeter! It is nice to hear that you were able to train an image-based network with continuous control (at least sometimes). Needing a larger image makes complete sense. Please keep sharing, as it will be helpful for others in the future, including us at Unity.

MarcoMeter commented 6 years ago

I'm still trying to make it work for 64x64 greyscale plus the velocity only. Before, the input space included the direction vector to the target and the distance to it.

MarcoMeter commented 6 years ago

@awjuliani

Running several different training sessions for millions of steps, I observe the same set of mean rewards and standard deviations.

Step: 20000. Mean Reward: 0.4166666666666667. Std of Reward: 0.6400954789890506.
Step: 40000. Mean Reward: 0.16666666666666666. Std of Reward: 0.372677996249965.
Step: 60000. Mean Reward: 0.3333333333333333. Std of Reward: 0.4714045207910317.
Step: 80000. Mean Reward: 0.4166666666666667. Std of Reward: 0.6400954789890506.
Step: 100000. Mean Reward: 0.08333333333333333. Std of Reward: 0.2763853991962833.
Step: 120000. Mean Reward: 0.8333333333333334. Std of Reward: 0.6871842709362768.
Step: 140000. Mean Reward: 0.3333333333333333. Std of Reward: 0.4714045207910317.
Step: 160000. Mean Reward: 0.16666666666666666. Std of Reward: 0.3726779962499649.
Step: 180000. Mean Reward: 0.25. Std of Reward: 0.4330127018922193.
Step: 200000. Mean Reward: 0.5. Std of Reward: 0.6454972243679028.
Step: 220000. Mean Reward: 0.6666666666666666. Std of Reward: 0.7453559924999298.
Step: 240000. Mean Reward: 0.3333333333333333. Std of Reward: 0.7453559924999298.

I'm wondering if there is a bug inside the image processing of the python code. The observations captured inside of Unity look alright.

I tried to serialize the processed image using scipy.misc.imsave() within _process_pixels() (environment.py:176). For some reason the numpy array s causes an exception inside imsave() (I guess imsave expects a different shape). Do you know of a way to check out the observation input before it is passed to the neural net for training?

edit: Same behavior applies to discrete actions, too.

edit2: The RGB image looks right. I haven't found a proper function to plot the single-channel greyscale version.
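
A minimal sketch of a possible workaround (assuming the observation arrives as a (32, 32, 1) numpy array, since scipy.misc.imsave expects a 2-D or (H, W, 3) array):

import numpy as np
import scipy.misc  # scipy.misc.imsave is removed in newer SciPy releases

def save_observation(obs, path="observation.png"):
    """Squeeze the trailing channel axis so imsave receives a 2-D greyscale array."""
    scipy.misc.imsave(path, np.squeeze(obs))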

kwea123 commented 6 years ago

I have a simple autonomous car that learns extremely well. You can check my code here https://github.com/kwea123/RL/blob/master/ai/unity_test/autocar/autocar.ipynb and the video https://www.youtube.com/watch?v=pHsxddQF0Tc

In short, to answer your question: to visualize the camera input, use a Jupyter notebook, load the environment, reset it, and then inspect the observation:

# Old (v0.2/v0.3) Python API
import matplotlib.pyplot as plt
from unityagents import UnityEnvironment

env_name = "autocar"
env = UnityEnvironment(file_name=env_name)
default_brain = env.brain_names[0]

env_info = env.reset(train_mode=False)[default_brain]

plt.imshow(env_info.observations[0][0])
plt.show()

Note: this is the old version; the environment-loading API seems to have changed in the new version, but I guess env_info.observations[0][0] should still be the same.

By the way I have the velocity state concatenated to the image state, so that shouldn't be a problem.

MarcoMeter commented 6 years ago

Hi @kwea123 thanks for sharing!

I have the velocity state concatenated to the image state

Did you add the velocity to the continuous state space or did you do it differently?

Based on your code, you are feeding the edges of the observation to the neural net, right?

kwea123 commented 6 years ago

Yes, this is what the states look like: [image: edge-detected camera states]

I use OpenCV to detect the edges of the camera image. The edge image (black and white) of size 16x40 = 640 is then flattened, concatenated with the velocity state (so an input of size 1x641), and fed to the neural net.
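
Roughly, that preprocessing could look like the following (a simplified sketch, not my exact notebook code; the Canny thresholds and the resize are placeholders):

import cv2
import numpy as np

def build_state(frame_rgb, velocity):
    """Flatten a black-and-white edge image and append the velocity (sketch with placeholder sizes)."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)            # binary edge image
    edges = cv2.resize(edges, (40, 16)) / 255.0  # 16x40 = 640 values scaled to [0, 1]
    return np.concatenate([edges.flatten(), np.atleast_1d(velocity)])  # 640 + 1 = 641 inputs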

In your game I think you need the color as well, and it's probably better to use a CNN first to extract the features of the image before concatenating with other non-image states (rotation, speed, etc). I didn't use CNN simply because my PC runs so slow... and surprisingly using a flattened input worked so well, I think maybe it's because the input is so easy (black and white only).

MarcoMeter commented 6 years ago

In order to plot the greyscale image I had to discard the last dimension.

import numpy as np
import matplotlib.pyplot as plt

# Drop the trailing channel dimension so imshow gets a 2-D array
plt.imshow(np.squeeze(info.observations[0][0]))
# Alternative: plt.imshow(info.observations[0][0][..., 0])
plt.show()

[image: plotted greyscale observation]

This is what the result looks like (it doesn't really look like greyscale). Just to mention: I increased the scale of the target and the agent, so 32x32 is detailed enough now. Also, I constrained the environment to a square (before: a 16:9 rectangle).

I guess I have to dive deeper into the PPO implementation and CNNs in order to get closer to reasonable training results.

kwea123 commented 6 years ago

What do you mean by

In order to plot the greyscale image I had to discard the last dimension.

I ask because the image comes with 3 channels (RGB), and from your code info.observations[0][0][..., 0] it seems that you extract the first channel (the R channel) instead of discarding it.

Anyway, if you want a greyscale image, I'd suggest using OpenCV to convert it. Here's a Stack Overflow question with many approaches: https://stackoverflow.com/questions/12201577/how-can-i-convert-an-rgb-image-into-grayscale-in-python but I personally use OpenCV and do gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).

When you plot a greyscale image with plt, you must specify the color map cmap: plt.imshow(img, cmap='gray').
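
Putting both steps together, a minimal sketch (the random image is just a stand-in for a real observation):

import cv2
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a real RGB observation
image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)

gray_image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)  # use COLOR_BGR2GRAY for images loaded with cv2.imread
plt.imshow(gray_image, cmap='gray')                   # without cmap='gray', matplotlib applies a color map
plt.show()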

It's a good idea to see how you can improve your model by looking into their PPO implementation, especially the image (observation) part: do they use a CNN? How many layers? Answering these questions might help you improve your model!

Finally I'll leave my work with a detailed description: https://github.com/kwea123/RL/tree/master/ai/unity_test/autocar

Good luck!

MarcoMeter commented 6 years ago

This repository's PPO implementation outputs a numpy array with shape (32, 32, 1), so I had to ditch the last dimension to plot the observation.

Just added the cmap argument. [image: observation plotted with cmap='gray']

The CNN implementation of this repository's model is found in models.py:76.

def create_visual_encoder(self, o_size_h, o_size_w, bw, h_size, num_streams, activation, num_layers):
    """
    Builds a set of visual (CNN) encoders.
    :param o_size_h: Height observation size.
    :param o_size_w: Width observation size.
    :param bw: Whether image is greyscale {True} or color {False}.
    :param h_size: Hidden layer size.
    :param num_streams: Number of visual streams to construct.
    :param activation: What type of activation function to use for layers.
    :param num_layers: Number of hidden layers appended after the convolutions.
    :return: List of hidden layer tensors.
    """
    if bw:
        c_channels = 1
    else:
        c_channels = 3

    self.observation_in = tf.placeholder(shape=[None, o_size_h, o_size_w, c_channels], dtype=tf.float32,
                                         name='observation_0')
    streams = []
    for i in range(num_streams):
        self.conv1 = tf.layers.conv2d(self.observation_in, 16, kernel_size=[8, 8], strides=[4, 4],
                                      use_bias=False, activation=activation)
        self.conv2 = tf.layers.conv2d(self.conv1, 32, kernel_size=[4, 4], strides=[2, 2],
                                      use_bias=False, activation=activation)
        hidden = c_layers.flatten(self.conv2)
        for j in range(num_layers):
            hidden = tf.layers.dense(hidden, h_size, use_bias=False, activation=activation)
        streams.append(hidden)
    return streams

kwea123 commented 6 years ago

Oh, you checked the "Black and White" box for the camera? Then that's good. OK, they use a CNN, so now just experiment with different hyperparameters. Personally, I would first try a LeNet-like architecture because it's the simplest.

Several things I would do to modify this structure:

  1. decrease the stride to at most 2
  2. add a max pooling layer after each conv2d
  3. set use_bias=True

You can test your own variations; a rough sketch of these changes follows below.
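
A rough TF1-style sketch of those three changes (the kernel sizes and filter counts are placeholders, not a tested configuration):

import tensorflow as tf

def lenet_like_encoder(observation_in, activation=tf.nn.elu):
    """Conv layers with stride 1, max pooling after each, and biases enabled (sketch)."""
    conv1 = tf.layers.conv2d(observation_in, 16, kernel_size=[5, 5], strides=[1, 1],
                             padding="same", use_bias=True, activation=activation)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=[2, 2], strides=2)
    conv2 = tf.layers.conv2d(pool1, 32, kernel_size=[5, 5], strides=[1, 1],
                             padding="same", use_bias=True, activation=activation)
    pool2 = tf.layers.max_pooling2d(conv2, pool_size=[2, 2], strides=2)
    return tf.layers.flatten(pool2)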

MarcoMeter commented 6 years ago

Thanks a lot for your recommendations. I'll do some exercises with CNNs and TensorFlow to get a better understanding.

I'll get back to this issue once I make progress or gain more insights.

MarcoMeter commented 6 years ago

This is basically how I set up the conv and pooling layers:

self.conv1 = tf.layers.conv2d(self.observation_in, 16, kernel_size=[6, 6], padding="same", strides=[2, 2],
                              use_bias=True, activation=activation)
self.pool1 = tf.layers.max_pooling2d(inputs=self.conv1, pool_size=[2, 2], strides=2)
self.conv2 = tf.layers.conv2d(self.pool1, 32, kernel_size=[4, 4], padding="same", strides=[2, 2],
                              use_bias=True, activation=activation)
self.pool2 = tf.layers.max_pooling2d(inputs=self.conv2, pool_size=[2, 2], strides=2)
hidden = c_layers.flatten(self.pool2)

However, I varied the number of filters, their sizes, the strides and the pooling layers, and did not achieve a suitable result. I also tried different batch sizes, buffer sizes and learning rates.

There is one strange observation though. If the input space consists of the direction vector to the target and the distance to it, the agent easily learns a behavior within 80k-100k steps, no matter how I set up the conv and pooling layers. Just to recall: the training doesn't work out if the agent has to learn from the visual observation and its velocity alone.

My next step is to look into the new implementation of the GridWorld example. Maybe that helps to find a solution.

Any thoughts on that @awjuliani and @kwea123 ?

kwea123 commented 6 years ago

At first sight, I think a simple neural net like the one you described can solve the problem, because the problem itself is simple as well. I also don't think tuning hyperparameters is essential.

Is the state only the image? Or do you have other non-image states (speed, distance, etc)? How are they combined with the image state?

I can create a similar environment and test on my PC later

edit: Yeah, if the state is only the distance and the direction, it should be much simpler.

MarcoMeter commented 6 years ago

I can create a similar environment and test on my PC later

You can find the environment here.

Is the state only the image?

As frame stacking is not included yet, the velocity of the agent is added to the state.

How are they combined with the image state?

It looks like this is done at models.py:227.
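
For reference, a minimal sketch of that kind of combination (not the repo's exact code at models.py:227; the placeholder sizes are made up):

import tensorflow as tf

# Placeholder sizes are illustrative only
visual_stream = tf.placeholder(shape=[None, 256], dtype=tf.float32, name="flattened_cnn_output")
vector_in = tf.placeholder(shape=[None, 2], dtype=tf.float32, name="vector_observation")  # e.g. the agent's velocity

# Concatenate along the feature axis and feed the result into the policy's hidden layers
hidden = tf.concat([visual_stream, vector_in], axis=1)
hidden = tf.layers.dense(hidden, 64, activation=tf.nn.elu)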

edit: I also tried using 4 discrete actions instead of 2 continuous ones.

saporter commented 6 years ago

@MarcoMeter - your examples are super helpful. Thank you! Since it looks like you have some stable training runs, maybe you can answer something related to this topic:

Have you found that the timescale does not affect the observation samples being taken for the agent? My worry is that setting the timescale to 100, without a framerate that can keep up, would feed the agent outdated visual information.

But this seemed not to affect you? I noticed you set targetFrameRate = -1 (the platform default)...

(Happy to start a new thread if this is too off topic here)

MarcoMeter commented 6 years ago

Hi @saporter, I'm pretty sure this is not a problem, because each camera component that is referenced for an observation is explicitly invoked to render. See Brain.cs:420.

saporter commented 6 years ago

Ah, thank you. I guess I was confused by the docs' recommendation in "Making a New Learning Environment", where it suggests increasing the frame rate for agents using observations. But the training I'm running now seems to be working just fine! Thanks again.

vladimiroster commented 6 years ago

Hi @MarcoMeter, this thread has been inactive for a while. Are you still having trouble? I'll close this issue, but feel free to reopen it if you need more help with this.

MarcoMeter commented 6 years ago

I did work on this issue during the past week and still couldn't get the environment working using a camera observation. Basically, I tried new hyperparameters and ml-agents version 0.3. Also, I rotated the target by 45° so that it looks different.

btw: Can't reopen.

vladimiroster commented 6 years ago

Excellent. I've reopened the issue for you.

MarcoMeter commented 6 years ago

Excellent. I've reopened the issue for you.

Thanks.

So this is what I tried today:

I might adjust the CNN setup.

Does anybody know of another ml-agents example (besides GridWorld) that successfully utilizes a camera observation?

[screenshot]

vladimiroster commented 6 years ago

There are a couple more that use optional camera observations. Feel free to take a look at the Banana Collector and the Hallway environments.

mmattar commented 6 years ago

Hi @MarcoMeter - let us know if you have any additional questions here. Thanks.

MarcoMeter commented 6 years ago

I still wasn't able to solve this "low scale" environment. The other environments besides GridWorld do not purely rely on image input.

I'm wondering if the CNN topology is not sufficient.

It would be nice if someone could share a few thoughts on approaching this.

MarcoMeter commented 6 years ago

One thing is really suspicious. Whenever the agent is supposed to be trained on visual observations only, it tends to move only right or only left. So during training, the agent gets stuck against either the right or the left wall. The same behavior occurs when setting epsilon to 0.5.

Just for clarification: for the visual observation training I'm using discrete actions, and I added a max pooling layer after each convolution layer.

MarcoMeter commented 6 years ago

@vincentpierre Thanks for the patch.

The patch included adjusted hyperparameters, a small punishment for each step (-0.001), and it terminates the episode once the target is reached. I'll update my repo accordingly soon.

After all, this issue can finally be closed.

[image: training reward plot]

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.