DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Concatenating different segments of observation data in different layers within CNN + FC network #556

Closed akmandor closed 2 years ago

akmandor commented 3 years ago

Question

Let's say I have a 1-D observation vector with n (516) values. I would like to pass the first k (512) values through a CNN, then concatenate the CNN output with the remaining n - k (4) values and feed the result into a fully connected (FC) network.

Main question: What is the right way to implement this custom network within the stable-baselines3 architecture?

My approaches and side questions:

import gym
import torch as th
from torch import nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCombinedExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Dict):
        # We do not know features-dim here before going over all the items,
        # so put something dummy for now. PyTorch requires calling
        # nn.Module.__init__ before adding modules
        super(CustomCombinedExtractor, self).__init__(observation_space, features_dim=1)

        n_channel_input1 = 1
        n_channel_output1 = 32

        n_channel_input2 = n_channel_output1
        n_channel_output2 = 32

        n_channel_input3 = n_channel_output2
        n_channel_output3 = 32

        self.cnn_net = nn.Sequential(
            nn.Conv1d(n_channel_input1, n_channel_output1, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input2, n_channel_output2, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input3, n_channel_output3, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute the flattened CNN output size by doing one forward pass
        # on a sample of the "obs1" sub-space (assumes obs1 has shape (n_channels, length))
        with th.no_grad():
            obs1_sample = th.as_tensor(observation_space.spaces["obs1"].sample()[None]).float()
            n_flatten = self.cnn_net(obs1_sample).shape[1] + observation_space.spaces["obs2"].shape[0]

        self.fc_net = nn.Sequential(
            nn.Linear(n_flatten, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU()
        )

        # Update the features dim manually
        self._features_dim = features_dim ???

    def forward(self, observations) -> th.Tensor:
        cnn_output = self.cnn_net(observations["obs1"])
        fc_input = th.cat((cnn_output, observations["obs2"]), dim=1)

        return self.fc_net(fc_input)
policy_kwargs = dict(features_extractor_class=CustomCombinedExtractor, features_extractor_kwargs=dict(features_dim=n_actions),)

model = PPO("MultiInputPolicy", env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, policy_kwargs=policy_kwargs, device="cuda", verbose=1)

Additional context

Using the guidelines in "Custom Policy Network" in the documentation, I implemented the following custom policy:

class Custom1DCNNPolicy(BaseFeaturesExtractor):

    def __init__(self, observation_space, features_dim: int = 128):
        super(Custom1DCNNPolicy, self).__init__(observation_space, features_dim)

        self.cnn_input_data_len = 512
        self.fc_input_extra_len = observation_space.shape[1] - self.cnn_input_data_len

        n_channel_input1 = 3
        n_channel_output1 = 32

        n_channel_input2 = n_channel_output1
        n_channel_output2 = 32

        n_channel_input3 = n_channel_output2
        n_channel_output3 = 32

        self.cnn = nn.Sequential(
            nn.Conv1d(n_channel_input1, n_channel_output1, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input2, n_channel_output2, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input3, n_channel_output3, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            observation_space_sample = observation_space.sample()[None]
            n_flatten = self.cnn(th.as_tensor(observation_space_sample[:, :, :self.cnn_input_data_len]).float()).shape[1] + self.fc_input_extra_len

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 100),
            nn.ReLU(),
            nn.Linear(100, features_dim),
            nn.ReLU()
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # First cnn_input_data_len values of every channel go through the CNN
        cnn_output = self.cnn(observations[:, :, :self.cnn_input_data_len])
        # The remaining values (taken from the last channel) are appended as-is
        second_data = observations[:, -1, self.cnn_input_data_len:]
        fc_input = th.cat((cnn_output, second_data), dim=1)

        return self.linear(fc_input)
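
As a quick sanity check of the tensor shapes, the extractor can be exercised in isolation on a dummy batch (a hypothetical snippet, not part of the experiments above; the Box shape (3, 516) is assumed from n_channel_input1 and cnn_input_data_len):

import numpy as np
import torch as th
from gym import spaces

# Assumed observation space: 3 channels x 516 values (512 for the CNN + 4 extra)
obs_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3, 516), dtype=np.float32)
extractor = Custom1DCNNPolicy(obs_space, features_dim=128)
dummy_batch = th.as_tensor(obs_space.sample()[None]).float()  # shape (1, 3, 516)
print(extractor(dummy_batch).shape)  # expected: torch.Size([1, 128])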

I set up my model using the Custom1DCNNPolicy as follows:

policy_kwargs = dict(features_extractor_class=Custom1DCNNPolicy, features_extractor_kwargs=dict(features_dim=n_actions),)

model = PPO("CnnPolicy", env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, policy_kwargs=policy_kwargs, device="cuda", verbose=1)

However, the network fails to learn the task, as shown in the following result plot:

training_result_cnn_fc_1

To check the validity of the observation data, I also trained using only the FC network; in that case training is successful, as shown in the following plot:

training_result_fc

I also tried training with different parameters (learning rate, channel input/output sizes, kernel sizes, etc.), but the results are very similar to the failing plot above.

training_result_cnn_fc_2 training_result_cnn_fc_3

Please also note that my desired network architecture has already been implemented in Stable Baselines (the TensorFlow version) using CnnPolicy, as in https://stable-baselines.readthedocs.io/en/master/misc/projects.html#train-a-ros-integrated-mobile-robot-differential-drive-to-avoid-dynamic-objects, with the custom policy class given below.

In this implementation example, the input observation is the laser scan data and the waypoints concatenated into a single 1-D vector. The first 3 layers are defined as 1-D convolutions, while layers 4 and 5 are FC. The laser-scan portion of the input observation is fed through the convolutional layers, and the output is then concatenated with the rest of the observation (the 1-D waypoint data) and fed into the 2 FC layers.

def laser_cnn_multi_input(state, **kwargs):
    """
    1D Conv Network
    :param state: (TensorFlow Tensor) state input placeholder
    :param kwargs: (dict) Extra keywords parameters for the convolutional layers of the CNN
    :return: (TensorFlow Tensor) The CNN output layer
    """
    # scan = tf.squeeze(state[:, : , 0:kwargs['laser_scan_len'] , :], axis=1)
    scan = tf.squeeze(state[:, : , 0:kwargs['laser_scan_len'] , :], axis=1)
    wps = tf.squeeze(state[:, :, kwargs['laser_scan_len']:, -1], axis=1)
    # goal = tf.math.multiply(goal, 6)

    kwargs_conv = {}
    activ = tf.nn.relu
    layer_1 = activ(conv1d(scan, 'c1d_1', n_filters=32, filter_size=5, stride=2, init_scale=np.sqrt(2), **kwargs_conv))
    layer_2 = activ(conv1d(layer_1, 'c1d_2', n_filters=64, filter_size=3, stride=2, init_scale=np.sqrt(2), **kwargs_conv))
    layer_2 = conv_to_fc(layer_2)
    layer_3 = activ(linear(layer_2, 'fc1', n_hidden=256, init_scale=np.sqrt(2)))
    temp = tf.concat([layer_3, wps], 1)
    layer_4 = activ(linear(temp, 'fc2', n_hidden=128, init_scale=np.sqrt(2)))
    return layer_4

class CNN1DPolicy_multi_input(common.FeedForwardPolicy):
    """
    This class provides a 1D convolutional network for the Raw Data Representation
    """
    def __init__(self, *args, **kwargs):
        kwargs["laser_scan_len"] = rospy.get_param("%s/rl_agent/scan_size"%NS, 360)
        super(CNN1DPolicy_multi_input, self).__init__(*args, **kwargs, cnn_extractor=laser_cnn_multi_input, feature_extraction="cnn")


Miffyli commented 3 years ago

MultiInputPolicy with a custom features extractor is what you are looking for, yes. You can specify how each of the observation keys is treated (e.g. a CNN for the "key1" observation, whose output is then concatenated with the "key2" observation). The docs have an example of how to do this: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#multiple-inputs-and-dictionary-observations
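
For reference, a minimal sketch in the spirit of that docs example, adapted to the two keys used in your snippet (the key names, kernel sizes and channel counts are placeholders, not the exact code from the docs):

import gym
import torch as th
from torch import nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class TwoKeyExtractor(BaseFeaturesExtractor):
    """Apply a 1-D CNN to obs["obs1"] and concatenate obs["obs2"] unchanged."""

    def __init__(self, observation_space: gym.spaces.Dict):
        # features_dim is a placeholder here; it is updated once the CNN output size is known
        super().__init__(observation_space, features_dim=1)

        n_channels = observation_space.spaces["obs1"].shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Infer the flattened CNN output size with one dummy forward pass
        with th.no_grad():
            sample = th.as_tensor(observation_space.spaces["obs1"].sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]

        # Final feature size = CNN features + raw "obs2" values; this is also
        # what the "features_dim ???" line should be set to
        self._features_dim = n_flatten + observation_space.spaces["obs2"].shape[0]

    def forward(self, observations) -> th.Tensor:
        return th.cat([self.cnn(observations["obs1"]), observations["obs2"]], dim=1)

It would then be plugged in via policy_kwargs = dict(features_extractor_class=TwoKeyExtractor) and PPO("MultiInputPolicy", env, policy_kwargs=policy_kwargs, ...); the FC part (your fc_net) is usually left to net_arch rather than placed inside the extractor.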

akmandor commented 3 years ago

But it is not clear from the docs how to create the model with a custom MultiInputPolicy.

1) When I create the model as below:

policy_kwargs = dict(features_extractor_class=CustomCombinedExtractor, features_extractor_kwargs=dict(features_dim=n_actions),)
model = PPO("MultiInputPolicy", env)

I got the following error:

"Error: unknown policy type MultiInputPolicy,the only registed policy type are: ['MlpPolicy', 'CnnPolicy']!"

2) If, instead of "MultiInputPolicy", I pass the class name directly, as below:

model = PPO(CustomCombinedExtractor, env)

I get the following error:

Traceback (most recent call last):
  File ".../training.py", line 335, in <module>
    model = PPO(CustomCombinedExtractor, env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, device="cuda", verbose=1)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py", line 95, in __init__
    super(PPO, self).__init__(
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__
    super(OnPolicyAlgorithm, self).__init__(
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__
    env = self._wrap_env(env, self.verbose, monitor_wrapper)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env
    env = ObsDictWrapper(env)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 28, in __init__
    self.obs_dim = venv.observation_space.spaces["observation"].shape[0]
KeyError: 'observation'

araffin commented 3 years ago

You need to upgrade your SB3 version. Please format your code using markdown, as shown in the issue template.
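
For context, the ObsDictWrapper seen in the traceback belongs to older SB3 releases; dict observation spaces and MultiInputPolicy were only added in a later release (1.1.0, if the changelog is read correctly). A minimal way to check the installed version after upgrading with pip install --upgrade stable-baselines3:

import stable_baselines3

# Dict observations / "MultiInputPolicy" require a sufficiently recent SB3 release
print(stable_baselines3.__version__)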