DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Concatenating different segments of observation data in different layers within CNN + FC network #556

Closed akmandor closed 2 years ago

akmandor commented 3 years ago

Question

Let's say I have a 1-D observation vector with n (516) values. I would like to pass the first k (512) values through a CNN, then concatenate the CNN output with the remaining n - k (4) values and feed the result into a fully connected (FC) network.

Main question: What is the right way to implement this custom network within the stable-baselines3 architecture?

My approaches and side questions:

import gym
import torch as th
from torch import nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class CustomCombinedExtractor(BaseFeaturesExtractor):
    def __init__(self, observation_space: gym.spaces.Dict):
        # We do not know features-dim here before going over all the items,
        # so put something dummy for now. PyTorch requires calling
        # nn.Module.__init__ before adding modules
        super(CustomCombinedExtractor, self).__init__(observation_space, features_dim=1)

        n_channel_input1 = 1
        n_channel_output1 = 32

        n_channel_input2 = n_channel_output1
        n_channel_output2 = 32

        n_channel_input3 = n_channel_output2
        n_channel_output3 = 32

        self.cnn_net = nn.Sequential(
            nn.Conv1d(n_channel_input1, n_channel_output1, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input2, n_channel_output2, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input3, n_channel_output3, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute the flattened CNN output size by doing one forward pass
        # on a sample of the "obs1" sub-space (assumes obs1 has shape (n_channels, length))
        with th.no_grad():
            obs1_sample = th.as_tensor(observation_space.spaces["obs1"].sample()[None]).float()
            n_flatten = self.cnn_net(obs1_sample).shape[1] + observation_space.spaces["obs2"].shape[0]

        self.fc_net = nn.Sequential(
            nn.Linear(n_flatten, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU()
        )

        # Update the features dim manually
        self._features_dim = features_dim ???

    def forward(self, observations) -> th.Tensor:
        cnn_output = self.cnn_net(observations["obs1"])
        fc_input = th.cat((cnn_output, observations["obs2"]), dim=1)

        return self.fc_net(fc_input)
policy_kwargs = dict(features_extractor_class=CustomCombinedExtractor, features_extractor_kwargs=dict(features_dim=n_actions),)

model = PPO("MultiInputPolicy", env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, policy_kwargs=policy_kwargs, device="cuda", verbose=1)

Additional context

Using the guidelines in "Custom Policy Network" in the documentation, I implemented the following custom policy:

class Custom1DCNNPolicy(BaseFeaturesExtractor):

    def __init__(self, observation_space, features_dim: int = 128):
        super(Custom1DCNNPolicy, self).__init__(observation_space, features_dim)

        self.cnn_input_data_len = 512
        self.fc_input_extra_len = observation_space.shape[1] - self.cnn_input_data_len

        n_channel_input1 = 3
        n_channel_output1 = 32

        n_channel_input2 = n_channel_output1
        n_channel_output2 = 32

        n_channel_input3 = n_channel_output2
        n_channel_output3 = 32

        self.cnn = nn.Sequential(
            nn.Conv1d(n_channel_input1, n_channel_output1, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input2, n_channel_output2, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(n_channel_input3, n_channel_output3, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Compute shape by doing one forward pass
        with th.no_grad():
            observation_space_sample = observation_space.sample()[None]
            n_flatten = self.cnn(th.as_tensor(observation_space_sample[:, :, :self.cnn_input_data_len]).float()).shape[1] + self.fc_input_extra_len

        self.linear = nn.Sequential(
            nn.Linear(n_flatten, 100),
            nn.ReLU(),
            nn.Linear(100, features_dim),
            nn.ReLU()
        )

    def forward(self, observations: th.Tensor) -> th.Tensor:
        # First cnn_input_data_len values of every channel go through the CNN
        cnn_output = self.cnn(observations[:, :, :self.cnn_input_data_len])
        # The remaining values (taken from the last channel) are appended as-is
        second_data = observations[:, -1, self.cnn_input_data_len:]
        fc_input = th.cat((cnn_output, second_data), dim=1)

        return self.linear(fc_input)
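
As a quick sanity check of the tensor shapes, the extractor can be exercised in isolation on a dummy batch (a hypothetical snippet, not part of the experiments above; the Box shape (3, 516) is assumed from n_channel_input1 and cnn_input_data_len):

import numpy as np
import torch as th
from gym import spaces

# Assumed observation space: 3 channels x 516 values (512 for the CNN + 4 extra)
obs_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3, 516), dtype=np.float32)
extractor = Custom1DCNNPolicy(obs_space, features_dim=128)
dummy_batch = th.as_tensor(obs_space.sample()[None]).float()  # shape (1, 3, 516)
print(extractor(dummy_batch).shape)  # expected: torch.Size([1, 128])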

I set up my model using the Custom1DCNNPolicy as follows:

policy_kwargs = dict(features_extractor_class=Custom1DCNNPolicy, features_extractor_kwargs=dict(features_dim=n_actions),)

model = PPO("CnnPolicy", env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, policy_kwargs=policy_kwargs, device="cuda", verbose=1)

However, the network fails to learn the task, as shown in the following result plot:

training_result_cnn_fc_1

To check the validity of the observation data, I also trained using only the FC network; in that case training is successful, as shown in the following plot:

training_result_fc

I also tried training with different parameters (learning rate, channel input/output sizes, kernel sizes, etc.), but the results are very similar to the failing plot above.

training_result_cnn_fc_2 training_result_cnn_fc_3

Please also note that my desired network architecture has already been implemented in Stable Baselines (the TensorFlow version) using CnnPolicy, as in https://stable-baselines.readthedocs.io/en/master/misc/projects.html#train-a-ros-integrated-mobile-robot-differential-drive-to-avoid-dynamic-objects, with the custom policy class given below.

In this implementation example, the input observation is the laser scan data and the waypoints concatenated into a single 1-D vector. The first 3 layers are defined as 1-D convolutions, while layers 4 and 5 are FC. The laser-scan portion of the input observation is fed through the convolutional layers, and the output is then concatenated with the rest of the observation (the 1-D waypoint data) and fed into the 2 FC layers.

def laser_cnn_multi_input(state, **kwargs):
    """
    1D Conv Network
    :param state: (TensorFlow Tensor) state input placeholder
    :param kwargs: (dict) Extra keywords parameters for the convolutional layers of the CNN
    :return: (TensorFlow Tensor) The CNN output layer
    """
    # scan = tf.squeeze(state[:, : , 0:kwargs['laser_scan_len'] , :], axis=1)
    scan = tf.squeeze(state[:, : , 0:kwargs['laser_scan_len'] , :], axis=1)
    wps = tf.squeeze(state[:, :, kwargs['laser_scan_len']:, -1], axis=1)
    # goal = tf.math.multiply(goal, 6)

    kwargs_conv = {}
    activ = tf.nn.relu
    layer_1 = activ(conv1d(scan, 'c1d_1', n_filters=32, filter_size=5, stride=2, init_scale=np.sqrt(2), **kwargs_conv))
    layer_2 = activ(conv1d(layer_1, 'c1d_2', n_filters=64, filter_size=3, stride=2, init_scale=np.sqrt(2), **kwargs_conv))
    layer_2 = conv_to_fc(layer_2)
    layer_3 = activ(linear(layer_2, 'fc1', n_hidden=256, init_scale=np.sqrt(2)))
    temp = tf.concat([layer_3, wps], 1)
    layer_4 = activ(linear(temp, 'fc2', n_hidden=128, init_scale=np.sqrt(2)))
    return layer_4

class CNN1DPolicy_multi_input(common.FeedForwardPolicy):
    """
    This class provides a 1D convolutional network for the Raw Data Representation
    """
    def __init__(self, *args, **kwargs):
        kwargs["laser_scan_len"] = rospy.get_param("%s/rl_agent/scan_size"%NS, 360)
        super(CNN1DPolicy_multi_input, self).__init__(*args, **kwargs, cnn_extractor=laser_cnn_multi_input, feature_extraction="cnn")


Miffyli commented 3 years ago

MultiInputPolicy with a custom features extractor is what you are looking for, yes. You can specify how each of the observation keys is treated (e.g. a CNN for the "key1" observation, whose output is then concatenated with the "key2" observation). The docs have an example of how to do this: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html#multiple-inputs-and-dictionary-observations
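
For reference, a minimal sketch in the spirit of that docs example, adapted to the two keys used in your snippet (the key names, kernel sizes and channel counts are placeholders, not the exact code from the docs):

import gym
import torch as th
from torch import nn

from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class TwoKeyExtractor(BaseFeaturesExtractor):
    """Apply a 1-D CNN to obs["obs1"] and concatenate obs["obs2"] unchanged."""

    def __init__(self, observation_space: gym.spaces.Dict):
        # features_dim is a placeholder here; it is updated once the CNN output size is known
        super().__init__(observation_space, features_dim=1)

        n_channels = observation_space.spaces["obs1"].shape[0]
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=2, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )

        # Infer the flattened CNN output size with one dummy forward pass
        with th.no_grad():
            sample = th.as_tensor(observation_space.spaces["obs1"].sample()[None]).float()
            n_flatten = self.cnn(sample).shape[1]

        # Final feature size = CNN features + raw "obs2" values; this is also
        # what the "features_dim ???" line should be set to
        self._features_dim = n_flatten + observation_space.spaces["obs2"].shape[0]

    def forward(self, observations) -> th.Tensor:
        return th.cat([self.cnn(observations["obs1"]), observations["obs2"]], dim=1)

It would then be plugged in via policy_kwargs = dict(features_extractor_class=TwoKeyExtractor) and PPO("MultiInputPolicy", env, policy_kwargs=policy_kwargs, ...); the FC part (your fc_net) is usually left to net_arch rather than placed inside the extractor.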

akmandor commented 3 years ago

But it is not clear from the docs how to create the model with a custom MultiInputPolicy.

1) When I create the model as below:

policy_kwargs = dict(features_extractor_class=CustomCombinedExtractor, features_extractor_kwargs=dict(features_dim=n_actions),)
model = PPO("MultiInputPolicy", env)

I got the following error:

"Error: unknown policy type MultiInputPolicy,the only registed policy type are: ['MlpPolicy', 'CnnPolicy']!"

2) If, instead of "MultiInputPolicy", I pass the class name directly, as below:

model = PPO(CustomCombinedExtractor, env)

I get the following error:

Traceback (most recent call last):
  File ".../training.py", line 335, in <module>
    model = PPO(CustomCombinedExtractor, env, learning_rate=learning_rate, n_steps=n_steps, batch_size=batch_size, ent_coef=ent_coef, tensorboard_log=tensorboard_log_path, device="cuda", verbose=1)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/ppo/ppo.py", line 95, in __init__
    super(PPO, self).__init__(
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 76, in __init__
    super(OnPolicyAlgorithm, self).__init__(
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 156, in __init__
    env = self._wrap_env(env, self.verbose, monitor_wrapper)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/base_class.py", line 209, in _wrap_env
    env = ObsDictWrapper(env)
  File "/home/akmandor/.local/lib/python3.8/site-packages/stable_baselines3/common/vec_env/obs_dict_wrapper.py", line 28, in __init__
    self.obs_dim = venv.observation_space.spaces["observation"].shape[0]
KeyError: 'observation'

araffin commented 3 years ago

You need to upgrade your SB3 version. Please format your code using markdown, as shown in the issue template.
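
For context, the ObsDictWrapper seen in the traceback belongs to older SB3 releases; dict observation spaces and MultiInputPolicy were only added in a later release (1.1.0, if the changelog is read correctly). A minimal way to check the installed version after upgrading with pip install --upgrade stable-baselines3:

import stable_baselines3

# Dict observations / "MultiInputPolicy" require a sufficiently recent SB3 release
print(stable_baselines3.__version__)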