PacktPublishing / Deep-Reinforcement-Learning-Hands-On-Second-Edition

Deep-Reinforcement-Learning-Hands-On-Second-Edition, published by Packt
MIT License

How to adapt validation_run from Ch10 for Categorical DQN? #54

Open PeterSenyszyn opened 2 years ago

PeterSenyszyn commented 2 years ago

Hello, I'm trying to adapt the examples from Ch 8 and Ch 10 of the book into a double, dueling, categorical DQN architecture using the Conv1d model from Ch 10. Training seems to work fine using ptan and PyTorch Ignite. I also want to run validation using OpenAI Gym, so I was wondering how to determine the next action for a new observation batch.

My understanding is that for the normal dueling/double Q Conv1d network, we run a forward pass of the observation through the trained network to get the Q-values, then take the argmax over them to find action_idx. For the categorical architecture, however, the book states that a forward pass "returns the predicted probability distribution as a 3D tensor (batch, actions, and supports)." For a bar size of 10, I can see that my output is a tensor of shape (1, 3, 51). But the values along dim=1 look like various weights, not integers.

What additional steps do I need to take to get the next action to pass to the OpenAI Gym environment? Thanks in advance, and I'm happy to post more code if needed.
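
For reference, the step from a categorical output to an action is usually written as an expectation over the support atoms. The sketch below uses symbols introduced here for illustration (they do not appear in the post): $z_i$ is the $i$-th support atom, $\ell_i(s,a)$ is the raw network output for action $a$ and atom $i$, and $N$ is the number of atoms (51 for the (1, 3, 51) output above).

```latex
p_i(s,a) = \frac{\exp \ell_i(s,a)}{\sum_{j=0}^{N-1} \exp \ell_j(s,a)},
\qquad
Q(s,a) = \sum_{i=0}^{N-1} z_i \, p_i(s,a),
\qquad
a^{*} = \arg\max_{a} Q(s,a)
```

Under this reading, the softmax runs over the atom dimension (the last one), the weighted sum collapses that dimension into one Q-value per action, and the argmax over the action dimension yields the integer action index.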

My model:

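import numpy as np
import torch
import torch.nn as nn

# NoisyFactorizedLinear, N_ATOMS, Vmin, Vmax and DELTA_Z are defined elsewhere (not shown in this post).


class PlatformDQNDistr(nn.Module):
    def __init__(self, input_shape, n_actions):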
        super(PlatformDQNDistr, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(input_shape[0], 128, 5),
            nn.ReLU(),
            nn.Conv1d(128, 128, 5),
            nn.ReLU(),
        )
        conv_out_size = self._get_conv_out(input_shape)

        # We use noisy networks rather than epsilon-greedy action selection for exploration
        self.fc_val = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            NoisyFactorizedLinear(conv_out_size, 512),
            nn.ReLU(),
            NoisyFactorizedLinear(512, n_actions * N_ATOMS)
        )
        sups = torch.arange(Vmin, Vmax + DELTA_Z, DELTA_Z)
        self.register_buffer("supports", sups)
        self.softmax = nn.Softmax(dim=1)

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

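    def forward(self, x):
        batch_size = x.size()[0]
        conv_out = self.conv(x).view(batch_size, -1)  # flatten conv features per sample
        val = self.fc_val(conv_out)  # (batch, 1)
        adv = self.fc_adv(conv_out)  # (batch, n_actions * N_ATOMS)
        # Combine the streams, then reshape to (batch, n_actions, N_ATOMS) logits.
        # val and the mean are one scalar per sample, i.e. a constant shift of every logit.
        return (val + adv - adv.mean(dim=1, keepdim=True)).view(batch_size, -1, N_ATOMS)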

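    def both(self, x):
        cat_out = self(x)  # raw logits, (batch, n_actions, N_ATOMS)
        probs_distribution = self.apply_softmax(cat_out)  # probabilities over atoms
        weights = probs_distribution * self.supports  # weight each atom by its support value
        res = weights.sum(dim=2)  # expected value per action, (batch, n_actions)
        return cat_out, res

    def q_vals(self, x):
        # Only the expected Q-values, (batch, n_actions)
        return self.both(x)[1]

    def apply_softmax(self, t):
        # Softmax over the atom dimension, applied to each (sample, action) row
        return self.softmax(t.view(-1, N_ATOMS)).view(t.size())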
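
A minimal sketch of how a greedy action could be picked from a (batch, actions, atoms) output like the one described above. It does not use the model from the post: the logits are hand-built so the result is easy to check. `greedy_actions`, `V_MIN` and `V_MAX` are hypothetical names and values chosen for illustration; only `N_ATOMS = 51` is taken from the reported (1, 3, 51) shape.

```python
import torch

# Assumed constants: N_ATOMS matches the (1, 3, 51) output in the question;
# V_MIN and V_MAX are illustrative values, not taken from the post.
N_ATOMS = 51
V_MIN, V_MAX = -10.0, 10.0


def greedy_actions(logits: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """Return one action index per batch row from (batch, actions, atoms) logits."""
    probs = torch.softmax(logits, dim=2)    # distribution over atoms for each action
    q_vals = (probs * supports).sum(dim=2)  # expected return, shape (batch, actions)
    return q_vals.argmax(dim=1)             # shape (batch,)


# Evenly spaced support atoms between V_MIN and V_MAX.
supports = torch.linspace(V_MIN, V_MAX, N_ATOMS)

# Hand-built logits for 1 observation and 3 actions.
logits = torch.zeros(1, 3, N_ATOMS)
logits[0, 0, 0] = 20.0    # action 0: almost all mass on the lowest atom
logits[0, 1, 25] = 20.0   # action 1: almost all mass on the middle atom
logits[0, 2, 50] = 20.0   # action 2: almost all mass on the highest atom

with torch.no_grad():     # no gradients are needed for validation
    actions = greedy_actions(logits, supports)

action_idx = int(actions[0].item())  # plain Python int, suitable for env.step()
print(action_idx)
```

With the model above, `net.q_vals(obs_v).argmax(dim=1)` should amount to the same computation, since `q_vals` already applies the softmax and the support-weighted sum. Here `net` and `obs_v` stand for the trained model and an observation tensor with a leading batch dimension.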