eslambakr opened this issue 5 years ago
This is a bit too general, could you please provide more details?
@gal-leibovich
Thanks for your quick response.
The details are as follows:
1- I have created a simple environment as seen in the figure below
2- I used your original code to train the agent to take the right turn, and it succeeded after 60k steps.
3- I want to extend the original code to support conditional RL. Here are my modifications to achieve that:
3.1- I created 3 heads instead of one in the preset file:
```python
agent_params.network_wrappers['main'].heads_parameters = \
    [DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
     DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
     DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2))]
```
3.2- I decided to take the following approach: I do a forward pass through all 3 heads, but back-propagate only through the correct one by zeroing the gradients of the other two heads (a standalone sketch of this idea follows below). So I duplicated the targets as follows in ddqn_agent.py:
```python
result = self.networks['main'].train_and_sync_networks(inputs=batch.states(network_keys),
                                                       targets=[TD_targets, TD_targets, TD_targets],
                                                       importance_weights=importance_weights)
```
and I choose the prediction from the correct head as follows, in ddqn_agent.py (learn_from_batch function):
```python
selected_actions = np.argmax(self.networks['main'].online_network.predict(
    batch.next_states(network_keys))[Config.direction], 1)
```
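For clarity, this is the behavior I am after, written as a standalone sketch (TF1 graph mode, as Coach uses; the names, shapes, and toy heads are purely illustrative, not Coach's API): a one-hot mask over the per-head losses zeroes the contribution, and therefore the gradients, of the two heads that were not commanded.

```python
import tensorflow as tf  # TF1 graph mode, as used by Coach

num_heads, num_actions = 3, 4
direction = tf.placeholder(tf.int32, shape=(), name="direction")
features = tf.placeholder(tf.float32, shape=(None, 16), name="features")
td_targets = tf.placeholder(tf.float32, shape=(None, num_actions), name="td_targets")

# stand-ins for the three Dueling-Q heads on top of a shared trunk
trunk = tf.layers.dense(features, 32, activation=tf.nn.relu, name="trunk")
head_losses = []
for i in range(num_heads):
    q_values = tf.layers.dense(trunk, num_actions, name="head_%d" % i)
    head_losses.append(tf.reduce_mean(tf.squared_difference(q_values, td_targets)))

# one-hot mask: 1.0 for the commanded head, 0.0 for the others, so only
# the selected head (and the shared trunk) receives non-zero gradients
mask = tf.one_hot(direction, num_heads)
total_loss = tf.reduce_sum(mask * tf.stack(head_losses))
train_op = tf.train.AdamOptimizer(1e-4).minimize(total_loss)
```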
3.3- In value_optimization_agent (get_all_q_values_for_states function) I changed this line:

```python
actions_q_values = self.get_prediction(states)
```

to the following, so that the Q-values come from the commanded head:

```python
actions_q_values = self.get_prediction(states)[Config.direction]
```
3.4- Then comes the most important part: making the losses of the other two heads equal zero, in general_network.py (get_model function):
```python
def get_head_1_loss(loss):
    return [loss[0]]

def get_head_2_loss(loss):
    return [loss[1]]

def get_head_3_loss(loss):
    return [loss[2]]
```
and then, also in general_network.py (get_model function):

```python
if self.config.activate_3_heads:
    direction = tf.placeholder(tf.int32, name="direction")
    self.name_direction = direction
    self.losses = tf.case({tf.equal(direction, tf.constant(0)): lambda: get_head_1_loss(self.losses),
                           tf.equal(direction, tf.constant(1)): lambda: get_head_2_loss(self.losses),
                           tf.equal(direction, tf.constant(2)): lambda: get_head_3_loss(self.losses)},
                          exclusive=False)
```
3.5- In architecture.py I added this line to remove the None gradients caused by step 3.4:
```python
if self.config.activate_3_heads:
    self.tensor_gradients = [x for x in self.tensor_gradients if x is not None]
```
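I am not sure this filtering is safe, though: if the gradients list is later zipped with the network's variable list for apply_gradients, dropping entries shifts that pairing, so head gradients would be applied to the wrong variables, which would look exactly like the heads interfering. A variant that preserves the alignment (just a sketch; I am assuming the matching variable list is self.weights, the actual name in architecture.py may differ):

```python
if self.config.activate_3_heads:
    # replace the None gradients of the disconnected heads with zeros
    # instead of dropping them, so the i-th gradient still corresponds
    # to the i-th variable when the two lists are zipped together
    self.tensor_gradients = [g if g is not None else tf.zeros_like(w)
                             for g, w in zip(self.tensor_gradients, self.weights)]
```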
3.6- Finally, in architecture.py (parallel_predict function), I choose the correct output according to which head is active:

```python
if config.activate_3_heads:
    fetches += [network.outputs[Config.direction]]
else:
    fetches += network.outputs
```
Unfortunately, the above modifications failed: the 3 heads seem to interfere with each other. My suspicion is that back-propagation is mistakenly performed on all 3 heads, not only on the desired one.
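One way to verify this suspicion is to print, for a fixed direction, the gradient magnitude reaching every trainable variable. In this toy trunk-plus-three-heads graph (illustrative only, not Coach code), only the trunk/* and head_0/* variables should come out non-zero when direction is 0:

```python
import numpy as np
import tensorflow as tf  # TF1 graph mode

# toy shared trunk with three heads, mirroring the conditional setup
x = tf.placeholder(tf.float32, shape=(None, 8), name="x")
direction = tf.placeholder(tf.int32, shape=(), name="direction")
trunk = tf.layers.dense(x, 8, activation=tf.nn.relu, name="trunk")
heads = [tf.layers.dense(trunk, 2, name="head_%d" % i) for i in range(3)]
losses = [tf.reduce_mean(tf.square(h)) for h in heads]
loss = tf.gather(tf.stack(losses), direction)

grads = tf.gradients(loss, tf.trainable_variables())
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.randn(4, 8).astype(np.float32), direction: 0}
    for var, g in zip(tf.trainable_variables(), sess.run(grads, feed)):
        # non-zero sums should appear only for trunk/* and head_0/*
        print(var.op.name, np.abs(g).sum())
```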
Thanks
Dear @gal-leibovich, I would be thankful if you could tell me what is wrong with my approach, or guide me to a simpler one, so that I can extend the code to N (e.g. 3) heads, each one learning a different task.
Could you guide me on how to extend this awesome work to support conditional RL?