IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
Apache License 2.0

Adding Conditional RL feature #371

Open eslambakr opened 5 years ago

eslambakr commented 5 years ago

Could you guide me on how to extend this awesome work to support conditional RL?

gal-leibovich commented 5 years ago

This is a bit too general. Could you please provide more details?

eslambakr commented 5 years ago

@gal-leibovich Thanks for your quick response. The details are as follows:

1- I created a simple environment, shown in the figure below.

[Figure: simple_car]

2- I used your original code to train the agent to take the right turn, and it succeeded after 60k steps.

3- I want to extend the original code so that it supports conditional RL. Here are my modifications to achieve that:

3.1- I created 3 heads instead of one in the preset file:

    agent_params.network_wrappers['main'].heads_parameters = \
        [DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
         DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
         DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2))]

3.2- I decided on the following approach: run the forward pass through all 3 heads, but backpropagate only through the correct one by setting the gradients of the other two heads to zero. To do that, I duplicate the targets as follows:

    result = self.networks['main'].train_and_sync_networks(inputs=batch.states(network_keys),
                                                           targets=[TD_targets, TD_targets, TD_targets],
                                                           importance_weights=importance_weights)

and choose the prediction from the correct head as follows:

    selected_actions = np.argmax(
        self.networks['main'].online_network.predict(batch.next_states(network_keys))[Config.direction], 1)

3.3- In value_optimization_agent (get_all_q_values_for_states function) I changed this line:

    actions_q_values = self.get_prediction(states)

to the following, to choose among the 3 heads:

    actions_q_values = self.get_prediction(states)[Config.direction]
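The head selection in 3.2 and 3.3 can be sketched in plain Python as below. This is only an illustration under stated assumptions: it uses no Coach or TensorFlow calls, the function names are hypothetical, and it assumes each transition in the batch carries its own direction (the post itself reads a single `Config.direction` value).

```python
# Sketch only: plain Python lists stand in for network outputs.
# select_q_values and greedy_actions are hypothetical helper names.

def select_q_values(head_outputs, directions):
    """Pick, for every sample, the Q-values produced by that sample's head.

    head_outputs[h][b] is the list of Q-values from head h for sample b.
    directions[b] is the head index (assumed to be stored per transition).
    """
    return [head_outputs[d][b] for b, d in enumerate(directions)]


def greedy_actions(q_values):
    """Return the index of the largest Q-value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]


# 3 heads, batch of 2 samples, 2 actions each.
head_outputs = [
    [[1.0, 0.0], [0.2, 0.1]],  # head 0
    [[0.0, 5.0], [0.3, 0.9]],  # head 1
    [[2.0, 1.0], [0.7, 0.4]],  # head 2
]
directions = [1, 2]  # sample 0 uses head 1, sample 1 uses head 2

selected = select_q_values(head_outputs, directions)
actions = greedy_actions(selected)
```

If every sample in a batch is known to share one direction, indexing the list of head outputs once, as in the post, is equivalent.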

3.4- Then came the most important part: making the losses of the other two heads equal to zero. In the get_model function:

    def get_head_1_loss(loss):
        return [loss[0]]

    def get_head_2_loss(loss):
        return [loss[1]]

    def get_head_3_loss(loss):
        return [loss[2]]

Also in the get_model function:

    if self.config.activate_3_heads:
        direction = tf.placeholder(tf.int32, name="direction")
        self.name_direction = direction
        self.losses = tf.case({tf.equal(direction, tf.constant(0)): lambda: get_head_1_loss(self.losses),
                               tf.equal(direction, tf.constant(1)): lambda: get_head_2_loss(self.losses),
                               tf.equal(direction, tf.constant(2)): lambda: get_head_3_loss(self.losses)},
                              exclusive=False)
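Another way to express "train only the active head" is to keep all head losses and weight them with a one-hot mask before summing. The sketch below shows the arithmetic in plain Python with a squared-error loss per head. It is an assumption-laden illustration, not Coach's implementation: all function names are hypothetical and no TensorFlow or Coach API is involved.

```python
# Sketch only: illustrates the masking arithmetic, not a Coach/TensorFlow API.
# one_hot, masked_total_loss and head_loss_gradients are hypothetical names.

def one_hot(index, size):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def masked_total_loss(head_losses, direction):
    """Sum the head losses weighted by the mask; inactive heads contribute 0."""
    mask = one_hot(direction, len(head_losses))
    return sum(m * loss for m, loss in zip(mask, head_losses))


def head_loss_gradients(predictions, targets, direction):
    """Gradient of the masked total loss with respect to each head's prediction.

    total = sum_i mask_i * (pred_i - target_i) ** 2, so the gradient for
    head i is mask_i * 2 * (pred_i - target_i): zero for inactive heads.
    """
    mask = one_hot(direction, len(predictions))
    return [m * 2.0 * (p - t) for m, p, t in zip(mask, predictions, targets)]


total = masked_total_loss([0.5, 2.0, 1.25], direction=1)
grads = head_loss_gradients([1.0, 3.0, 2.0], [0.0, 1.0, 0.0], direction=1)
```

With this formulation the inactive heads receive exactly zero gradient instead of a missing one. Any layers shared between the heads would still be updated by the active head's gradient, so the heads remain coupled through those shared weights.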

3.5- I added this line to remove the None gradients caused by step 3.4:

    if self.config.activate_3_heads:
        self.tensor_gradients = [x for x in self.tensor_gradients if x is not None]
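One thing worth checking with a filter like the one in 3.5: if the gradient list is later matched to the list of weights by position (an assumption about the surrounding code, not something the post confirms), dropping the None entries from the gradients alone shifts the pairing. A minimal plain-Python sketch, with a hypothetical helper name and made-up variable names, that filters gradients and variables together:

```python
# Sketch only: strings stand in for variables and floats for gradients.
# pair_and_filter is a hypothetical helper, not part of Coach or TensorFlow.

def pair_and_filter(gradients, variables):
    """Pair each gradient with its variable, then drop pairs with no gradient."""
    if len(gradients) != len(variables):
        raise ValueError("gradients and variables must have the same length")
    return [(g, v) for g, v in zip(gradients, variables) if g is not None]


gradients = [0.1, None, None, 0.4]
variables = ["trunk/w", "head_0/w", "head_2/w", "head_1/w"]

# Filtering the pairs keeps every gradient attached to its own variable.
pairs = pair_and_filter(gradients, variables)

# Filtering only the gradients and zipping afterwards misaligns them:
# the gradient of head_1/w ends up next to head_0/w.
naive = list(zip([g for g in gradients if g is not None], variables))
```

Replacing each missing gradient with zeros of the matching shape is another option that keeps the list lengths unchanged.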

3.6- Finally, in the parallel_predict function, I choose the correct output according to which head is activated:

    if config.activate_3_heads:
        fetches += [network.outputs[Config.direction]]
    else:
        fetches += network.outputs

Unfortunately, the above modifications failed. The 3 heads seem to interfere with each other. My guess is that backpropagation is mistakenly performed on all 3 heads rather than only on the desired head.


eslambakr commented 5 years ago

Dear @gal-leibovich, I would be thankful if you could tell me what is wrong with my approach, or guide me to a simpler approach for extending the code to N (e.g. 3) heads, each of which learns a different task.