IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Adding Conditional RL feature #371

Open eslambakr opened 5 years ago

eslambakr commented 5 years ago

Could you guide me on how to extend this awesome work to support conditional RL?

gal-leibovich commented 5 years ago

This is a bit too general, could you please provide more details?

eslambakr commented 5 years ago

@gal-leibovich Thanks for your quick response. The details are as follows:

1- I created a simple environment, as seen in the figure below.

[figure: simple_car]

2- I used your original code to train the agent to take the right turn, and it succeeded after 60k steps.

3- I want to extend the original code so that it supports conditional RL. Here are my modifications to achieve that:

3.1- I created 3 heads instead of one in the preset file (a slightly fuller version is sketched right below):

```python
agent_params.network_wrappers['main'].heads_parameters = \
    [DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
     DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2)),
     DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2))]
```
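For reference, a slightly fuller version of that preset change. The import path is my assumption about the rl_coach package layout, and `agent_params` is the object already defined in the preset; everything else mirrors the assignment above:

```python
import math

# Assumed import path; adjust to the installed rl_coach version.
from rl_coach.architectures.head_parameters import DuelingQHeadParameters

# One DuelingQ head per direction (e.g. left / straight / right), all using the
# same gradient rescaling factor as the single-head preset.
agent_params.network_wrappers['main'].heads_parameters = [
    DuelingQHeadParameters(rescale_gradient_from_head_by_factor=1 / math.sqrt(2))
    for _ in range(3)
]
```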

3.2- I decided to take the following approach: I forward-propagate through all 3 heads but back-propagate only through the correct one, by zeroing the gradients of the other two heads (a standalone sketch of this gradient-masking idea appears after step 3.3 below). So I duplicate the targets as follows in ddqn_agent.py:

```python
result = self.networks['main'].train_and_sync_networks(inputs=batch.states(network_keys),
                                                        targets=[TD_targets, TD_targets, TD_targets],
                                                        importance_weights=importance_weights)
```

and choose the prediction from the correct head as follows in ddqn_agent.py (learn_from_batch function):

```python
selected_actions = np.argmax(
    self.networks['main'].online_network.predict(batch.next_states(network_keys))[Config.direction], 1)
```

3.3- In value_optimization_agent (get_all_q_values_for_states function) I changed this line:

```python
actions_q_values = self.get_prediction(states)
```

to the following, so that it takes the output of the correct head:

```python
actions_q_values = self.get_prediction(states)[Config.direction]
```
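To illustrate what 3.2 is aiming for outside of Coach, here is a minimal TF1-style toy (hypothetical layer names, not Coach code) where three heads share a trunk and a one-hot mask over the per-head losses ensures that only the selected head is updated on a given batch:

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph mode, as used by Coach

state = tf.placeholder(tf.float32, [None, 4])
target = tf.placeholder(tf.float32, [None, 2])
direction = tf.placeholder(tf.int32, [])           # which head to train on this batch

trunk = tf.layers.dense(state, 16, activation=tf.nn.relu, name='trunk')
heads = [tf.layers.dense(trunk, 2, name='head_%d' % i) for i in range(3)]

# Per-head regression losses; the one-hot mask zeroes the losses of the two
# non-selected heads, so their weights receive a zero gradient and stay put.
per_head_loss = tf.stack([tf.reduce_mean(tf.square(h - target)) for h in heads])
loss = tf.reduce_sum(tf.one_hot(direction, 3) * per_head_loss)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {state: np.random.randn(8, 4),
            target: np.random.randn(8, 2),
            direction: 1}                          # train head 1 only
    sess.run(train_op, feed_dict=feed)
```

This masking is the same effect that steps 3.4 and 3.5 try to obtain through tf.case and gradient filtering.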

3.4- Then came the most important part: making the losses of the other two heads equal zero. In general_network.py (get_model function):

```python
def get_head_1_loss(loss):
    return [loss[0]]

def get_head_2_loss(loss):
    return [loss[1]]

def get_head_3_loss(loss):
    return [loss[2]]
```

Also in general_network.py (get_model function):

```python
if self.config.activate_3_heads:
    direction = tf.placeholder(tf.int32, name="direction")
    self.name_direction = direction
    self.losses = tf.case({tf.equal(direction, tf.constant(0)): lambda: get_head_1_loss(self.losses),
                           tf.equal(direction, tf.constant(1)): lambda: get_head_2_loss(self.losses),
                           tf.equal(direction, tf.constant(2)): lambda: get_head_3_loss(self.losses)},
                          exclusive=False)
```
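As a side note on the selection mechanism itself: tf.case returns the value of the first branch whose predicate is True at run time. A tiny standalone example of that behavior (toy constants, not the Coach losses):

```python
import tensorflow as tf  # TF 1.x graph mode

direction = tf.placeholder(tf.int32, name="direction")
head_losses = [tf.constant(1.0), tf.constant(2.0), tf.constant(3.0)]

# Exactly one predicate is True for a valid direction, and that branch's
# value is what tf.case returns.
selected_loss = tf.case({tf.equal(direction, 0): lambda: head_losses[0],
                         tf.equal(direction, 1): lambda: head_losses[1],
                         tf.equal(direction, 2): lambda: head_losses[2]},
                        exclusive=False)

with tf.Session() as sess:
    print(sess.run(selected_loss, feed_dict={direction: 2}))  # prints 3.0
```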

3.5- In architecture.py I added this line to remove the None gradients that are caused by step 3.4:

```python
if self.config.activate_3_heads:
    self.tensor_gradients = [x for x in self.tensor_gradients if x is not None]
```
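For context on why that filtering is needed: in graph-mode TensorFlow, gradients come back as None for variables that the differentiated loss does not depend on, and such entries must be dropped before further gradient processing. A minimal standalone illustration (hypothetical variable names):

```python
import tensorflow as tf  # TF 1.x graph mode

x = tf.placeholder(tf.float32, [None, 3])
w_used = tf.Variable(tf.ones([3, 1]))
w_unused = tf.Variable(tf.ones([3, 1]))    # plays no part in the loss below

loss = tf.reduce_mean(tf.matmul(x, w_used))

# The gradient w.r.t. w_unused comes back as None because the loss never
# touches it.
grads = tf.gradients(loss, [w_used, w_unused])
print(grads)                               # [<Tensor ...>, None]

# Dropping the None entries keeps later processing (clipping, accumulation,
# apply_gradients) straightforward.
grads = [g for g in grads if g is not None]
```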

3.6- Finally, in architecture.py (parallel_predict function) I choose the correct output according to which head is activated:

```python
if config.activate_3_heads:
    fetches += [network.outputs[Config.direction]]
else:
    fetches += network.outputs
```

But unfortunately the above modifications failed. The 3 heads seem to interfere with each other (my hunch is that back-propagation is mistakenly applied to all 3 heads, not only to the desired one).
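One way to check that hunch directly is to snapshot each head's variables before and after a single training step. A generic graph-mode helper (not Coach-specific; `sess`, `train_op`, `feed`, and the scope names in the usage line are hypothetical placeholders for whatever the experiment already has):

```python
import numpy as np
import tensorflow as tf  # TF 1.x graph mode

def heads_updated_by_step(sess, train_op, feed, head_scopes):
    """Run one training step and report which heads' variables actually changed."""
    snapshots = {}
    for scope in head_scopes:
        vars_ = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
        snapshots[scope] = (vars_, sess.run(vars_))
    sess.run(train_op, feed_dict=feed)
    changed = {}
    for scope, (vars_, before) in snapshots.items():
        after = sess.run(vars_)
        changed[scope] = any(not np.allclose(b, a) for b, a in zip(before, after))
    return changed

# Usage (hypothetical names): with the direction fed as head 0, only head 0's
# scope should report True.
# print(heads_updated_by_step(sess, train_op, feed, ['head_0', 'head_1', 'head_2']))
```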

Thanks

eslambakr commented 5 years ago

Dear @gal-leibovich, I would be thankful if you could tell me what is wrong with my approach, or guide me to a simpler way of extending the code to N (e.g. 3) heads, each one learning a different task.