IntelLabs / coach

Reinforcement Learning Coach by Intel AI Lab enables easy experimentation with state of the art Reinforcement Learning algorithms
https://intellabs.github.io/coach/
Apache License 2.0

Further improvement of using trained agents in production #374

Closed Eriz11 closed 5 years ago

Eriz11 commented 5 years ago

Hey all,

After using rl_coach for some days now, I have trained some models that seem promising. This relates directly to issue #71: I have tried to do this with TensorFlow. As suggested in that issue, TF Serving can be used to accomplish it. However, I don't need to serve the model online (and I think many users won't either), so something like:

Loading the graph > loading the weights/parameters > performing the operation (i.e. some kind of .act(), or just .run() of the op in a tf.Session) would be sufficient.

I have tried this path using the different checkpoints saved. Let me share some code:

    import tensorflow as tf  # TF 1.x API

    ### Create the session in which we will run.
    tensorflowSess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))

    ############# LOADING #############
    ### First, load the meta graph.
    ### NOTE: Restarting training from a saved meta_graph only works if the device assignments have not changed > allow_soft_placement=True
    restorerObject = tf.train.import_meta_graph(metaGraphPath)

    ### Then, restore the weights (parameters) of that graph:
    restorerObject.restore(tensorflowSess, ckptFilePath)

    ### Finally, get the operations we want to run and create the feed_dict:
    restoredGraph = tf.get_default_graph()

    '''If we want to return a value,
       we need to get the tensor of that operation (whatever:0/1...),
       because the tensor is what holds the returned value,
       not the operation we get with get_operation_by_name.
       Furthermore, it seems that fetching the last operation of the graph
       populates the graph all the way back to the beginning.'''

    feedingXObservation = restoredGraph.get_tensor_by_name('main_level/agent/main/online/Placeholder:0')

The problem here is that we need to know 1) the name of the tensor that feeds the data into the NN architecture used by each agent, and 2) the last operation that outputs the action values (or their probabilities, e.g. in the case of the Rainbow algorithm), so that we can feed the new observation and run inference.

Setting print_networks_summary=True in the VisualizationParameters gives some hint about what to look for, but it is not clear how to get from there to concrete tensor names. For example, suppose that, as most of us would, we want the first placeholder to feed the observation, and the tensor of the last operation to read the output. For a Rainbow agent, one of the most complex, we have the following architecture:

Network: main, Copies: 2 (online network | target network)
----------------------------------------------------------
Input Embedder: observation
        Input size = [163]
        Noisy Dense (num outputs = 256)
        Activation (type = <function relu at 0x7f90a65608c8>)
Middleware:
        No layers
Output Head: rainbow_q_values_head
        State Value Stream - V
                Dense (num outputs = 512)
                Dense (num outputs = 51)
        Action Advantage Stream - A
                Dense (num outputs = 512)
                Dense (num outputs = 153)
                Reshape (new size = 3 x 51)
                Subtract(A, Mean(A))
        Add (V, A)
        Softmax

Let's look for the first placeholder and for the last softmax layer.

print([n.name for n in restoredGraph.as_graph_def().node if 'Softmax' in n.op]) gives:

['main_level/agent/main/online/network_0/rainbow_q_values_head_0/Softmax', 'main_level/agent/main/online/network_0/rainbow_q_values_head_0/softmax_cross_entropy_with_logits_sg', 'main_level/agent/main/online/network_0/rainbow_q_values_head_0/softmax_1', 'main_level/agent/main/online/gradients/main_level/agent/main/online/network_0/rainbow_q_values_head_0/softmax_cross_entropy_with_logits_sg_grad/LogSoftmax', 'main_level/agent/main/target/network_0/rainbow_q_values_head_0/Softmax', 'main_level/agent/main/target/network_0/rainbow_q_values_head_0/softmax_cross_entropy_with_logits_sg', 'main_level/agent/main/target/network_0/rainbow_q_values_head_0/softmax_1', 'main_level/agent/main/target/gradients/main_level/agent/main/target/network_0/rainbow_q_values_head_0/softmax_cross_entropy_with_logits_sg_grad/LogSoftmax']

and looking for the first placeholder like:

print([n.name for n in restoredGraph.as_graph_def().node if 'Placeholder' in n.op]) gives:

['main_level/agent/main/online/Placeholder', 'main_level/agent/main/online/network_0/observation/observation', 'main_level/agent/main/online/network_0/gradients_from_head_0-0_rescalers_1', 'main_level/agent/main/online/network_0/rainbow_q_values_head_0/distributions', 'main_level/agent/main/online/network_0/rainbow_q_values_head_0/rainbow_q_values_head_0_importance_weight', 'main_level/agent/main/online/0_holder', 'main_level/agent/main/online/1_holder', 'main_level/agent/main/online/2_holder', 'main_level/agent/main/online/3_holder', 'main_level/agent/main/online/4_holder', 'main_level/agent/main/online/5_holder', 'main_level/agent/main/online/6_holder', 'main_level/agent/main/online/7_holder', 'main_level/agent/main/online/8_holder', 'main_level/agent/main/online/9_holder', 'main_level/agent/main/online/10_holder', 'main_level/agent/main/online/11_holder', 'main_level/agent/main/online/12_holder', 'main_level/agent/main/online/13_holder', 'main_level/agent/main/online/14_holder', 'main_level/agent/main/online/15_holder', 'main_level/agent/main/online/16_holder', 'main_level/agent/main/online/17_holder', 'main_level/agent/main/online/18_holder', 'main_level/agent/main/online/19_holder', 'main_level/agent/main/online/20_holder', 'main_level/agent/main/online/output_gradient_weights', 'main_level/agent/main/target/Placeholder', 'main_level/agent/main/target/network_0/observation/observation', 'main_level/agent/main/target/network_0/gradients_from_head_0-0_rescalers_1', 'main_level/agent/main/target/network_0/rainbow_q_values_head_0/distributions', 'main_level/agent/main/target/network_0/rainbow_q_values_head_0/rainbow_q_values_head_0_importance_weight', 'main_level/agent/main/target/0_holder', 'main_level/agent/main/target/1_holder', 'main_level/agent/main/target/2_holder', 'main_level/agent/main/target/3_holder', 'main_level/agent/main/target/4_holder', 'main_level/agent/main/target/5_holder', 'main_level/agent/main/target/6_holder', 'main_level/agent/main/target/7_holder', 'main_level/agent/main/target/8_holder', 'main_level/agent/main/target/9_holder', 'main_level/agent/main/target/10_holder', 'main_level/agent/main/target/11_holder', 'main_level/agent/main/target/12_holder', 'main_level/agent/main/target/13_holder', 'main_level/agent/main/target/14_holder', 'main_level/agent/main/target/15_holder', 'main_level/agent/main/target/16_holder', 'main_level/agent/main/target/17_holder', 'main_level/agent/main/target/18_holder', 'main_level/agent/main/target/19_holder', 'main_level/agent/main/target/20_holder', 'main_level/agent/main/target/output_gradient_weights', 'Placeholder', 'Placeholder_1', 'Placeholder_2', 'Placeholder_3', 'Placeholder_4', 'Placeholder_5', 'Placeholder_6', 'Placeholder_7', 'Placeholder_8', 'Placeholder_9', 'Placeholder_10', 'Placeholder_11', 'Placeholder_12', 'Placeholder_13', 'Placeholder_14', 'Placeholder_15', 'Placeholder_16', 'Placeholder_17', 'Placeholder_18', 'Placeholder_19', 'Placeholder_20', 'Placeholder_21', 'Placeholder_22', 'Placeholder_23', 'Placeholder_24', 'Placeholder_25', 'Placeholder_26', 'Placeholder_27', 'Placeholder_28', 'Placeholder_29', 'Placeholder_30', 'Placeholder_31', 'Placeholder_32', 'Placeholder_33', 'Placeholder_34', 'Placeholder_35', 'Placeholder_36', 'Placeholder_37', 'Placeholder_38', 'Placeholder_39', 'Placeholder_40', 'Placeholder_41', 'Placeholder_42', 'Placeholder_43', 'Placeholder_44', 'Placeholder_45', 'Placeholder_46', 'Placeholder_47', 'Placeholder_48', 'Placeholder_49', 'Placeholder_50', 'Placeholder_51', 'Placeholder_52', 
'Placeholder_53', 'Placeholder_54', 'Placeholder_55', 'Placeholder_56', 'Placeholder_57', 'Placeholder_58', 'Placeholder_59', 'Placeholder_60', 'Placeholder_61', 'Placeholder_62', 'Placeholder_63', 'Placeholder_64', 'Placeholder_65', 'save/filename', 'save/Const']

So, my intuition is to pick Placeholder:0 as the input and the 'main_level/agent/main/target/gradients/main_level/agent/main/target/network_0/rainbow_q_values_head_0/softmax_cross_entropy_with_logits_sg_grad/LogSoftmax:0' tensor as the output, both via get_tensor_by_name, but I'm not sure how to interpret all that information or how to be certain.
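For concreteness, this is roughly what I would run once the two names are settled. It is only a sketch: I am assuming that the online network's Softmax (rather than the gradient LogSoftmax node) is the right output tensor and that the placeholder takes a batch dimension, and I have not verified either against coach internals.

    ### Rough sketch only: the tensor names and shapes below are my guesses from the listings above.
    import numpy as np

    inputTensor = restoredGraph.get_tensor_by_name(
        'main_level/agent/main/online/Placeholder:0')
    outputTensor = restoredGraph.get_tensor_by_name(
        'main_level/agent/main/online/network_0/rainbow_q_values_head_0/Softmax:0')

    ### Dummy observation with a leading batch dimension (input size 163, as printed above).
    newObservation = np.random.random((1, 163))

    ### For Rainbow this should be the per-action distribution over atoms; turning it into
    ### Q-values would still require the expectation over the distribution support.
    actionDistributions = tensorflowSess.run(outputTensor,
                                             feed_dict={inputTensor: newObservation})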

I think this feature is crucial for the framework to complete the creation-and-deployment cycle. It could be developed further in a PR, or at least handled not in rl_coach itself but on the TF side (my idea would be simply to give explicit names to the tensors needed to make this happen, i.e. the input one and the final one).
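To illustrate what I mean by explicit names, something like the following at graph-construction time would already be enough. The identifiers below are purely hypothetical, not anything coach currently defines:

    ### Hypothetical sketch: wrap the two endpoints in named identity ops when the
    ### network is built, so they can later be fetched by a stable, documented name.
    observation_input = tf.identity(observation_placeholder, name='inference_input')
    action_output = tf.identity(head_output, name='inference_output')

    ### ...and at inference time, independently of the agent's internal naming:
    inputTensor = restoredGraph.get_tensor_by_name('inference_input:0')
    outputTensor = restoredGraph.get_tensor_by_name('inference_output:0')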

Any thoughts on this? @gal-leibovich @galnov and others: I can try to help make it happen on my side, but I don't know your ideas regarding this important core part of coach.

If there is another way to do it, I'm aware that it could be done by loading the whole coach framework, something like:

### Create the whole graph and then call restore_checkpoint().
### Get the observation...

action_info = coach.graph_manager.get_agent().choose_action(observation)
print("State:{}, Action:{}".format(observation,action_info.action))

If that is possible and is "the way" to go, it would be awesome to have a mini-tutorial on how to load a pretrained model once training has finished.

galnov commented 5 years ago

Thanks for the detailed explanation of what you are trying to achieve and the problems you encountered along the way! Our recommendation is to create a graph manager for inference similar to the one you used for training - this will make sure you are loading the stored checkpoint and feeding the data correctly. We just added another section to the Quick Start Guide tutorial (the last one - Advanced functionality) that shows how to evaluate a stored checkpoint. Please take a look and let us know if it does not address your problem.

Eriz11 commented 5 years ago

Hey Gal,

Thanks for the thoughts and the help. I see the approach in the tutorial, which makes it straightforward to run inference. However, if I may push a bit further, it doesn't fully cover running inference in production with custom environments.

As I see it, the evaluate method takes a number of steps and eventually calls the act method, which doesn't actually return anything; the step function of the level_manager is then called with a None parameter.

My concern here is that I would like to be able to pass a custom observation obtained from new and unknown states in my custom environments.

The workflow could follow the exact same procedure as in the tutorial, but with the ability to optionally pass a new observation to the evaluate method:

# Clearing the previous graph before creating the new one to avoid name conflicts
tf.reset_default_graph()

# Updating the graph manager's task parameters to restore the latest stored checkpoint from the checkpoints directory
task_parameters2 = TaskParameters()
task_parameters2.checkpoint_restore_path = my_checkpoint_dir

graph_manager.create_graph(task_parameters2)

# Option 1: act on the env states generated by the preset's environment:
graph_manager.evaluate(EnvironmentSteps(5))

# Option 2: act on a newly created env state, e.g. with an observation space of (27, 1):
new_state = np.random.random((27, 1))

# The call to evaluate would return the action performed:
# Example for Discrete(3) -> 0, 1, 2
played_action = graph_manager.evaluate(EnvironmentSteps(1), obs=new_state, new_obs=True)

# This could be built up for more than one state, like:
for each_obs in new_states:
    played_action = graph_manager.evaluate(EnvironmentSteps(1), obs=each_obs, new_obs=True)

# Cleaning up
shutil.rmtree(my_checkpoint_dir)

Have I explained myself better this time? The idea behind this small addition is to be able to infer from states not seen in the training environment, in a live production environment (with the same graph characteristics as the trained one, but with previously unseen states).

If I can be of further help, just tell me.

galnov commented 5 years ago

You can replace the graph_manager.evaluate(...) call with:

action_info = graph_manager.get_agent().choose_action(obs)
env.step(action_info.action)

as described in the "Agent functionality" section of the same tutorial above.

Eriz11 commented 5 years ago

Hey Gal,

Yes, that was the thing I had in mind previously.

However, when I use the BasicRLGraphManager directly (without the CoachInterface class used in the examples), I get an error saying that the BasicRLGraphManager object has no attribute 'get_agent'.

If I use the CoachInterface (can I pass my custom env to the preset in the CoachInterface class?), there are a number of attributes I cannot set directly (or maybe I could, but it would be convoluted > I mean, my code is structured just like the presets, with its env_params, agent_params, task_params, visualization_params...).

After setting everything up exactly as in the presets, I do:

graphManager = BasicRLGraphManager(agent_params=agentParameters,
                                   env_params=envParameters,
                                   schedule_params=SimpleScheduleWithoutEvaluation(),
                                   name='ZMTrade_Inference')

graphManager.create_graph(task_parameters=taskParams)

actionInfo = graphManager.get_agent().choose_action(observationToInferFrom)

AttributeError: 'BasicRLGraphManager' object has no attribute 'get_agent'

I suppose that, for the agent to work correctly on new environment states, I do need to pass the environment parameters to the graph so that it knows the shape of the new observation; hence I pass them to the graph I'm working on. However, as said, I'm unable to access the agent from the graphManager when I use the preset-like structure instead of the CoachInterface. If I could access the agent as in my example, it would be just perfect.

Can you shed some more light here @galnov so we can finally close this?

Many thanks for the time.

galnov commented 5 years ago

You should be able to use graph_manager.get_agent(). It was added to master after the latest pypi package was released, so make sure you're working with the latest master code and not an installed Coach package.

Here's an example of how it should be used:

env_params = GymVectorEnvironment(level='CartPole-v0')
env = GymEnvironment(**env_params.__dict__, visualization_parameters=VisualizationParameters())
graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,
                                    schedule_params=schedule_params, vis_params=VisualizationParameters())
graph_manager.create_graph(TaskParameters())
response = env.reset_internal_state()
for _ in range(10):
    action_info = graph_manager.get_agent().choose_action(response.next_state)
    response = env.step(action_info.action)
    print(response.reward)
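One note: to run this against a stored checkpoint rather than freshly initialized weights, the TaskParameters passed to create_graph can point at the checkpoint directory instead, as in the evaluation snippet earlier in this thread (my_checkpoint_dir is the folder holding the saved checkpoints):

task_parameters = TaskParameters()
task_parameters.checkpoint_restore_path = my_checkpoint_dir
graph_manager.create_graph(task_parameters)

The rest of the loop stays the same.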

Eriz11 commented 5 years ago

Hi Gal,

Many thanks for the quick update.

I have just tried to upgrade rl-coach (with both names, rl_coach and rl-coach, which seem to be the same package; the GitHub guide installs it as rl_coach, while on PyPI it is rl-coach), and found that I have version 0.12.1:

Command for reference: pip install rl_coach -U or pip install rl-coach -U

In both cases the output is: Requirement already up-to-date: rl_coach in ./miniconda3/envs/coach_rl/lib/python3.7/site-packages (0.12.1) (and likewise for rl-coach), and I still get the attribute error, both with your example and with mine.

My full code with your example:

### Import coach agents:
from rl_coach.agents.rainbow_dqn_agent import RainbowDQNAgentParameters

### Import coach environment:
from rl_coach.environments.gym_environment import GymVectorEnvironment, GymEnvironment

### Import coach graphManagers and schedules to run the graphs:
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleScheduleWithoutEvaluation

### Import coach parameters classes:
from rl_coach.base_parameters import VisualizationParameters
from rl_coach.base_parameters import TaskParameters

if __name__ == "__main__":

    agent_params = RainbowDQNAgentParameters()

    env_params = GymVectorEnvironment(level='CartPole-v0')
    env = GymEnvironment(**env_params.__dict__, visualization_parameters=VisualizationParameters())
    graph_manager = BasicRLGraphManager(agent_params=agent_params, env_params=env_params,
                                    schedule_params=SimpleScheduleWithoutEvaluation(), vis_params=VisualizationParameters())
    graph_manager.create_graph(TaskParameters())
    response = env.reset_internal_state()
    for _ in range(10):
        action_info = graph_manager.get_agent().choose_action(response.next_state)
        response = env.step(action_info.action)
        print(response.reward)

Can you double-check that the fix is included? If yes, what else could be going wrong here? Could it be something related to the use of Python 3.7?

I'm definitely using the BasicRLGraphManager, creating the graph afterwards with the TaskParameters, and finally trying to get the agent. The only difference is that my env is a custom Gym one, but that shouldn't make a difference because the env_params are just like this:

env_params = GymVectorEnvironment(level=self.envPath)

Any further help debugging this is greatly appreciated.

Thanks again for the support,

galnov commented 5 years ago

Please follow the installation instructions from a cloned repository. You should be using pip3 install -e . instead of installing the pypi package.
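For example, something along these lines (assuming the IntelLabs/coach repository linked above):

git clone https://github.com/IntelLabs/coach.git
cd coach
pip3 install -e .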

Eriz11 commented 5 years ago

Gosh...

I misunderstood your previous comment, thinking the fix had been released IN the latest pypi package rather than just committed to the master branch. Sorry for that.

Now it works perfectly, with my custom env and with your example.

Many, many thanks for the support - being able to try the models in production is a huge thing.

Thanks Gal and team!

Eriz11 commented 5 years ago

I guess that, if there are no other comments or insights, this could be closed.

@galnov, feel free to close it if you have nothing else to add.

Thanks again and have a good day!