Closed zako42 closed 5 years ago
Posting a little more information for context, if that helps:
I ran the TensorFlow session using the frozen_graph_def.pb file. Visualizing the graph in TensorBoard shows a bunch of nodes, but these are the ones I was looking at in particular:
vector_observation
  Operation: Placeholder
  Attributes (2):
    dtype {"type":"DT_FLOAT"}
    shape {"shape":{"dim":[{"size":-1},{"size":6}]}}
  Inputs (0)
  Outputs (1): main_graph_0/hidden_0/MatMul

action_masks
  Operation: Placeholder
  Attributes (2):
    dtype {"type":"DT_FLOAT"}
    shape {"shape":{"dim":[{"size":-1},{"size":9}]}}
  Inputs (0)
  Outputs (1): strided_slice_1

action
  Operation: Identity
  Attributes (1):
    T {"type":"DT_FLOAT"}
  Inputs (1): concat_1
  Outputs (0)
I plugged numbers into vector_observation and action_masks, ran the session, and then looked at action to see what the result was. I expected the result to be the same value I get when calling AgentAction() in my Unity agent (a discrete number from 0 to 8 in this case). However, the output from running the TensorFlow session manually was:
action Summary: [-21.1059818 -13.4949427 -0.000504382479 -10.6514177 -15.5597267 -7.64783669 -21.7325554 -19.7437592 -13.1423416]
While troubleshooting, I also printed out values for some other nodes:
value_estimate Summary: [0.545044422]
is_continuous_control Summary: 0
action_output_shape Summary: 9
action_probs Summary: [-9.94507313 -1.98709774 11.5074854 0.856563628 -4.05288267 3.86015272 -11.020257 -8.31383801 -1.63445354]
I tried manually adding input masks when running the session, and the values in the action array dropped to about -23 for the masked actions.
My guess is that I should be argmax-ing the return value of action, and it seems the action_probs line up with the actions (they sort the same). This is just my guess, though. I'm still looking through the ML-Agents C# code to see whether it does this kind of argmax, but I haven't found anything yet.
It looks like AgentAction might be set by the code in the DiscreteActionOutputApplier class, using Multinomial::Eval(). I don't understand it yet, but it looks promising.
Ok, sorry if this is a dumb question:
Looking at Multinomial::Eval(), it uses a random number together with the CDF it calculates from the action_probs. In my use case (the C++ simulation with a trained agent), I'm running inference only, no training. So should I skip the random draw against the CDF and just take the action with the highest probability? I'm thinking the randomness is there to allow for exploration. In reinforcement learning, when we run inference, do we only exploit, or should I still allow some exploration during inference? (Sorry for my ignorance, I'm still trying to learn these things.)
Thank you for the discussion. We are closing this issue due to inactivity. Feel free to reopen it if you’d like to continue the discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hello, and thank you for all your hard work.
I have some existing C++ simulation software, and I'm trying to use Unity-trained agents to control things in it (I re-created portions of the C++ simulation in Unity for training). The C++ simulation uses the TensorFlow C++ API, so I train agents in Unity ML-Agents and then load the frozen model .pb file into the simulation via TensorFlow C++.

Earlier, I tested a proof of concept using ML-Agents version 0.5 and was able to load the .pb and run inference (I think) by running a session, providing Tensor-mapped inputs for vector_observations and masks, and then taking the Tensor output from action. This worked (I think) OK, and I would receive the output action similarly to what I would get from AgentAction() in ML-Agents. For example, the agent output was a discrete value from 0 to 10, so running the TensorFlow session would return something like [6].

However, a colleague trained an agent using ML-Agents version 0.8, and when I try to load that .pb and run my proof-of-concept test, the output is now an array of 10 floating-point values. I noticed that things have changed between versions 0.5 and 0.8 with the introduction of Barracuda. I'm not sure what to do with the array. There are 10 discrete actions and 10 floating-point values, so my guess is to argmax them, but if anyone can shed some light on this I'd appreciate it!
I understand that my use case is probably not supported. If not, could someone point me to the code that processes the TensorFlow graph session output and feeds the AgentAction() return value? Thanks in advance for any help you can offer!