Closed zako42 closed 5 years ago
Posting a little more information for context, if that helps:
I ran the TensorFlow session using the frozen_graph_def.pb file. Visualizing the graph in TensorBoard shows a bunch of nodes, but these are the ones I was looking at in particular:
vector_observation
  Operation: Placeholder
  Attributes (2):
    dtype {"type":"DT_FLOAT"}
    shape {"shape":{"dim":[{"size":-1},{"size":6}]}}
  Inputs (0)
  Outputs (1): main_graph_0/hidden_0/MatMul

action_masks
  Operation: Placeholder
  Attributes (2):
    dtype {"type":"DT_FLOAT"}
    shape {"shape":{"dim":[{"size":-1},{"size":9}]}}
  Inputs (0)
  Outputs (1): strided_slice_1

action
  Operation: Identity
  Attributes (1):
    T {"type":"DT_FLOAT"}
  Inputs (1): concat_1
  Outputs (0)
I plugged numbers into vector_observation and action_masks, ran the session, and then looked at action to see what the result was. I expected the result to be the same value I get when calling AgentAction() in my Unity agent (a discrete number from 0 to 8 in this case). However, the output from running the TensorFlow session manually was:
action Summary: [-21.1059818 -13.4949427 -0.000504382479 -10.6514177 -15.5597267 -7.64783669 -21.7325554 -19.7437592 -13.1423416]
While troubleshooting, I also printed out values for some other nodes:
value_estimate Summary: [0.545044422]
is_continuous_control Summary: 0
action_output_shape Summary: 9
action_probs Summary: [-9.94507313 -1.98709774 11.5074854 0.856563628 -4.05288267 3.86015272 -11.020257 -8.31383801 -1.63445354]
I tried manually adding input masks when running the session, and the values in the action array dropped to about -23 for the masked actions.
My guess is that I should be argmax-ing the return value of action, and it seems the action_probs line up with the actions (they sort the same). This is just my guess, though. I'm still looking through the ML-Agents C# code to see whether it does this kind of argmax, but I haven't found anything yet.
It looks like AgentAction might be set by the code in the DiscreteActionOutputApplier class, using Multinomial::Eval(). I don't understand it yet, but it looks promising.
Ok, sorry if this is a dumb question:
Looking at Multinomial::Eval(), it uses a random number together with the CDF it calculates from the action_probs. In my use case (the C++ simulation with a trained agent), I'm running inference only, no training. So should I skip the random draw against the CDF and just take the action with the highest probability? I'm thinking the randomness is there to allow for exploration. In reinforcement learning, when we run inference, do we only exploit, or should I still allow some exploration during inference? (Sorry for my ignorance, I'm still trying to learn these things.)
Thank you for the discussion. We are closing this issue due to inactivity. Feel free to reopen it if you’d like to continue the discussion.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Hello, and thank you for all your hard work.
I have some existing C++ simulation software, and I'm trying to use Unity-trained agents to control things in it (I re-created portions of the C++ simulation in Unity for training). The C++ simulation uses the TensorFlow C++ API, so I train agents in Unity ML-Agents and then load the frozen model .pb file into the simulation via TensorFlow C++.

Earlier, I tested a proof of concept using ML-Agents version 0.5 and was able to load the .pb and run inference (I think) by running a session, providing Tensor-mapped inputs for vector_observations and masks, and then taking the Tensor output from action. This worked (I think) OK, and I would receive the output action similarly to what I would get from AgentAction() in ML-Agents. For example, the agent output was a discrete value from 0 to 10, so running the TensorFlow session would return something like [6].

However, a colleague trained an agent using ML-Agents version 0.8, and when I try to load that .pb and run my proof-of-concept test, the output is now an array of 10 floating-point values. I noticed that things have changed between versions 0.5 and 0.8 with the introduction of Barracuda. I'm not sure what to do with the array. There are 10 discrete actions and 10 floating-point values, so my guess is to argmax them, but if anyone can shed some light on this I'd appreciate it!
I understand that my use case is probably not supported. If not, could someone point me to the code that processes the TensorFlow graph session output and feeds the AgentAction() return value? Thanks in advance for any help you can offer!