LARG / HFO

Half Field Offense in Robocup 2D Soccer
MIT License

Defense NPCS Agent Closes Randomly #62

Open suraj-nair-1 opened 6 years ago

suraj-nair-1 commented 6 years ago

I have a Python script where I train a multi-agent reinforcement learning model in the HFO environment, with two model-controlled offensive agents and one hardcoded goalie. In the script, two separate processes are launched, one controlling each agent. At random times, the goalie (using the default Agent2D team) crashes, giving the following error:

Something necessary closed (defense_npc_1), [exiting]

From what I can see, this error is thrown when HFO checks whether all processes are still alive. Prior to this there are no errors, and everything appears to work. It occurs at random, sometimes after 10 episodes, sometimes after 1000. The specific HFO launch command is:

./bin/HFO --headless --frames-per-trial=500 --untouched-time=500 --fullstate --offense-agents=2 --defense-npcs=1 --no-logging

mhauskn commented 6 years ago

It looks like you're running things correctly. You're correct that if the Python script detects the goalie process has died, it will exit. The question is why the goalie is dying.

A couple of things you can try:

1) Remove the --no-logging flag, look at the generated log files log/*(.rcg|.rcl), and see if there is any error output from the goalie after it crashes.
2) If that doesn't work, add the --record flag and look at the resulting log/base_right-1.log file to see what errors the goalie reports.

suraj-nair-1 commented 6 years ago

Ok so I tried both of those things, but don't see any errors. The end of the base_right-1.log file looks like:

11575 4 M GameStatus 0
11575 4 M StateFeatures 1 1 0.999299 -0.0374288 -1 0.999968 0.00800399 1 -1 -1 -1 -1 -1 -0.999901 -0.0141028 -1 -0.996628 -0.0820581 -1 -0.998534 0.0541226 -1 -0.999832 -0.0183344 -1 -0.96972$
11575 8 M /cs/ml/ddpgHFO/HFO/build/librcsc-prefix/src/librcsc/rcsc/player/player_agent.cpp: Turn(0.00)
11576 4 M GameStatus 0
11576 4 M StateFeatures 1 1 0.999462 -0.0327955 -1 0.999968 0.00800399 1 -1 -1 -1 -1 -1 -0.999901 -0.0141028 -1 -0.996628 -0.0820581 -1 -0.998534 0.0541226 -1 -0.999832 -0.0183344 -1 -0.96972$
11576 8 M /cs/ml/ddpgHFO/HFO/build/librcsc-prefix/src/librcsc/rcsc/player/player_agent.cpp: Turn(0.00)

The end of the .rcg file looks like:

(show 11575 ((b) 3.002 -13.5504 0 0) ((l 1) 0 0 -3 -37 0 0 0 0 (v h 120) (s 8000 1 1 130600) (c 0 0 1 0 0 1 0 0 0 0 0)) ((l 2) 0 0 -6 -37 0 0 0 0 (v h 90) (s 8000 1 1 130600) (c 0 0 0 0 0 0 0$
(show 11576 ((b) 3.002 -13.5504 0 0) ((l 1) 0 0 -3 -37 0 0 0 0 (v h 120) (s 8000 1 1 130600) (c 0 0 1 0 0 1 0 0 0 0 0)) ((l 2) 0 0 -6 -37 0 0 0 0 (v h 90) (s 8000 1 1 130600) (c 0 0 0 0 0 0 0$
(show 11577 ((b) 3.002 -13.5504 0 0) ((l 1) 0 0 -3 -37 0 0 0 0 (v h 120) (s 8000 1 1 130600) (c 0 0 1 0 0 1 0 0 0 0 0)) ((l 2) 0 0 -6 -37 0 0 0 0 (v h 90) (s 8000 1 1 130600) (c 0 0 0 0 0 0 0$
(msg 11577 1 "(result 201801020959 base_left_0-vs-base_right_0)")

and the end of the .rcl file looks like:

11575,0 (referee IN_GAME-U-1)
11575,0 Recv base_right_1: (turn 0)(turn_neck -0)(attentionto our 7)(done)
11575,0 Recv base_left_7: (turn_neck -105)(done)
11575,0 Recv base_left_11: (turn_neck -94)(done)
11576,0 (referee IN_GAME-U-1)
11576,0 Recv base_right_1: (turn 0)(turn_neck -94)(attentionto our 8)(done)
11576,0 Recv base_left_7: (turn_neck -74)(done)
11576,0 Recv base_left_11: (turn_neck 174)(done)
11577,0 (referee IN_GAME-U-1)

Is anything here out of the ordinary? I also tried with the helios team goalie and still had the error.

mhauskn commented 6 years ago

Doesn't look out of the ordinary. Can you recreate the error using --offense-npcs instead of your agents?

mhauskn commented 6 years ago

If you want to get your hands dirty, you can modify the code that starts the NPCs and redirect the output from the NPC process to a file. See:

https://github.com/LARG/HFO/blob/master/bin/Teams.py#L47

Try redirecting stderr+stdout to a file of your choice and see if we get any error messages.
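As a rough illustration of that redirection, here is a minimal sketch using Python's standard subprocess module. It does not reproduce the code in Teams.py; the helper name launch_with_log, the command list, and the log path are all hypothetical and would need to be adapted to however the NPC process is actually started there.

```python
import subprocess


def launch_with_log(cmd, log_path):
    """Start cmd with both stdout and stderr appended to log_path.

    cmd is the argument list for the process to start (hypothetical here;
    in HFO it would be whatever command launches the NPC).
    """
    log_file = open(log_path, "ab")
    # stderr=subprocess.STDOUT merges stderr into the same file as stdout.
    proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    # The child holds its own copy of the file descriptor, so the parent
    # can close its handle without affecting the child's output.
    log_file.close()
    return proc
```

After the goalie dies, any message it printed before exiting should then be at the end of the chosen log file.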

suraj-nair-1 commented 6 years ago

Ok so I tried using --offense-npcs and there was no error. One thing I am noticing is that the error usually appears when my agents take a while to make a move (for example when the policy is being updated). Could it be that the defensive agent closes if the environment is not stepped for ~5 seconds?

mhauskn commented 6 years ago

It's possible that could be the case, although I've never had problems with it. You could test by adding sleeps to one of the example offense agents.
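One way to run that experiment is to wrap whatever function picks the action so that it stalls on purpose. The sketch below is independent of the HFO API: make_slow, delay_s, and every_n are hypothetical names, and the wrapped policy is assumed to be any callable that maps a state to an action inside the example agent's loop.

```python
import time


def make_slow(policy, delay_s, every_n=1, sleep=time.sleep):
    """Wrap policy so every every_n-th call pauses for delay_s seconds first.

    policy is any callable taking a state and returning an action
    (an assumption about how the agent loop is organised).
    """
    calls = 0

    def slow_policy(state):
        nonlocal calls
        calls += 1
        if calls % every_n == 0:
            # Simulate a slow decision, e.g. a policy update.
            sleep(delay_s)
        return policy(state)

    return slow_policy
```

Trying delays just below and just above 5 seconds would show whether the goalie's exit correlates with the stall length.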

cshNtu commented 4 years ago

Hi Suraj, each team has a player.conf file. Inside that file there is a setting called "server_wait_seconds", whose default value is 5. That is why the agent closes if a single move takes more than 5 seconds. Setting a larger value should resolve the problem. BTW, the team-related files are under bin/teams/.
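For anyone who wants to change the value across several files, here is a minimal sketch of a text rewrite. The exact syntax of player.conf is an assumption: the pattern accepts the key followed by an optional ':' or '=' separator and an integer, and skips commented-out lines. The function name set_server_wait_seconds is hypothetical.

```python
import re

# Assumed line shape: "server_wait_seconds", optional ':' or '=', then an integer.
_WAIT_LINE = re.compile(
    r"^([ \t]*server_wait_seconds[ \t]*[:=]?[ \t]*)\d+", re.MULTILINE
)


def set_server_wait_seconds(conf_text, seconds):
    """Return (new_text, count), with each server_wait_seconds value replaced."""
    # A function replacement keeps the original key and separator untouched.
    return _WAIT_LINE.subn(lambda m: m.group(1) + str(seconds), conf_text)
```

A count of 0 means the pattern did not match, in which case the file should be edited by hand instead. The function could be applied to each player.conf found under bin/teams/, reading the text, rewriting it, and writing it back.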

arminsadreddin commented 4 years ago

I have the exact same problem. I am training a reinforcement learning model, and I usually get this error after about 30000 cycles. I did work out one thing: it is caused by my batch size. I saved my replay buffer to a file, and after the player was killed I loaded the buffer again; the player was killed again in the very first cycle. So it depends on the size of the data. I then made my batch size smaller and the problem went away, but that is not the fix I am looking for. Since rcssserver is running in sync mode, is there any time limit? (There should not be.) I think the player gets killed when my model takes a long time to produce an action. I am really worried about this problem because my thesis is based on it. Please help! Thanks.
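To confirm whether slow model updates line up with the crashes, the gap between consecutive environment steps can be measured directly. This is a minimal sketch with hypothetical names (GapMonitor, tick, limit_s); the 5 second default mirrors the server_wait_seconds default mentioned earlier in the thread, and calling tick() once per step is an assumption about the training loop.

```python
import time


class GapMonitor:
    """Tracks the wall-clock time between consecutive calls to tick()."""

    def __init__(self, limit_s=5.0, clock=time.monotonic):
        self.limit_s = limit_s
        self._clock = clock
        self._last = None
        self.worst = 0.0  # largest gap seen so far, in seconds

    def tick(self):
        """Call once per environment step; returns the gap since the last call."""
        now = self._clock()
        gap = 0.0 if self._last is None else now - self._last
        self._last = now
        self.worst = max(self.worst, gap)
        if gap > self.limit_s:
            print(f"warning: {gap:.1f}s between steps (limit {self.limit_s:.1f}s)")
        return gap
```

If the warning appears right before "Something necessary closed", the timeout explanation becomes much more likely.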

mhauskn commented 4 years ago

Following @cshNtu, I've increased the server_wait_seconds in the player.conf file to 300. This should prevent the NPCs from disconnecting as long as your model takes less than 5 minutes to submit an action. Let me know if this helps.

arminsadreddin commented 4 years ago

Thanks for your answer. I changed the player.conf files as you said, but I still have the same problem. Is any change needed in the rcssserver or HFO files?

chen1087 commented 4 years ago

Hi arminsadreddin,

If you get the error "Something necessary closed (defense_npc_1), [exiting]", it means the NPC, i.e. the Agent2D agent, closed for some reason. Could you try setting "server_wait_seconds" to different values and check whether different settings let you handle different data buffer sizes? If the setting does have an impact, then just set "server_wait_seconds" to a very large value.

arminsadreddin commented 4 years ago

I did that, but it didn't work; there was no impact. So I am wondering whether I should change something else too.

chen1087 commented 4 years ago

Do you use a coach? There is a "server_wait_seconds" setting in the coach.conf file.

cshNtu commented 4 years ago

Please check which NPC agent closed. Is it a goalie, a normal defender, or a coach? Different roles may require different settings.