davechurchill / StarcraftAITournamentManager

Tournament Manager Software for StarCraft AI Competitions
MIT License
77 stars 43 forks source link

"run_proxy.bat" may not be executed with a low but persistent possibility #30

Closed LukeZhuang closed 5 years ago

LukeZhuang commented 6 years ago

When we run our bot on clusters, we see a low but persistent crash rate of our bot, about 1 out of 50 games. After printing out the log, we found that the "run_proxy.bat" was not executed for those crashed games(but this file does exist, just not executed). We run a python model in "run_proxy.bat", which is essential to our bot.

We think this may caused by java Runtime environment in line 146 of /src/utility/WindowsCommandTools.java Process proc = Runtime.getRuntime().exec(windowsCommandPrefix + command); Too frequent commands may invalidate this one.

We did a test to simulate this situation, the code is published in https://github.com/LukeZhuang/test_proxy. We use Java Runtime to run the test "run_proxy.bat" for 200 times and wait for 10 seconds after each 2 times. After checking "output.txt", we found that "run_proxy.bat" was in fact executed for 100 times, just half of them was executed.

Potential Solution: We found that adding sleep command after excuting Java Runtime can solve this problem. For example: Process proc = Runtime.getRuntime().exec(windowsCommandPrefix + command); Thread.sleep(10); Then the crash rate is 0 after adding that. Probably it creates a latency so it make sure that each command will not be overrided. After adding this to our test simulation, we found that "run_proxy.bat" was executed for 160 times, more than before, but still not 200 times. So, we think we can conclude that this problem is caused by too frequent commands.

richard-kelly commented 6 years ago

Thanks for creating the test code. I am able to reproduce this in your test if I take out the python and just try to run a batch file that prints the time. I'm going to look into why it fails a bit more before putting in a fix.

richard-kelly commented 6 years ago

I changed the test code (see below) to print out the error stream from the process (only on even calls, otherwise the delay prevents the error) and found that the error message was:

"The process cannot access the file because it is being used by another process."

I thought this meant the problem was too many calls to access the same batch file, but if I use two different batch files for each pair of runs the problem still happens. If you change the code to produce 200 unique output files it will successfully create all of them. I believe the issue we're seeing with the test is just multiple processes trying to write to the same output file.

I don't know what the actual problem happening in the Tournament Manager is, then, since the process to run "run_proxy.bat" is only created once. The only other process created before that in the startStarCraft() method is one that renames the character files (ClientCommands.Client_RenameCharacterFile()).

Could you try adding some similar code to log the error message (if any) in your actual bot setup?

And did I understand correctly that adding the 10ms delay stopped the crashes completely?

public class Main{
    public synchronized static void runCommand(int id, boolean printErrors){
        try
        {
            String idString=Integer.toString(id+1);
            String command="D:\\test\\AI\\run_proxy.bat";
            String windowsCommandPrefix = "CMD /C ";
            String completeCommand = windowsCommandPrefix + command;
            System.out.println("\n" + idString+ " "+ completeCommand);
            Process proc = Runtime.getRuntime().exec(windowsCommandPrefix + command);
            if (printErrors) {
                BufferedReader br = new BufferedReader (new InputStreamReader(proc.getErrorStream()));
                String line;
                if ((line = br.readLine())!= null) { //or while to print multiple lines
                    System.out.println(line); //log to a file instead?
                }
            }
            //Thread.sleep(10);
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String[] args){
        try{
            for(int i=0;i<100;i++){
                runCommand(i, false);
                runCommand(i, true);
                //Thread.sleep(2);
            }
        }catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}