I think this is a good idea; we have been thinking about something similar. I will keep you posted when we release something related.
I am going to reopen this issue so that we can use it to post updates once we get some more traction on this.
@markpwoodward I was re-reading your results (which are very counterintuitive). One thought on the GPU computation: how many vCPUs does your gcloud machine have? I noticed that the processing power of the (GCP) CPU is lower than that of both Lenovos.
I have a hunch that if the GCP machine had CPUs with performance identical to the Lenovo machines, you would get equal performance. This assumes that the V100 is not being used efficiently for game rendering by Unity (and we can ignore the use of the GPU for learning in this discussion, because we are doing PPO with a single-Actor configuration, which doesn't fill up the GPU buffer).
@eshvk Oops, I didn't mean to close this issue. The gcloud machine has 96 vCPUs and 8 V100s :), and nothing else was running but this test. I agree with your hunch about the CPU being the limiting factor. Performance seems to track CPU speed, with a minor bump when rendering on the GPU vs. the CPU on each platform. [Edit] Actually, not quite: the ThinkStation CPU has higher base and boost clocks, but maybe the boost is disabled... My 2 cents is to focus on the cloud systems, which are more standardized. [/Edit]
The example doesn't do any learning (PPO or otherwise). At least that is my intention. I was thinking that this benchmark would be just rendering, not training.
My 2 cents on the visual benchmark environment: it should have a "batchSize" resetParameter that dynamically creates Areas. The number of Areas is an important trade-off; e.g., for a batch size of 32 you could run 8 environments, each with 4 Areas, or 32 environments, each with 1 Area, and processing time doesn't grow 1:1 with the number of Areas. I did this for my own environment and found the sweet spot (see the sketch below).
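For illustration only (plain Python, not part of ML-Agents; the names `batch_splits`, `num_envs`, and `areas_per_env` are hypothetical), enumerating the possible environments-times-Areas splits for a given batch size makes it easy to time each one and find that sweet spot:

```python
# Hypothetical helper: enumerate the (num_envs, areas_per_env) splits that
# produce a given total batch size, so each split can be benchmarked separately.
# "Areas" here means duplicated training areas inside one Unity scene.

def batch_splits(batch_size):
    """Yield (num_envs, areas_per_env) pairs with num_envs * areas_per_env == batch_size."""
    for areas_per_env in range(1, batch_size + 1):
        if batch_size % areas_per_env == 0:
            yield batch_size // areas_per_env, areas_per_env

if __name__ == "__main__":
    # For a batch size of 32: (32, 1), (16, 2), (8, 4), ..., (1, 32).
    # Time each split, since per-step cost does not grow 1:1 with the number of Areas.
    for num_envs, areas in batch_splits(32):
        print(f"{num_envs} environments x {areas} Areas")
```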
@markpwoodward With 96 vCPUs (IIRC, each vCPU corresponds to one hyperthread), I suppose this means Unity is bound by per-core clock speed, where the Lenovos do better. On the GPU side, another thing worth investigating is the percentage of the GPU being used at any given time. I am guessing a very small fraction of the V100 is in use.
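One low-effort way to check that (a sketch, not part of ml-agents) is to poll `nvidia-smi` while the environment is stepping. The query flags below are standard `nvidia-smi` options; the one-second interval is arbitrary.

```python
# Minimal GPU-utilization poller: run alongside the benchmark and watch
# how much of the V100 the Unity renderer actually uses.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)  # e.g. "3, 1210" -> 3% GPU utilization, 1210 MiB used
    time.sleep(1.0)
```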
Yes, I have been looking at benchmarking in the context of GCP with dockerized containers in isolated VMs, because there are way more hidden parameters to deal with on consumer laptops.
Also, my assumption was that GPU rendering would really shine over CPU rendering as I increased the number of Areas, but I did not observe that. Another plus for a benchmark that can change the number of Areas.
@eshvk Correct. It is only ~3% of the V100.
Does Xvfb work with environments that use a camera? When I try to run my env I get:
```
Xlib: extension "NV-GLX" missing on display ":99".
Traceback (most recent call last):
  "The Unity environment took too long to respond. Make sure that :\n"
mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
  The environment does not need user interaction to launch
  The Academy and the External Brain(s) are attached to objects in the Scene
  The environment and the Python interface have compatible versions.
```
@nikola-j Our docker set-up uses xvfb specifically for environments that use cameras. See here for the Dockerfile: https://github.com/Unity-Technologies/ml-agents/blob/master/Dockerfile
@nikola-j Also take a look at the following issue + solution: https://github.com/Unity-Technologies/ml-agents/issues/1574
@awjuliani @markpwoodward Thanks guys, the #1574 solution worked!
Thanks for the suggestion. I've added it to our internal tracker with the ID MLA-73. I’m going to close this issue for now, but we’ll ping back with any updates.
I think it would be good to have a processing benchmark for mlagents. We could then try to improve it through driver/CUDA/Unity flags and code optimization.
An initial benchmark could be the GridWorld environment that ships with the SDK.
grid_world_speed_test.py:
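The original script body isn't reproduced in this thread. As a rough stand-in, a render-only stepping loop against the low-level `mlagents_envs` API might look like the sketch below; exact class and method names vary across ML-Agents releases, so treat this as an outline rather than the actual script.

```python
# Sketch of a render-only GridWorld stepping benchmark (no training).
# Assumes a recent mlagents_envs release; API names differ in older versions.
import time
from mlagents_envs.environment import UnityEnvironment

NUM_STEPS = 1000

env = UnityEnvironment(file_name="GridWorld")  # path to the built environment
env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

start = time.perf_counter()
for _ in range(NUM_STEPS):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    if len(decision_steps) > 0:
        # Random actions: we only care about rendering/stepping throughput.
        env.set_actions(behavior_name,
                        spec.action_spec.random_action(len(decision_steps)))
    env.step()
elapsed = time.perf_counter() - start

env.close()
print(f"{NUM_STEPS / elapsed:.1f} steps/sec ({elapsed:.2f} s total)")
```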
Here are my initial numbers:
```
# GPU based
$ time python grid_world_speed_test.py

# Xvfb based
$ time xvfb-run -s "-screen 0 1024x768x24" python grid_world_speed_test.py
```
A concerning trend is that speeds seem to get slower as the GPU gets more powerful.