CSGO play doesn't work with --compile & slow simulation otherwise

georgysavva commented 1 week ago

Hi, running python src/play.py --compile in the csgo branch doesn't work for me. Here is the end of the log I get:

Fetching 38 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 38/38 [00:00<00:00, 16089.60it/s]
/home/georgy/personal/diamond/src/agent.py:66: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  sd = torch.load(Path(path_to_ckpt), map_location=self.device)
Compiling models...

Environment actions:

w : up
d : right
a : left
s : down
⎵ : jump
left ctrl : crouch
left shift : walk
1 : weapon1
2 : weapon2
3 : weapon3
r : reload
up : camera_up
right : camera_right
left : camera_left
down : camera_down

Controls:

 m  : switch control (human/replay)
 .  : pause/unpause
 e  : step-by-step (when paused)
 ⏎  : reset env
Esc : quit

Press enter to start
E1110 10:36:04.857000 140695155761792 torch/fx/experimental/recording.py:281] [6/0] failed while running evaluate_expr(*(u0, None), **{'fx_node': None})
W1110 10:36:04.868000 140695155761792 torch/_dynamo/exc.py:210] [6/0_1] Backend compiler failed with a fake tensor exception at 
W1110 10:36:04.868000 140695155761792 torch/_dynamo/exc.py:210] [6/0_1]   File "/home/georgy/personal/diamond/src/models/diffusion/inner_model.py", line 48, in torch_dynamo_resume_in_forward_at_48
W1110 10:36:04.868000 140695155761792 torch/_dynamo/exc.py:210] [6/0_1]     assert act.ndim == 2 or (act.ndim == 3 and act.size(2) == self.act_emb[0].num_embeddings and set(act.unique().tolist()).issubset(set([0, 1])))
W1110 10:36:04.868000 140695155761792 torch/_dynamo/exc.py:210] [6/0_1] Adding a graph break.
/home/georgy/miniconda3/envs/diamond/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:150: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.

When running without --compile, it works, but the simulation is very slow and unresponsive (increasing the --fps parameter doesn't help). I'm running it on a Linux machine with four NVIDIA RTX A4000 GPUs and an AMD Ryzen Threadripper PRO 3975WX 32-core CPU. Printing self.PlayEnv.WorldModelEnv.device inside the Game.run() loop returns cuda:0, so I assume the simulation is running on the GPU. What should I do to get the same simulation speed and responsiveness as in the videos on your website?
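One way to check whether the model step itself, rather than the --fps cap, limits responsiveness is to time it directly. The following is only a minimal sketch and not code from the repository: `step_fn` is a hypothetical zero-argument callable that wraps one world-model step, and both helper names are made up for illustration. If each step takes longer than 1 / fps seconds, raising --fps cannot make the game faster.

```python
import time
from statistics import mean


def measure_step_rate(step_fn, n_steps=50, warmup=5):
    """Time a zero-argument callable; return (mean seconds per step, steps per second).

    `step_fn` is a hypothetical stand-in for one world-model step. For GPU work,
    the callable should block until the result is ready (for example by calling
    torch.cuda.synchronize() inside it); otherwise only the kernel launch is timed.
    """
    # Warm-up calls are excluded so one-off setup costs do not skew the average.
    for _ in range(warmup):
        step_fn()

    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)

    avg = mean(durations)
    rate = 1.0 / avg if avg > 0 else float("inf")
    return avg, rate


def achievable_fps(target_fps, seconds_per_step):
    """Frame rate actually reachable when every frame needs one model step."""
    if seconds_per_step <= 0:
        return float(target_fps)
    return min(float(target_fps), 1.0 / seconds_per_step)


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the real environment step.
    avg, rate = measure_step_rate(lambda: time.sleep(0.01), n_steps=20)
    print(f"{avg * 1000:.1f} ms per step, about {rate:.1f} steps/s")
    print(f"achievable at --fps 30: {achievable_fps(30, avg):.1f}")
```

If the measured rate is well below the requested frame rate, the bottleneck is the model step (for example the number of denoising steps), not the display loop.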

AdamJelley commented 1 week ago

Hi @georgysavva! We're not sure why the compilation is failing on your machine. We only tested on RTX 3090/4090, so it could be an issue with compilation on an RTX A4000. But the error message isn't very informative, so it's hard to tell without going deeper into the stack trace. However, if the model is running on the GPU, it should be fairly responsive even without compilation... You could try switching to the fast config as described in the README (it is now the default); hopefully that will increase the speed to an acceptable level given your hardware.

georgysavva commented 1 week ago

Hi, can you please elaborate on how to change the config file? Do I need to train the model after changing it? You said the fast config is the default one, so I guess I am already using it, no?

AdamJelley commented 3 days ago

Hi @georgysavva, to change the config you just need to set the world_model_env value in trainer.yaml to fast, as shown in the current version of the file (the default has changed since you originally posted the issue, so alternatively you could pull the latest changes). All the settings in the fast config can be changed at inference time, without re-training the model. Hope that helps!
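For anyone who prefers to script the change rather than edit the file by hand, here is a rough sketch. It assumes the entry appears on a single line of the form `world_model_env: <name>`, optionally as a `- ` list item; the exact layout of trainer.yaml is not shown in this thread, and the sample text and the `default` value below are hypothetical. The function name is made up for illustration.

```python
import re

# Matches a line such as "  - world_model_env: default" or "world_model_env: default".
# Group 1 keeps the indentation, optional list dash and key; group 2 is the old value.
_ENTRY = re.compile(
    r"^([ \t]*(?:-[ \t]*)?world_model_env:[ \t]*)(\S+)",
    flags=re.MULTILINE,
)


def set_world_model_env(yaml_text, value="fast"):
    """Return yaml_text with the first world_model_env entry set to `value`.

    Works on plain text, so comments and formatting elsewhere in the file are
    left untouched. Raises ValueError if no such entry exists.
    """
    new_text, count = _ENTRY.subn(lambda m: m.group(1) + value, yaml_text, count=1)
    if count == 0:
        raise ValueError("no world_model_env entry found")
    return new_text


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "defaults:\n  - _self_\n  - world_model_env: default\n"
    print(set_world_model_env(sample))
```

Pulling the latest changes, as suggested above, achieves the same result without any editing.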

eloialonso / diamond

CSGO play doesn't work with --compile & slow simulation otherwise #30