autonomousvision / sledge

[ECCV'24] SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic
Apache License 2.0

Why does the program hang at 'worker = build_worker(cfg)'? It takes 10+ hours without any response. #6

Closed caixxuan closed 3 weeks ago

caixxuan commented 1 month ago

Terminal shows:

Screenshot from 2024-09-21 10-21-19

And there is no further output even after waiting 10+ hours, so I had to stop it.

I located the line where it hangs:

`worker = build_worker(cfg)` in run_autoencoder.py.
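
In case it helps with debugging, one way to check whether the hang is inside ray's startup (rather than in data loading) might be to look for ray's background processes from a second terminal. This is a generic sketch, not specific to SLEDGE:

```bash
# While run_autoencoder.py appears stuck, check from another terminal whether
# ray's background processes (raylet / gcs_server) were ever launched.
ps aux | grep -E "raylet|gcs_server" | grep -v grep

# ray also keeps per-session logs under /tmp/ray by default.
ls -lt /tmp/ray/session_latest/logs 2>/dev/null | head
```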

caixxuan commented 1 month ago

I changed the worker to this: `worker: single_machine_thread_pool`

It trains properly now!
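
For reference, the same change can probably be made as a Hydra command-line override instead of editing the config file. This is only a sketch, assuming SLEDGE keeps the nuplan-style `worker` config group and that run_autoencoder.py (mentioned above) is the entry point:

```bash
# Sketch: select the thread-pool worker via a Hydra override on the command line.
# The entry point name is taken from the issue; adjust the path if needed.
python run_autoencoder.py worker=single_machine_thread_pool
```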

caixxuan commented 1 month ago

But I cannot monitor the run with TensorBoard; http://localhost:6006/ does not seem to bring up the TensorBoard page. Big thanks!

DanielDauner commented 1 month ago

Hi

Both issues might be related since the ray library and TensorBoard rely on local network services.

  • The ray library is used in SLEDGE to parallelize (and speed up) the preprocessing and the simulation across threads on your system. You don't need it for training, because PyTorch's dataloader has its own multi-process methods (i.e. by setting num_workers). If you want to use simulation and preprocessing, we still recommend the ray library because of the significantly faster code execution. In your case, the initialization of the ray workers appears to be interrupted in this line. Note that ray has a dashboard which uses the internal network.
  • TensorBoard also uses local networking, typically a localhost address and the default port 6006. Could you verify that these are the designated values? Could you assign a different port number, e.g. by running tensorboard --logdir ./path/to/logs --port 8080?

The issues might depend on your system and settings. I have previously encountered similar problems (with multi-GPU training), where PyTorch's distributed training couldn't resolve local addresses correctly. Are you running the code on a remote machine, a local machine, or a cluster?
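
If the training runs on a remote machine, a quick way to check the port and view TensorBoard locally could look like the following sketch; the host name, user, and log path are placeholders:

```bash
# On the remote machine: check whether something already listens on port 6006,
# then start TensorBoard bound to all interfaces (or pick a free port instead).
ss -ltnp | grep 6006 || true
tensorboard --logdir ./path/to/logs --port 6006 --bind_all

# On the local machine: forward the remote port, then open http://localhost:6006/.
ssh -L 6006:localhost:6006 user@remote-host
```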

Best, Daniel

caixxuan commented 1 month ago

Thanks for your patient answer. I have decided to solve this problem later. I followed your instruction .md, but I ran into a new problem: when I run "bash simple_simulation.sh", the simulation always fails:

Screenshot from 2024-09-21 20-00-36

and there is nothing in /exp/../simulation_log/PDMClosedPlanner/...

Screenshot from 2024-09-21 20-01-04
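
One way to surface the full traceback for failures like this might be to run the simulation without parallel workers, so the first failing scenario prints its error directly. This is a sketch that assumes simple_simulation.sh forwards extra Hydra overrides to the underlying run script and that SLEDGE provides a nuplan-style sequential worker config; if the script does not forward arguments, the same override could be added where the entry point is invoked inside the script:

```bash
# Hypothetical debugging run: disable parallel execution so the failing
# scenario's Python traceback is printed instead of being swallowed by workers.
bash simple_simulation.sh worker=sequential
```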

DanielDauner commented 1 month ago

Hi @caixxuan. At first glance, this looks like a bug. Does the error occur on all generated samples or only in a few scenarios?

Best, Daniel

caixxuan commented 1 month ago

All generated samples.

It should be all generated samples, because there is nothing in /exp/../simulation_log/PDMClosedPlanner/.../log/..., and when I run run_sledgeboard.py, no scenarios appear.
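
To confirm that really nothing was written, one could list whatever the simulation produced under the experiment folder. A sketch; the path below is a placeholder for the truncated /exp/... directory in the screenshots:

```bash
# List any files under the (placeholder) experiment output directory to see
# whether the simulation wrote logs or metrics at all.
find ./exp -path "*simulation_log*" -type f 2>/dev/null | head -n 20
```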

DanielDauner commented 1 month ago

Does the error occur when running the simulation or the sledgeboard?

caixxuan commented 1 month ago

> Does the error occur when running the simulation or the sledgeboard?

The error in the picture is from running the simulation. I think the reason no scenarios appear when running run_sledgeboard.py is related to this error.

caixxuan commented 1 month ago

And I run it on a single PC. Should any other config be changed?

DanielDauner commented 1 month ago

Hi @caixxuan, sorry for the late reply! Is the issue still a problem?

caixxuan commented 1 month ago

> Hi @caixxuan, sorry for the late reply! Is the issue still a problem?

Yes, I have not solved it yet. Do I need to change any configuration on my single PC other than `worker: single_machine_thread_pool`?

DanielDauner commented 1 month ago

So, currently, the issue(s) break down to:

  1. Errors when using `worker=ray_distributed`.
  2. Errors in the simulation, i.e. when propagating the agents.
  3. Errors in sledgeboard, i.e. no scenarios.

Maybe for (1) you could try the fix in this issue. That fix previously worked for me when running ray on a slurm cluster. (2) is a bug, which I will try to fix in an upcoming release. Problem (3) might be related to the failures in the simulation. Nevertheless, I previously had problems when visualizing simulation folders in the cloud from a local computer (i.e. with a mounted directory). This only worked if the folder path was identical on the cloud and the mounted directory.
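
As a generic sanity check for (1), independent of the fix linked above, one could verify that ray itself is able to start on the machine at all. A sketch, assuming ray is installed in the active environment:

```bash
# Minimal ray smoke test: if this also hangs, the problem is ray/networking on
# the machine rather than the SLEDGE code itself.
python -c "import ray; ray.init(num_cpus=2, include_dashboard=False); print(ray.cluster_resources()); ray.shutdown()"
```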

I hope that helps! Daniel

caixxuan commented 1 month ago

Thanks a lot, I will try it again.

DanielDauner commented 3 weeks ago

I am closing this issue for now. Feel free to re-open the issue (or open a new one), if you have further questions!

Best, Daniel