I changed the worker to this: `worker: single_machine_thread_pool` #caixuan
Now it can train properly!
But I cannot monitor the run with TensorBoard. http://localhost:6006/ does not seem to bring up TensorBoard. Big thanks
Hi
Both issues might be related since the ray library and tensorboard rely on local network services.
- The ray library is used in SLEDGE to parallelize (and speed up) the preprocessing and the simulation across threads in your system. You don't need it for training because PyTorch's dataloader has its own multi-process methods implemented (i.e. by setting `num_workers`; see the sketch after this list). If you want to use simulation and preprocessing, we still recommend using the ray library because of significantly faster code execution. In your case, the initialization of ray workers appears to be interrupted in this line. Note that ray has a dashboard which uses the internal network.
- Tensorboard also uses local networking, typically with a `localhost` address and a default port number `6006`. Could you verify these are the designated values? Could you assign a different port number, i.e. by running `tensorboard --logdir ./path/to/logs --port 8080`?
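As an aside on the `num_workers` point above: this is a minimal, hypothetical sketch (not taken from the SLEDGE codebase; the dataset and batch size are placeholders) showing how a plain PyTorch dataloader spawns its own worker processes, independently of ray.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the features/targets used in training.
dataset = TensorDataset(torch.randn(128, 8), torch.randn(128, 1))

# num_workers > 0 makes the DataLoader spawn its own worker processes,
# so the training loop does not need ray at all.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=4)

for features, targets in loader:
    pass  # forward/backward pass would go here
```

Setting `num_workers=0` loads data in the main process instead, which is a quick way to rule out multi-processing problems on a single PC.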
The issues might be dependent on your system and settings. I have previously encountered similar problems (with multi-GPU training), where PyTorch's distributed training couldn't resolve local addresses correctly. Are you running the code on a remote machine, local machine, or cluster?
Best, Daniel
Thanks for your patient answer. I have decided to solve this problem later. I have followed your instruction .md, but I got a new problem: when I run `bash simple_simulation.sh`, the simulation always fails:
And there is nothing in /exp/../simulation_log/PDMClosedPlanner/...
Hi @caixxuan. At first glance, this looks like a bug. Does the error occur on all generated samples or only in a few scenarios?
Best, Daniel
All generated samples.
It should be all generated samples, because there is nothing in /exp/../simulation_log/PDMClosedPlanner/.../log/..., and no scenarios appear when I run run_sledgeboard.py.
Does the error occur when running the simulation or the sledgeboard?
The error in the picture occurs when running the simulation. I think the fact that no scenarios appear when running run_sledgeboard.py is related to that error.
And I run it on a single PC. Should any other config be changed?
Hi @caixxuan, sorry for the late reply! Is the issue still a problem?
Yes, I have not solved it yet. Do I need to change any configurations on my single PC other than `worker: single_machine_thread_pool`?
So, currently, the issue(s) break down to:
1. Errors when using `worker=ray_distributed`.
2. Errors in the simulation, i.e. when propagating the agents.
3. Errors in sledgeboard, i.e. no scenarios appearing.

Maybe for (1) you could try the following in this issue; that fix previously worked for me when running `ray` on a slurm cloud (a quick local sanity check is sketched below). (2) is a bug, which I will try to fix in an upcoming release. The problem (3) might be related to the failures in the simulation. Nevertheless, I previously had problems when visualizing simulation folders in the cloud from a local computer (i.e. with a mounted directory). This only worked if the folder path was equal on the cloud and the mounted directory.
I hope that helps! Daniel
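As an aside on point (1): the sketch below (assuming only that `ray` is installed; this is not the fix linked above) checks whether ray can initialize and run a task locally at all. `include_dashboard=False` disables the ray dashboard mentioned earlier, which is one of the local network services that can fail on some setups.

```python
import ray

# Start a minimal local ray instance, limited to a few CPUs and without the dashboard.
# If this hangs or raises, the problem lies in ray / local networking itself,
# not in the SLEDGE code that builds the ray_distributed worker.
ray.init(num_cpus=2, include_dashboard=False)

@ray.remote
def ping(x):
    return x + 1

# A trivial remote call confirms that workers actually execute tasks.
print(ray.get(ping.remote(1)))  # expected output: 2

ray.shutdown()
```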
Thanks a lot, I will try it again.
I am closing this issue for now. Feel free to re-open the issue (or open a new one), if you have further questions!
Best, Daniel
Terminal shows:
And there is no other response, even though I have waited for 10+ hours; I had to stop it.
I can locate the code:
`worker = build_worker(cfg)` in run_autoencoder.py. This is the reason.
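To see where a run like this is actually stuck (e.g. inside `worker = build_worker(cfg)`), one generic option is Python's standard `faulthandler` module. This is a hypothetical diagnostic sketch, not part of run_autoencoder.py, and the 60-second interval is an arbitrary choice.

```python
import faulthandler
import sys

# Periodically dump the stack traces of all threads to stderr while the
# program keeps running, so the blocking line becomes visible.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)

# ... the blocking call under investigation (e.g. building the worker) would go here ...

# Once past the blocking call, cancel the periodic dumps.
faulthandler.cancel_dump_traceback_later()
```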