Do you see GPU utilization when running in local mode?
Also, you don't need to rewrite anything to debug... simply set `num_runners=1` and `local_mode=True` when initializing Ray.
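A minimal sketch of the Ray side of that suggestion (`local_mode` is a real `ray.init` argument; `num_runners` is the script's own flag, so how it gets plumbed through is an assumption here):

```python
import ray

# local_mode=True executes all tasks and actors serially in the driver
# process, so prints and pdb breakpoints behave like ordinary Python.
ray.init(local_mode=True)
```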
Hi @aaronh65, do you have any news on this? I just tried this on my machine with ray enabled and had no problem. If you can run it in local mode, that suggests it might be a ray issue, and we can look further into it.
Hey @dotchen, didn't have a chance to work on this over the weekend but I'll look at it in the next few days - I'll let you know what happens!
Thanks for the debug suggestion! I haven't used ray before so that's helpful :)
@dotchen I messed around with it today and got some strange behavior, described briefly below. I'm running `RAY_PDB=1 python -m rails.data_phase2 --num-workers=1` (btw, RAILS.md tells users to pass a `--num-runners` argument rather than the correct `--num-workers` argument for this phase).
1. With ray `local_mode=False`: the actor dies as described in the original post.
2. With ray `local_mode=True`: produces the following error:
(wor) aaron@Aarons-Machine:/data/aaronhua/wor/data/main$ ray debug
2021-06-01 15:56:56,813 INFO scripts.py:193 -- Connecting to Ray instance at 192.168.1.138:6379.
2021-06-01 15:56:56,814 INFO worker.py:657 -- Connecting to existing Ray cluster at address: 192.168.1.138:6379
Active breakpoints:
0: python -m rails.data_phase2 --num-workers=1 | /home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/actor.py:677
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 456, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "/home/aaron/workspace/carla/WorldOnRails/rails/rails.py", line 242, in __init__
self._rails = RAILS(args)
File "/home/aaron/workspace/carla/WorldOnRails/rails/rails.py", line 27, in __init__
self.ego_model = EgoModel(dt=1./args.fps*(args.num_repeat+1)).to(args.device)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to
return self._apply(convert)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
param_applied = fn(param)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
Enter breakpoint index or press enter to refresh:
3. With ray `local_mode=True`: the only thing that changed here is that I printed out `torch.cuda.is_available()` right at the beginning of `rails.data_phase2`'s `__main__` function (obviously to debug the above). For some reason, this makes it work, and I successfully ran the script on a toy dataset of about 1000 frames in 2-3 minutes. See here - https://wandb.ai/aaronhuang/carla_data_phase2/runs/5flpwvwk?workspace=user-aaronhuang
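For concreteness, the workaround amounted to something like this (placement is illustrative; the real `__main__` body lives in `rails/data_phase2.py`):

```python
# Sketch of the accidental workaround: touching CUDA in the driver
# process before ray spawns any actors.
import torch

if __name__ == '__main__':
    print(torch.cuda.is_available())  # the stray debug print that "fixes" it
    # ... argument parsing and main(args) as in the repo ...
```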
Can you try running the command with the prefix `CUDA_VISIBLE_DEVICES=0,1,2,3,..`?
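(For context, that prefix only restricts which GPUs CUDA enumerates; a minimal Python equivalent, assuming it is set before torch initializes CUDA:)

```python
import os

# Must be set before the first CUDA call, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.is_available())  # True if GPU 0 is usable
```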
Adding `CUDA_VISIBLE_DEVICES=0` makes `local_mode=True` run correctly on my local 1-GPU machine (as I just saw here) without the weird `torch` call I had to throw in! Turning off local mode still shows the actor unexpectedly died error, however.
Hmm, what if you do `local_mode=False` and `num_runners=1`?
That produces the behavior I describe at the end of my last comment - the ray actor still dies. The command I run is `RAY_PDB=1 python -m rails.data_phase2 --num-workers=1` with `local_mode=False` in `ray.init`.
This is super odd, can you try tuning the `ray.remote` decorator on the action labeler? e.g. make `num_cpus` and `num_gpus` larger. Also, what are your ray version and system specs?
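A sketch of what that tuning looks like (the class name is a placeholder, not the repo's actual labeler):

```python
import ray

# Reserve more resources per actor; the numbers are just values to
# experiment with, not recommendations.
@ray.remote(num_cpus=2, num_gpus=1)
class ActionLabelerSketch:
    def __init__(self, args):
        self.args = args
```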
I'll play around with the `ray.remote` decorator parameters. My ray version (from `conda list | grep ray`) is 1.1.0, which matches `environment.yaml`. I threw in some print statements for debugging, and it looks like the issue occurs before the loop that creates the `num_workers` action labelers, specifically at `ray.get(dataset.num_frames.remote())`. See the screenshots and output below. Not sure exactly what's going on, but it looks like an issue in the `RemoteMainDataset` constructor?
(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$ RAY_PDB=1 python -m rails.data_phase2 --num-workers=1
start
after logger
after remote main dataset
Traceback (most recent call last):
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 68, in <module>
main(args)
File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 16, in main
total_frames = ray.get(dataset.num_frames.remote())
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-06-02 11:51:33,349 WARNING worker.py:1034 -- Traceback (most recent call last):
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 251, in get_execution_info
info = self._function_execution_info[job_id][function_id]
KeyError: FunctionID(6988ca09b595b4bd83e008377797363da1e47172)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 550, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 364, in ray._raylet.execute_task
File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 256, in get_execution_info
raise KeyError(message)
KeyError: 'Error occurs in get_execution_info: job_id: JobID(01000000), function_descriptor: {type=PythonFunctionDescriptor, module_name=rails.datasets.main_dataset, class_name=RemoteMainDataset, function_name=__init__, function_hash=653032be-c972-4a42-8956-b7596068a22d}. Message: FunctionID(6988ca09b595b4bd83e008377797363da1e47172)'
An unexpected internal error occurred while the worker was executing a task.
2021-06-02 11:51:33,349 WARNING worker.py:1034 -- A worker died or was killed while executing task ffffffffffffffffcb230a5701000000.
(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$
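For reference, the failing call follows the usual ray actor pattern, roughly as sketched below (the module and class names come from the traceback; the constructor argument is an assumption):

```python
import ray
from rails.datasets.main_dataset import RemoteMainDataset

def count_frames(args):
    # An actor's __init__ runs lazily in a worker process, so a crash
    # there surfaces only as "The actor died unexpectedly before finishing
    # this task" at the first ray.get() on one of its methods.
    dataset = RemoteMainDataset.remote(args)  # constructor args assumed
    return ray.get(dataset.num_frames.remote())
```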
Now I feel like this could be a pytorch version issue. Can you try this with `pytorch==1.4.0`?
Okay, I am pretty sure it's pytorch... Tried with `1.8.1` and I am seeing the same error as you, but `1.4.0` is good. They probably changed how the parallel stuff works in the later versions. I will change `environment.yaml` to enforce the version then...
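(A quick sanity check after recreating the environment, using the versions discussed in this thread:)

```python
import ray
import torch

print(torch.__version__)  # expect 1.4.0 per this thread
print(ray.__version__)    # expect 1.1.0, matching environment.yaml
```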
Thanks a lot for reporting this! This is indeed a very sneaky issue.
BTW feel free to reopen this if this is not the case for you @aaronh65
That worked! Thanks a ton :)
Hey Dian,
Trying to run `data_phase2`, I get the following Ray error (seems to be an issue with the `RemoteMainDataset` constructor?). I did some debugging by replacing all the `@ray.remote` stuff and `.remote()` calls with the non-ray versions (sketched below), and the code runs with no issue (although the progress bar didn't progress past 0 frames after a minute or two; not quite sure if it's supposed to take that long or not). Did you ever see anything like this / know what I should do?
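A minimal illustration of that de-ray debugging swap (generic names, not the repo's actual classes):

```python
import ray

class Worker:
    def ping(self):
        return "pong"

# ray version: wrap the class into an actor and call through .remote()
ray.init()
RemoteWorker = ray.remote(Worker)
w = RemoteWorker.remote()
print(ray.get(w.ping.remote()))

# non-ray version (the debugging swap): instantiate and call directly
w_local = Worker()
print(w_local.ping())
```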