dotchen / WorldOnRails

(ICCV 2021, Oral) RL and distillation in CARLA using a factorized world model
https://dotchen.github.io/world_on_rails/
MIT License
167 stars · 29 forks

data_phase2 ray actor dies #6

Closed aaronh65 closed 3 years ago

aaronh65 commented 3 years ago

Hey Dian,

Trying to run data_phase2, I get the following Ray error (it seems to be an issue with the RemoteMainDataset constructor?). As a debugging step I replaced all the @ray.remote decorators and .remote() calls with their non-Ray equivalents (roughly as sketched after the traceback below), and the code then runs with no issue (although the progress bar didn't progress past 0 frames after a minute or two; I'm not quite sure if it's supposed to take that long or not).

Did you ever see anything like this/know what I should do?

(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$ RAY_PDB=1 python -m rails.data_phase2 --num-workers=12
Traceback (most recent call last):
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 193, in _run_module_as_main
2021-05-29 14:45:49,862 WARNING worker.py:1034 -- Traceback (most recent call last):
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 251, in get_execution_info
    info = self._function_execution_info[job_id][function_id]
KeyError: FunctionID(41f68a98bcf1c9ebc84e01b0819040089631493c)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 550, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 364, in ray._raylet.execute_task
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 256, in get_execution_info
    raise KeyError(message)
KeyError: 'Error occurs in get_execution_info: job_id: JobID(01000000), function_descriptor: {type=PythonFunctionDescriptor, module_name=rails.datasets.main_dataset, class_name=RemoteMainDataset, function_name=__init__, function_hash=084f10af-7af1-46d7-8dda-ada171c2aad9}. Message: FunctionID(41f68a98bcf1c9ebc84e01b0819040089631493c)'
An unexpected internal error occurred while the worker was executing a task.
    "__main__", mod_spec)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 85, in _run_code
2021-05-29 14:45:49,862 WARNING worker.py:1034 -- A worker died or was killed while executing task ffffffffffffffffcb230a5701000000.
    exec(code, run_globals)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 67, in <module>
    main(args)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 13, in main
    total_frames = ray.get(dataset.num_frames.remote())
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$
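
For reference, the de-Ray'd debug change mentioned above amounted to roughly the following (a hypothetical reconstruction, not the repo's exact code; only RemoteMainDataset, num_frames and ray.get come from the traceback, the rest are placeholders):

# Original (Ray actor) path in rails/data_phase2.py:
#   dataset = RemoteMainDataset.remote(...)                 # class decorated with @ray.remote
#   total_frames = ray.get(dataset.num_frames.remote())
#
# Debug path with Ray stripped out:
#   dataset = MainDataset(...)                              # same class with the decorator removed
#   total_frames = dataset.num_frames()                     # direct method call instead of ray.get(...)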
dotchen commented 3 years ago

Do you see GPU utilization when running in local mode?

Also you don't need to rewrite anything to debug... simply set the num_runners=1 and local_mode=True when initializing Ray.
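
(For anyone following along: local_mode is an argument to ray.init, not a script flag. A minimal sketch of the suggested debug setup, assuming the ray.init call lives in rails/data_phase2.py and may take other arguments as well:)

import ray

# Run all tasks and actors serially inside the driver process so that
# breakpoints and stack traces behave like ordinary Python; combine with a
# single worker/runner so only one labeler gets created.
ray.init(local_mode=True)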

dotchen commented 3 years ago

Hi @aaronh65, do you have any news on this? I just tried this on my machine with Ray enabled and had no problem. If you can run it in local mode, that suggests it might be a Ray issue and we can look further into it.

aaronh65 commented 3 years ago

Hey @dotchen, didn't have a chance to work on this over the weekend but I'll look at it in the next few days - I'll let you know what happens!

Thanks for the debug suggestion! I haven't used ray before so that's helpful :)

aaronh65 commented 3 years ago

@dotchen messed around with it today and ran into some strange behavior, described briefly below. I'm running RAY_PDB=1 python -m rails.data_phase2 --num-workers=1 (btw, RAILS.md tells users to pass a --num-runners argument rather than the correct --num-workers argument for this phase)

1. With ray local_mode=False: the actor dies as described in the original post.

2. With ray local_mode=True: produces the following error

(wor) aaron@Aarons-Machine:/data/aaronhua/wor/data/main$ ray debug
2021-06-01 15:56:56,813 INFO scripts.py:193 -- Connecting to Ray instance at 192.168.1.138:6379.
2021-06-01 15:56:56,814 INFO worker.py:657 -- Connecting to existing Ray cluster at address: 192.168.1.138:6379
Active breakpoints:
0: python -m rails.data_phase2 --num-workers=1 | /home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/actor.py:677
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 456, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 556, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/rails.py", line 242, in __init__
    self._rails = RAILS(args)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/rails.py", line 27, in __init__
    self.ego_model  = EgoModel(dt=1./args.fps*(args.num_repeat+1)).to(args.device)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

Enter breakpoint index or press enter to refresh: 

3. With ray local_mode=True, plus one change: I printed torch.cuda.is_available() right at the beginning of rails.data_phase2's __main__ function (obviously to debug the above). For some reason this makes it work, and I successfully ran the script on a toy dataset of about 1000 frames in 2-3 minutes. See here - https://wandb.ai/aaronhuang/carla_data_phase2/runs/5flpwvwk?workspace=user-aaronhuang
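
(The workaround in item 3 amounts to something like the following at the top of the __main__ block of rails/data_phase2.py; the exact placement is an assumption based on the description:)

import torch

# Assumed debug line; merely touching CUDA in the driver process before Ray
# spawns the actor made the run succeed on this machine.
print(torch.cuda.is_available())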

dotchen commented 3 years ago

Can you try running the command using the prefix CUDA_VISIBLE_DEVICES=0,1,2,3,..?
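
(On a single-GPU machine that would look something like CUDA_VISIBLE_DEVICES=0 RAY_PDB=1 python -m rails.data_phase2 --num-workers=1, i.e. the same command as before with the environment variable prefixed.)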

aaronh65 commented 3 years ago

Adding CUDA_VISIBLE_DEVICES=0 makes local_mode=True run correctly on my local 1-GPU machine (as I just saw here), without the weird torch call I had to throw in! Turning off local mode still gives the "actor died unexpectedly" error, however.

dotchen commented 3 years ago

Hmm what if you do 'local_mode=False' and 'num_runners=1'?

aaronh65 commented 3 years ago

That gives the behavior I described at the end of my last comment - the Ray actor still dies. The command I run is RAY_PDB=1 python -m rails.data_phase2 --num-workers=1, with local_mode=False in ray.init.

dotchen commented 3 years ago

This is super odd. Can you try tuning the ray.remote decorator on the action labeler, e.g. making num_cpus and num_gpus larger? Also, what are your Ray version and system specs?
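
(A sketch of the kind of change being suggested; ActionLabeler is a placeholder name and the resource numbers are examples, not the repo's actual values - look for the @ray.remote-decorated labeler class used by rails.data_phase2:)

import ray

# Reserve more resources per labeler actor so the scheduler gives it a full CPU/GPU.
@ray.remote(num_cpus=2, num_gpus=1)
class ActionLabeler:
    def __init__(self, args):
        ...  # placeholder body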

aaronh65 commented 3 years ago

I'll play around with the ray.remote decorator parameters. My ray version (from conda list | grep ray) is 1.1.0, which matches environment.yaml. I threw in some print statements for debugging, and it looks like the issue occurs before you enter the loop that creates the num_workers action labelers, specifically at ray.get(dataset.num_frames.remote()). See the screenshot and output below. Not sure exactly what's going on, but it looks like an issue in the RemoteMainDataset constructor?

[screenshot: rails/data_phase2.py main() with the debug print statements added]
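
(Reconstructed from the output below, the prints were placed roughly like this inside main() of rails/data_phase2.py; an approximation of the screenshot, not the repo's exact code:)

print('start')
logger = ...                                          # logger / wandb setup
print('after logger')
dataset = RemoteMainDataset.remote(...)               # Ray actor construction
print('after remote main dataset')
total_frames = ray.get(dataset.num_frames.remote())   # <- the actor dies servicing this call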

(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$ RAY_PDB=1 python -m rails.data_phase2 --num-workers=1
start
after logger
after remote main dataset
Traceback (most recent call last):
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 68, in <module>
    main(args)
  File "/home/aaron/workspace/carla/WorldOnRails/rails/data_phase2.py", line 16, in main
    total_frames = ray.get(dataset.num_frames.remote())
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/worker.py", line 1381, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
2021-06-02 11:51:33,349 WARNING worker.py:1034 -- Traceback (most recent call last):
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 251, in get_execution_info
    info = self._function_execution_info[job_id][function_id]
KeyError: FunctionID(6988ca09b595b4bd83e008377797363da1e47172)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 550, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 364, in ray._raylet.execute_task
  File "/home/aaron/anaconda3/envs/wor/lib/python3.7/site-packages/ray/function_manager.py", line 256, in get_execution_info
    raise KeyError(message)
KeyError: 'Error occurs in get_execution_info: job_id: JobID(01000000), function_descriptor: {type=PythonFunctionDescriptor, module_name=rails.datasets.main_dataset, class_name=RemoteMainDataset, function_name=__init__, function_hash=653032be-c972-4a42-8956-b7596068a22d}. Message: FunctionID(6988ca09b595b4bd83e008377797363da1e47172)'
An unexpected internal error occurred while the worker was executing a task.
2021-06-02 11:51:33,349 WARNING worker.py:1034 -- A worker died or was killed while executing task ffffffffffffffffcb230a5701000000.
(wor) aaron@Aarons-Machine:~/workspace/carla/WorldOnRails$ 
dotchen commented 3 years ago

Now I feel like this could be a PyTorch version issue. Can you try this with pytorch==1.4.0?

dotchen commented 3 years ago

Okay, I am pretty sure it's PyTorch... I tried with 1.8.1 and I see the same error as you, but 1.4.0 is fine. They probably changed how the parallelism/CUDA initialization works in later versions. I will change environment.yaml to enforce the version then...

Thanks a lot for reporting this! This is indeed a very sneaky issue.
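
(For anyone landing here later: the fix is to pin PyTorch to 1.4.0 in the conda environment, e.g. something along the lines of conda install pytorch==1.4.0 torchvision==0.5.0 cudatoolkit=10.1 -c pytorch, but check the updated environment.yaml for the exact versions the repo supports.)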

dotchen commented 3 years ago

BTW feel free to reopen this if this is not the case for you @aaronh65

aaronh65 commented 3 years ago

That worked! Thanks a ton :)