dotchen / WorldOnRails

(ICCV 2021, Oral) RL and distillation in CARLA using a factorized world model
https://dotchen.github.io/world_on_rails/
MIT License
166 stars, 29 forks

What's the right way to stop data_phase runs when enough data has been generated? #11

Closed by aaronh65 3 years ago

aaronh65 commented 3 years ago

Hey Dian,

It's me again. I have a question about how to correctly stop the data_phase scripts once enough data has been collected. My current workflow is to spin up CARLA servers with launch_carla.sh and then run, for example, python -m data_phase1. I watch the target directory that the data is written to and call ray stop once enough has been generated, but I'm not sure whether this is okay.

I'm asking because I collected a data_phase1 dataset and ran data_phase2, but got the following:

(wor) [aaronhua@trinity-0-11 WorldOnRails]$ python -m rails.data_phase2 --num-workers=4
Traceback (most recent call last):
  File "/home/aaronhua/anaconda3/envs/wor/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/aaronhua/anaconda3/envs/wor/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/aaronhua/WorldOnRails/rails/data_phase2.py", line 59, in <module>
    main(args)
  File "/home/aaronhua/WorldOnRails/rails/data_phase2.py", line 13, in main
    total_frames = ray.get(dataset.num_frames.remote())
  File "/home/aaronhua/anaconda3/envs/wor/lib/python3.7/site-packages/ray/worker.py", line 1379, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::RemoteMainDataset.num_frames() (pid=13194, ip=10.1.1.11)
  File "python/ray/_raylet.pyx", line 422, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 456, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.execute_task.function_executor
  File "/home/aaronhua/WorldOnRails/rails/datasets/main_dataset.py", line 216, in __init__
    super().__init__(*args, **kwargs)
  File "/home/aaronhua/WorldOnRails/rails/datasets/main_dataset.py", line 123, in __init__
    n = int(txn.get('len'.encode()))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

I've double-checked that the main_data_dir specified in config.yaml and the data-dir argument in rails/data_phase2 point to the correct directory, and each of the runs within the data directory is at least a few tens of megabytes. Strangely, I was previously just Ctrl-C'ing the processes, which seemed to work fine (I'm currently running train_phase2 on a different set of data with no issue). I was under the impression that ray stop would be the "correct" way to stop the data processes.

dotchen commented 3 years ago

This can happen if you terminate while the workers are in the middle of saving data to disk: the lmdb files become corrupted. You can print each file path in the glob for loop to see which ones fail, then simply delete the corrupted trajectories.
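A minimal sketch of that check as a standalone script, in case it is useful. It assumes each trajectory is its own lmdb directory directly under the data directory and that a healthy one stores its frame count under the 'len' key, as the traceback suggests. The helper names read_len and find_corrupted are hypothetical and not part of the repo.

```python
import glob
import os


def read_len(path):
    """Return the frame count stored under 'len', or None if it cannot be read."""
    import lmdb  # third-party (pip install lmdb); imported here so the scan logic stays importable

    try:
        env = lmdb.open(path, readonly=True, lock=False, create=False)
    except lmdb.Error:
        return None
    try:
        with env.begin(write=False) as txn:
            raw = txn.get('len'.encode())
    except lmdb.Error:
        return None
    finally:
        env.close()
    # A missing key is exactly what triggers the TypeError in the traceback.
    return None if raw is None else int(raw)


def find_corrupted(data_dir, reader=read_len):
    """List trajectory directories under data_dir whose 'len' entry is unreadable."""
    corrupted = []
    # Assumption: one lmdb directory per trajectory, directly under data_dir.
    for path in sorted(glob.glob(os.path.join(data_dir, '*'))):
        if os.path.isdir(path) and reader(path) is None:
            corrupted.append(path)
    return corrupted
```

Calling find_corrupted on the directory set as main_data_dir and printing the result shows which trajectories to delete. The sketch only reports paths and leaves the deletion to you, so nothing is removed by accident.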

aaronh65 commented 3 years ago

Makes sense, and the suggested fix works. Closing the issue.