autonomousvision / tuplan_garage

[CoRL'23] Parting with Misconceptions about Learning-based Vehicle Motion Planning

Ray problem before the start of training #14

Closed Strauss-Wen closed 1 year ago

Strauss-Wen commented 1 year ago

Problem

Hello. I have set up the nuplan environment, installed tuplan_garage as a package, and followed every preparation step in the readme.md. However, when I try to train the model, I hit a fatal Ray error. Each time the 'Ray objects' step finishes, Ray fails to start the dashboard shortly afterwards, and the program runs 'Ray objects' again. Because the Ray instance fails to initialize, no log file records the error. I found a similar issue here, but it was of little help. Thank you for your assistance.
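For anyone trying to find the missing logs: the Ray messages in the output of this report name `dashboard.log` and `dashboard.err` under `/tmp/ray/session_*/logs/`. The sketch below lists any such files with their sizes, so an empty or absent file is easy to spot. The function name `list_dashboard_logs` is made up for this example, and the default root `/tmp/ray` is taken from the paths in the error messages; adjust it if Ray uses a different temp directory on your machine.

```shell
#!/usr/bin/env bash
# Minimal sketch: list Ray dashboard log files (with sizes) under a session root.
# list_dashboard_logs is a hypothetical helper name, not part of Ray or nuplan.
list_dashboard_logs() {
    # Default root is the directory named in the error messages of this report.
    local root="${1:-/tmp/ray}"
    if [ ! -d "$root" ]; then
        echo "no such directory: $root" >&2
        return 1
    fi
    # Match dashboard.log, dashboard.err, etc. inside every session's logs folder.
    find "$root" -path '*/session_*/logs/dashboard*' -type f -exec ls -l {} +
}

# Example call (commented out so that sourcing this file has no side effects):
# list_dashboard_logs /tmp/ray
```

A size of 0 for `dashboard.err` would match the "cannot mmap an empty file" message in the output.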

Reproduce

Run the script below with bash:

TRAIN_EPOCHS=100
TRAIN_LR=1e-4
TRAIN_LR_MILESTONES=[50,75]
TRAIN_LR_DECAY=0.1
BATCH_SIZE=64
SEED=0

JOB_NAME=training_pdm_open_model
CACHE_PATH=/mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/cache
USE_CACHE_WITHOUT_DATASET=False

source ~/.bashrc
conda activate nuplan
python $NUPLAN_DEVKIT_ROOT/nuplan/planning/script/run_training.py \
seed=$SEED \
py_func=train \
+training=training_pdm_open_model \
job_name=$JOB_NAME \
scenario_builder=nuplan \
cache.cache_path=$CACHE_PATH \
cache.use_cache_without_dataset=$USE_CACHE_WITHOUT_DATASET \
lightning.trainer.params.max_epochs=$TRAIN_EPOCHS \
data_loader.params.batch_size=$BATCH_SIZE \
optimizer.lr=$TRAIN_LR \
lr_scheduler=multistep_lr \
lr_scheduler.milestones=$TRAIN_LR_MILESTONES \
lr_scheduler.gamma=$TRAIN_LR_DECAY \
hydra.searchpath="[pkg://tuplan_garage.planning.script.config.common, pkg://tuplan_garage.planning.script.config.training, pkg://tuplan_garage.planning.script.experiments, pkg://nuplan.planning.script.config.common, pkg://nuplan.planning.script.experiments]"

Output

Global seed set to 0
2023-09-16 11:08:48,865 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:08:48,868 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.08.46
2023-09-16 11:08:48,868 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:08:48,870 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:08:52,865 INFO worker.py:1621 -- Started a local Ray instance.
2023-09-16 11:08:58,481 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:101}  Worker: RayDistributed
2023-09-16 11:08:58,482 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:102}  Number of nodes: 1
Number of CPUs per node: 96
Number of GPUs per node: 8
Number of threads across all nodes: 96
2023-09-16 11:08:58,482 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:27}  Building WorkerPool...DONE!
2023-09-16 11:08:58,482 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/training/experiments/training.py:41}  Building training engine...
2023-09-16 11:08:58,483 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:18}  Building TorchModuleWrapper...
2023-09-16 11:08:59,487 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:21}  Building TorchModuleWrapper...DONE!
2023-09-16 11:08:59,488 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:18}  Building Splitter...
2023-09-16 11:09:00,464 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:21}  Building Splitter...DONE!
2023-09-16 11:09:00,465 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:18}  Building AbstractScenarioBuilder...
2023-09-16 11:09:00,988 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:21}  Building AbstractScenarioBuilder...DONE!
2023-09-16 11:09:00,988 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:35}  Building ScenarioFilter...
2023-09-16 11:09:00,989 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:44}  Building ScenarioFilter...DONE!
Ray objects: 100%|██████████████████████████████████████████████████████████████████████████████| 96/96 [13:16<00:00,  8.29s/it]
2023-09-16 11:22:25,347 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_builder.py:171}  Extracted 177435 scenarios for training
2023-09-16 11:22:25,347 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:258}  WORLD_SIZE was not set.
2023-09-16 11:22:25,348 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:266}  PytorchLightning Trainer gpus was set to -1, finding number of GPUs used from torch.cuda.device_count().
2023-09-16 11:22:25,348 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:277}  Number of gpus found to be in use: 8
2023-09-16 11:22:25,348 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:114}  World size: 8
2023-09-16 11:22:25,348 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:115}  Learning rate before: 0.0001
2023-09-16 11:22:25,348 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:119}  Scaling method: Equal Variance
2023-09-16 11:22:25,349 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:141}  Betas after scaling: [0.7422979694372631, 0.9971741579476155]
2023-09-16 11:22:25,349 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:143}  Learning rate after scaling: 0.000282842712474619
2023-09-16 11:22:25,478 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:172}  Updating Learning Rate Scheduler Config...
2023-09-16 11:22:25,478 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:258}  WORLD_SIZE was not set.
2023-09-16 11:22:25,478 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:266}  PytorchLightning Trainer gpus was set to -1, finding number of GPUs used from torch.cuda.device_count().
2023-09-16 11:22:25,479 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:277}  Number of gpus found to be in use: 8
2023-09-16 11:22:25,479 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:199}  Updating torch.optim.lr_scheduler.MultiStepLR in ddp setting is not yet supported. Learning rate scheduler config will not be updated.
2023-09-16 11:22:25,479 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/utils/utils_config.py:245}  Optimizer and LR Scheduler configs updated according to ddp strategy.
2023-09-16 11:22:25,503 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/training_callback_builder.py:19}  Building callbacks...
2023-09-16 11:22:25,538 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/training_callback_builder.py:37}  Building callbacks...DONE!
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
2023-09-16 11:22:25,539 INFO {/home/linqing.zhao/nuplan-devkit//nuplan/planning/script/run_training.py:62}  Starting training...
Global seed set to 0
2023-09-16 11:22:39,118 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:22:39,121 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.22.38
2023-09-16 11:22:39,121 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:22:39,123 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
Global seed set to 0
2023-09-16 11:22:41,279 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:22:41,281 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.22.40
2023-09-16 11:22:41,281 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:22:41,283 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:22:42,819 INFO worker.py:1621 -- Started a local Ray instance.
Global seed set to 0
2023-09-16 11:22:45,132 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:22:45,138 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.22.44
2023-09-16 11:22:45,138 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:22:45,140 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:22:45,659 INFO worker.py:1621 -- Started a local Ray instance.
Global seed set to 0
2023-09-16 11:22:49,560 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:101}  Worker: RayDistributed
2023-09-16 11:22:49,560 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:102}  Number of nodes: 1
Number of CPUs per node: 96
Number of GPUs per node: 8
Number of threads across all nodes: 96
2023-09-16 11:22:49,561 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:27}  Building WorkerPool...DONE!
2023-09-16 11:22:49,561 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/training/experiments/training.py:41}  Building training engine...
2023-09-16 11:22:49,561 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:18}  Building TorchModuleWrapper...
2023-09-16 11:22:49,782 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:22:49,784 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.22.49
2023-09-16 11:22:49,785 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:22:49,787 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:22:50,106 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:21}  Building TorchModuleWrapper...DONE!
2023-09-16 11:22:50,106 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:18}  Building Splitter...
Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
2023-09-16 11:22:51,378 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:21}  Building Splitter...DONE!
2023-09-16 11:22:51,379 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:18}  Building AbstractScenarioBuilder...
2023-09-16 11:22:51,571 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:21}  Building AbstractScenarioBuilder...DONE!
2023-09-16 11:22:51,571 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:35}  Building ScenarioFilter...
2023-09-16 11:22:51,573 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:44}  Building ScenarioFilter...DONE!
Ray objects:   0%|                                                                      | 0/96 [00:00<?, ?it/s]2023-09-16 11:22:54,601 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:101}  Worker: RayDistributed
2023-09-16 11:22:54,601 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_pool.py:102}  Number of nodes: 1
Number of CPUs per node: 96
Number of GPUs per node: 8
Number of threads across all nodes: 96
2023-09-16 11:22:54,602 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:27}  Building WorkerPool...DONE!
2023-09-16 11:22:54,602 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/training/experiments/training.py:41}  Building training engine...
2023-09-16 11:22:54,602 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:18}  Building TorchModuleWrapper...
Global seed set to 0
2023-09-16 11:22:55,184 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/model_builder.py:21}  Building TorchModuleWrapper...DONE!
2023-09-16 11:22:55,184 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:18}  Building Splitter...
2023-09-16 11:22:55,567 INFO worker.py:1621 -- Started a local Ray instance.
2023-09-16 11:22:55,599 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:22:55,607 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.22.55
2023-09-16 11:22:55,608 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:22:55,610 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:22:55,752 INFO worker.py:1621 -- Started a local Ray instance.
2023-09-16 11:22:56,570 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/splitter_builder.py:21}  Building Splitter...DONE!
2023-09-16 11:22:56,571 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:18}  Building AbstractScenarioBuilder...
2023-09-16 11:22:56,780 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_building_builder.py:21}  Building AbstractScenarioBuilder...DONE!
2023-09-16 11:22:56,780 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:35}  Building ScenarioFilter...
2023-09-16 11:22:56,782 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/scenario_filter_builder.py:44}  Building ScenarioFilter...DONE!
Ray objects:   0%|                                                                      | 0/96 [00:00<?, ?it/s]Global seed set to 0
2023-09-16 11:23:03,159 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:23:03,167 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.23.02
2023-09-16 11:23:03,168 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:23:03,170 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
Global seed set to 0
2023-09-16 11:23:12,023 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:20}  Building experiment folders...
2023-09-16 11:23:12,030 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/folder_builder.py:22}  Experimental folder: /mnt/cfs/algorithm/linqing.zhao/haozhe/Tuplan_garage/exp/exp/training_pdm_open_model/training_pdm_open_model/2023.09.16.11.23.11
2023-09-16 11:23:12,031 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/script/builders/worker_pool_builder.py:19}  Building WorkerPool...
2023-09-16 11:23:12,034 INFO {/home/linqing.zhao/nuplan-devkit/nuplan/planning/utils/multithreading/worker_ray.py:78}  Starting ray local!
2023-09-16 11:23:17,381 ERROR services.py:1207 -- Failed to start the dashboard 
2023-09-16 11:23:17,382 ERROR services.py:1232 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-16 11:23:17,382 ERROR services.py:1242 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2023-09-16_11-22-55_720350_89715/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2023-09-16 11:23:17,382 ERROR services.py:1276 -- Failed to read dashboard.err file: cannot mmap an empty file. It is unexpected. Please report an issue to Ray github. https://github.com/ray-project/ray/issues
2023-09-16 11:23:17,582 INFO worker.py:1621 -- Started a local Ray instance.
2023-09-16 11:23:25,116 ERROR services.py:1207 -- Failed to start the dashboard 
2023-09-16 11:23:25,116 ERROR services.py:1232 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure' to find where the log file is.
2023-09-16 11:23:25,116 ERROR services.py:1242 -- Couldn't read dashboard.log file. Error: [Errno 2] No such file or directory: '/tmp/ray/session_2023-09-16_11-23-03_304985_90490/logs/dashboard.log'. It means the dashboard is broken even before it initializes the logger (mostly dependency issues). Reading the dashboard.err file which contains stdout/stderr.
2023-09-16 11:23:25,116 ERROR services.py:1276 -- Failed to read dashboard.err file: cannot mmap an empty file. It is unexpected. Please report an issue to Ray github. https://github.com/ray-project/ray/issues
2023-09-16 11:23:25,233 INFO worker.py:1621 -- Started a local Ray instance.
[2023-09-16 11:23:26,416 E 89301 89301] core_worker.cc:201: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
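
The output above interleaves messages from several processes, which makes it hard to see how many Ray instances were launched and how many failed. The sketch below counts the relevant messages in a saved copy of the output. It assumes the console output was redirected to a file (the name `train.log` is hypothetical), and `summarize_ray_events` is a helper name invented for this example. The message strings are copied from the log above.

```shell
#!/usr/bin/env bash
# Minimal sketch: count Ray start-up and failure messages in a saved training log.
# summarize_ray_events is a hypothetical helper name; the patterns come from the log above.
summarize_ray_events() {
    local log="$1"
    local pattern
    for pattern in \
        'Starting ray local!' \
        'Started a local Ray instance.' \
        'Failed to start the dashboard' \
        'Failed to register worker'; do
        # grep -c prints the number of matching lines; -F treats the pattern as a literal string.
        printf '%s: %s\n' "$pattern" "$(grep -cF -- "$pattern" "$log")"
    done
}

# Example call (commented out so that sourcing this file has no side effects):
# summarize_ray_events train.log
```

If "Starting ray local!" is logged more often than "Started a local Ray instance.", some launches never completed, which is one way to narrow down which process hit the dashboard failure.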

DanielDauner commented 1 year ago

Duplicate of #15. Closing this issue.