autonomousvision / carla_garage

[ICCV'23] Hidden Biases of End-to-End Driving Models
MIT License

unable to reproduce correct model #27

Closed: buaazeus closed this issue 7 months ago

buaazeus commented 7 months ago

Hello, I trained the model with the config below.

torchrun --nnodes=1 --nproc_per_node=6 --max_restarts=1 --rdzv_id=42353467 \
  --rdzv_backend=c10d ./team_code/train.py --id train_id_002 --batch_size 32 \
  --setting all --root_dir ./data --logdir ./output --use_controller_input_prediction 1 \
  --use_wp_gru 0 --use_discrete_command 1 --use_tp 1 --continue_epoch 1 --cpu_cores 1 --num_repetitions 3

I didn't change anything else. The model cannot drive properly in longest6; is there anything I missed setting in the config? Thank you.

buaazeus commented 7 months ago

{ "id": "train_id_002", "epochs": 31, "lr": 0.0003, "batch_size": 32, "logdir": "./output/train_id_002", "load_file": null, "setting": "all", "root_dir": "./data", "schedule_reduce_epoch_01": 30, "schedule_reduce_epoch_02": 40, "backbone": "transFuser", "image_architecture": "regnety_032", "lidar_architecture": "regnety_032", "use_velocity": 1, "n_layer": 2, "val_every": 2, "sync_batch_norm": false, "zero_redundancy_optimizer": 1, "use_disk_cache": 0, "lidar_seq_len": 1, "realign_lidar": 1, "use_ground_plane": 0, "use_controller_input_prediction": 1, "use_wp_gru": 0, "pred_len": 8, "estimate_class_distributions": 0, "use_focal_loss": 0, "use_cosine_schedule": 0, "augment": 1, "use_plant": 0, "learn_origin": 1, "local_rank": -999, "train_sampling_rate": 1, "use_amp": 0, "use_grad_clip": 0, "use_color_aug": 1, "use_semantic": 1, "use_depth": 1, "detect_boxes": 1, "use_bev_semantic": 1, "estimate_semantic_distribution": 0, "use_discrete_command": 1, "gru_hidden_size": 64, "use_cutout": 0, "add_features": 1, "freeze_backbone": 0, "learn_multi_task_weights": 0, "transformer_decoder_join": 1, "bev_down_sample_factor": 4, "perspective_downsample_factor": 1, "gru_input_size": 256, "num_repetitions": 3, "bev_grid_height_downsample_factor": 1, "wp_dilation": 1, "use_tp": 1, "continue_epoch": 1, "max_height_lidar": 100.0, "smooth_route": 1, "num_lidar_hits_for_detection": 7, "use_speed_weights": 1, "max_num_bbs": 30, "use_optim_groups": 0, "weight_decay": 0.01, "use_plant_labels": 0, "use_label_smoothing": 0, "cpu_cores": 1, "tp_attention": 0, "multi_wp_output": 0 }

Kait0 commented 7 months ago

Your settings look mostly fine. You don't seem to be doing two-stage training, which will reduce DS, but only by a few points. If I understood you correctly, the model didn't drive at all, so this isn't the main issue. --cpu_cores 1 is strange; either set it to the number of cores on your machine, or to 0 if you want to turn off multithreading. This setting only affects training speed, so it's not the problem.
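
For example, the flag could be set like this when launching training (a minimal sketch, not from the original thread; $(nproc) is a standard Linux command that prints the available core count):

--cpu_cores $(nproc)   # set to the number of cores on the machine
--cpu_cores 0          # or: turn off multithreading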

I would suspect that the training is not the issue here, but you should make sure that the training losses went down properly.
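
One way to check this, assuming train.py writes TensorBoard event files under the --logdir directory (an assumption, not confirmed in this thread):

tensorboard --logdir ./output/train_id_002   # inspect the loss curves in a browser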

Can you turn on inference debug mode by setting these environment variables:

SAVE_PATH=/some/path/   # If set, this folder will be used to store logging and debug information.
DEBUG_CHALLENGE=1       # 1: generate visualization images at SAVE_PATH

to see what the model output looks like.
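
A minimal sketch of how that could look on a single machine (the evaluation script name is the one mentioned later in this thread; the output folder and script path are placeholders):

export SAVE_PATH=/path/to/debug_output   # folder where logging and debug images are written
export DEBUG_CHALLENGE=1                 # 1: generate visualization images at SAVE_PATH
bash local_evaluation.sh                 # run the evaluation as usual; adjust the path to your checkout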

When loading the model, you should make sure that there are no unnecessary files, like the optimizer file, in the folder you load from; otherwise model loading might fail silently (the provided script links/copies the relevant file to an eval folder).
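
A sketch of preparing such a clean folder by hand (the checkpoint file names below are assumptions; check the actual names in your logdir):

mkdir -p ./output/train_id_002/eval
cp ./output/train_id_002/model_0030.pth ./output/train_id_002/eval/   # copy only the model weights (name assumed)
# leave optimizer_*.pth and other auxiliary files out of the folder the agent loads from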

buaazeus commented 7 months ago

I did not move the optimizer.pth file, etc. I use local_evaluation.sh on a single node instead of evaluate_routes_slurm.py, so I did not notice the model copy code. Now it drives fine. Thank you!