Closed: AgentEXPL closed this issue 3 years ago
I would like to close this issue since I got the appropriate values after trying several times.
Hi @AgentEXPL. Thank you for bringing up this issue. I've only trained this model with 8 GPUs and 16/32GB memory per GPU. So I don't have a good answer to this. But this is something valuable to know for others in the community. Can you send a pull request to the README on how you resolved this?
Hi @srama2512. Actually, I am not sure whether my answer is the right one. I only tested two models: the ans_depth model and the occant_depth model. From my experience, the GPU used for the SIMULATOR occupies relatively high GPU memory with low GPU-util when NUM_PROCESSES is relatively high. NUM_PROCESSES can be set according to the total memory of the GPU, since each process occupies roughly 2~3 GB of memory. As for GPU-util, it is mainly driven by the MAPPER when an occupancy anticipation model is used; comparing ans_depth and occant_depth, the occant_depth model is able to achieve high GPU-util.
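As a rough back-of-the-envelope estimate (a minimal sketch based only on my own runs; the 2~3 GB per-process figure and the reserved headroom are my assumptions, not constants from the codebase):

# Sketch: estimate NUM_PROCESSES from available GPU memory.
# All numbers below are assumptions from my observations, not values from the repo.
gpu_memory_gb = 48.0          # e.g. one A40
reserved_for_models_gb = 3.0  # headroom for mapper/policy networks sharing the GPU
per_process_gb = 2.5          # observed ~2-3 GB per simulator process

num_processes = int((gpu_memory_gb - reserved_for_models_gb) // per_process_gb)
print(num_processes)  # -> 18 with these numbers, matching my config below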
The following is one of my settings. I would like to use the same GPUs for both the mapper and the simulator, since I do not have enough GPUs.

Setting 1: two A40 GPUs, each of which has 48 GB of memory.

BASE_TASK_CONFIG_PATH: "configs/exploration/gibson_train.yaml"
TRAINER_NAME: "occant_exp"
ENV_NAME: "ExpRLEnv"
SIMULATOR_GPU_ID: 4
SIMULATOR_GPU_IDS: [4,5]
TORCH_GPU_ID: 4
VIDEO_OPTION: ["disk", "tensorboard"]
TENSORBOARD_DIR: "tb"
VIDEO_DIR: "video_dir"
EVAL_CKPT_PATH_DIR: "data/new_checkpoints"
NUM_PROCESSES: 18
SENSORS: ["RGB_SENSOR", "DEPTH_SENSOR"]
CHECKPOINT_FOLDER: "data/new_checkpoints"
NUM_EPISODES: 10000
T_EXP: 1000
RL:
  PPO:
    ppo_epoch: 4
    num_mini_batch: 4
  ANS:
    # reward_type: "map_accuracy"
    image_scale_hw: [128, 128]
    MAPPER:
      map_size: 65
      registration_type: "moving_average"
      label_id: "ego_map_gt_anticipated"
      ignore_pose_estimator: False
      map_batch_size: 120
      use_data_parallel: True
      replay_size: 100000
      gpu_ids: [4,5]
    OCCUPANCY_ANTICIPATOR:
      type: "occant_depth"
I used torch==1.9.0 and cuda==11.1 with two A40 GPUs. The code works if "fill_value" is removed from the inputs to the scatter_max function on line 78 of mapnet.py. The FPS is nearly 30.
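For anyone hitting the same error, here is a minimal sketch of what I mean, assuming the scatter_max in mapnet.py comes from torch_scatter (newer torch_scatter releases dropped the fill_value keyword). The tensors below are made up for illustration and are not the actual mapnet.py inputs.

# Minimal sketch, not the actual mapnet.py code; assumes torch_scatter's scatter_max.
import torch
from torch_scatter import scatter_max

src = torch.tensor([[1.0, 3.0, 2.0, 4.0]])
index = torch.tensor([[0, 0, 1, 1]])

# Older torch_scatter accepted a fill_value keyword, e.g.
#   scatter_max(src, index, dim=-1, fill_value=0)
# Newer releases removed it, so the call simply omits that argument:
out, argmax = scatter_max(src, index, dim=-1)
print(out)  # tensor([[3., 4.]])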
What confuses me is that the GPU-util is extremely low (<10%) while the GPU memory usage is high (nearly 50%) when training a model (e.g., the ans_depth model). The GPU-util of the GPU used for the SIMULATOR and MAPPER is nearly 0. Is this normal, or is something wrong due to the change in installation environment?
What should I do to improve the GPU-util? Is it possible to improve it by setting the config values in the .yaml file? What is the relationship among RL.ANS.MAPPER.replay_size, map_batch_size, and NUM_PROCESSES? I have no idea how to set appropriate values for "replay_size" and "map_batch_size". It would be of great help if some explanations could be provided.