allenai / embodied-clip

Official codebase for EmbCLIP
https://arxiv.org/abs/2111.09888
Apache License 2.0

[Help!]ConnectionResetError: [Errno 104] Connection reset by peer #6

Closed xuexidi closed 2 years ago

xuexidi commented 2 years ago

I can train the Habitat ObjectNav DD-PPO baseline normally on my Ubuntu 18.04 server with:

python habitat_baselines/run.py --exp-config habitat_baselines/config/objectnav/ddppo_objectnav_rgb_clip.yaml --run-type train

When I copy the conda virtual environment and the training code to another machine (also an Ubuntu 18.04 server) for training, the error in the title appears. What could be the reason?

---
I0720 20:43:58.673173 30334 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/VFuaQ6m2Qom/VFuaQ6m2Qom.navmesh
I0720 20:43:58.673994 30334 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:43:58.686116 30334 PathFinder.cpp:382] Building navmesh with 704x949 cells
I0720 20:43:59.103587 30334 PathFinder.cpp:650] Created navmesh with 2584 vertices 1279 polygons
I0720 20:43:59.103703 30334 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:43:59,111 Initializing task ObjectNav-v1
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 80, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 76, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 59, in execute_exp
    trainer.train()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 724, in train
    self._init_train()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 248, in _init_train
    self._init_envs()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 201, in _init_envs
    workers_ignore_signals=is_slurm_batch_job(),
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/utils/env_utils.py", line 107, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f3145fe6e10>>
Traceback (most recent call last):
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 589, in __del__
    self.close()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 457, in close
    read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
apoorvkh commented 2 years ago

It's really hard to say what's wrong from this error log. ConnectionResetError will appear when a lower level error occurs and the multiprocessing connection is interrupted. I don't see a more specific error here. Is this the full error log?

You should also check out the issues in facebookresearch/habitat-lab. Otherwise, I think you said you copied the repo and conda env from one server to another? These might include machine-specific build files. Instead, can you follow the EmbodiedCLIP-Habitat instructions from scratch on your second machine?
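
To illustrate why the traceback above is only a symptom, here is a minimal sketch (not taken from habitat-lab) in which a worker process dies without replying and the parent sees nothing but a generic connection error. The names `crashing_worker` and `probe_worker` are hypothetical, and the `fork` start method is an assumption that holds on Linux.

```python
import multiprocessing as mp
import os


def crashing_worker(conn):
    # Stand-in for a low-level failure (segfault, OOM kill, driver error):
    # the process exits immediately without sending anything back.
    os._exit(1)


def probe_worker():
    ctx = mp.get_context("fork")  # assumption: Linux, as in this thread
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=crashing_worker, args=(child_conn,))
    proc.start()
    child_conn.close()  # drop the parent's copy so recv() can notice the closed pipe
    try:
        parent_conn.recv()
        outcome = "received"
    except (EOFError, ConnectionResetError) as exc:
        # The parent learns only that the connection broke, not why.
        outcome = type(exc).__name__
    proc.join()
    return outcome, proc.exitcode
```

The exception raised in the parent carries no information about what happened in the worker, which is why the real cause usually has to be found in the worker's own output or exit status.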

xuexidi commented 2 years ago

@apoorvkh Hello, I have posted the full running log below.

Background: I have three Ubuntu 18.04 machines.

Machine A: my home PC, Ubuntu 18.04 Desktop, RTX 3060, 16 GB RAM

Machine B: my school workstation, Ubuntu 18.04 Server, Quadro M5000, 60 GB RAM

Machine C: my school PC, Ubuntu 18.04 Server, RTX 3080, 32 GB RAM

Steps:

  1. I created the conda env "embclip-habitat" on machine A (only the basic dependencies described in environment.yml on the habitat branch; habitat-lab and CLIP were not installed yet).
  2. I copied the conda env "embclip-habitat" to machine B.
  3. I installed habitat-lab and CLIP on machine A and machine B separately. With NUM_ENVIRONMENTS: 10, everything works well.
  4. I copied the conda virtual environment from step 3 (I tried the env from both machine A and machine B), the CLIP pre-trained models, and the same training code to machine C for training, and the error in the title appears.

Questions: I have read the facebookresearch/habitat-lab issues but found nothing useful. Is it possible that the problem is the copy of conda env "embclip-habitat"? In my conda env "embclip-habitat", some packages were installed through conda and others through pip.

Running log:

(embclip-habitat) xuexidi@pxdevice:~/xue/embclip-habitat$ python habitat_baselines/run.py --exp-config habitat_baselines/config/objectnav/ddppo_objectnav_rgb_clip.yaml --run-type train
/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/site-packages/gym/core.py:26: UserWarning: WARN: Gym minimally supports python 3.6 as the python foundation not longer supports the version, please update your version to 3.7+
"Gym minimally supports python 3.6 as the python foundation not longer supports the version, please update your version to 3.7+"
2022-07-20 20:52:01,423 config: BASE_TASK_CONFIG_PATH: configs/tasks/objectnav_mp3d_rgb.yaml
CHECKPOINT_FOLDER: logs/objectnav-rgb-clip
CHECKPOINT_INTERVAL: -1
CMD_TRAILING_OPTS: ['TASK_CONFIG.ENVIRONMENT.ITERATOR_OPTIONS.MAX_SCENE_REPEAT_STEPS', '50000']
ENV_NAME: NavRLEnv
EVAL:
DEVICE: cpu
NUM_ENVIRONMENTS: 1
SPLIT: val_mini
USE_CKPT_CONFIG: False
EVAL_CKPT_PATH_DIR: logs/objectnav-rgb-clip
EVAL_DURING_TRAIN: True
FORCE_BLIND_POLICY: False
FORCE_TORCH_SINGLE_THREADED: True
LOG_FILE: logs/objectnav-rgb-clip/train.log
LOG_INTERVAL: 10
NUM_CHECKPOINTS: 10
NUM_ENVIRONMENTS: 10
NUM_PROCESSES: -1
NUM_UPDATES: -1
ORBSLAM2:
ANGLE_TH: 0.2617993877991494
BETA: 100
CAMERA_HEIGHT: 1.25
DEPTH_DENORM: 10.0
DIST_REACHED_TH: 0.15
DIST_TO_STOP: 0.05
D_OBSTACLE_MAX: 4.0
D_OBSTACLE_MIN: 0.1
H_OBSTACLE_MAX: 1.25
H_OBSTACLE_MIN: 0.375
MAP_CELL_SIZE: 0.1
MAP_SIZE: 40
MIN_PTS_IN_OBSTACLE: 320.0
NEXT_WAYPOINT_TH: 0.5
NUM_ACTIONS: 3
PLANNER_MAX_STEPS: 500
PREPROCESS_MAP: True
SLAM_SETTINGS_PATH: habitat_baselines/slambased/data/mp3d3_small1k.yaml
SLAM_VOCAB_PATH: habitat_baselines/slambased/data/ORBvoc.txt
PROFILING:
CAPTURE_START_STEP: -1
NUM_STEPS_TO_CAPTURE: -1
RL:
DDPPO:
backbone: resnet50_clip_avgpool
distrib_backend: NCCL
force_distributed: False
num_recurrent_layers: 2
pretrained: False
pretrained_encoder: False
pretrained_weights: data/ddppo-models/gibson-2plus-resnet50.pth
reset_critic: True
rnn_type: LSTM
sync_frac: 0.6
train_encoder: False
POLICY:
OBS_TRANSFORMS:
CENTER_CROPPER:
HEIGHT: 256
WIDTH: 256
CUBE2EQ:
HEIGHT: 256
SENSOR_UUIDS: []
WIDTH: 512
CUBE2FISH:
FOV: 180
HEIGHT: 256
PARAMS: (0.2, 0.2, 0.2)
SENSOR_UUIDS: []
WIDTH: 256
ENABLED_TRANSFORMS: ('ResizeShortestEdge', 'CenterCropper')
EQ2CUBE:
HEIGHT: 256
SENSOR_UUIDS: []
WIDTH: 256
RESIZE_SHORTEST_EDGE:
SIZE: 256
name: PointNavResNetPolicy
PPO:
clip_param: 0.2
entropy_coef: 0.01
eps: 1e-05
gamma: 0.99
hidden_size: 512
lr: 0.00025
max_grad_norm: 0.2
num_mini_batch: 2
num_steps: 64
ppo_epoch: 4
reward_window_size: 50
tau: 0.95
use_double_buffered_sampler: False
use_gae: True
use_linear_clip_decay: False
use_linear_lr_decay: False
use_normalized_advantage: False
value_loss_coef: 0.5
REWARD_MEASURE: distance_to_goal
SLACK_REWARD: -0.001
SUCCESS_MEASURE: spl
SUCCESS_REWARD: 2.5
SENSORS: ['RGB_SENSOR']
SIMULATOR_GPU_ID: 0
TASK_CONFIG:
DATASET:
CONTENT_SCENES: ['*']
DATA_PATH: data/datasets/objectnav/mp3d/v1/{split}/{split}.json.gz
SCENES_DIR: data/scene_datasets/
SPLIT: train
TYPE: ObjectNav-v1
ENVIRONMENT:
ITERATOR_OPTIONS:
CYCLE: True
GROUP_BY_SCENE: True
MAX_SCENE_REPEAT_EPISODES: -1
MAX_SCENE_REPEAT_STEPS: 10000
NUM_EPISODE_SAMPLE: -1
SHUFFLE: True
STEP_REPETITION_RANGE: 0.2
MAX_EPISODE_SECONDS: 10000000
MAX_EPISODE_STEPS: 500
PYROBOT:
BASE_CONTROLLER: proportional
BASE_PLANNER: none
BUMP_SENSOR:
TYPE: PyRobotBumpSensor
DEPTH_SENSOR:
CENTER_CROP: False
HEIGHT: 480
MAX_DEPTH: 5.0
MIN_DEPTH: 0.0
NORMALIZE_DEPTH: True
TYPE: PyRobotDepthSensor
WIDTH: 640
LOCOBOT:
ACTIONS: ['BASE_ACTIONS', 'CAMERA_ACTIONS']
BASE_ACTIONS: ['go_to_relative', 'go_to_absolute']
CAMERA_ACTIONS: ['set_pan', 'set_tilt', 'set_pan_tilt']
RGB_SENSOR:
CENTER_CROP: False
HEIGHT: 480
TYPE: PyRobotRGBSensor
WIDTH: 640
ROBOT: locobot
ROBOTS: ['locobot']
SENSORS: ['RGB_SENSOR', 'DEPTH_SENSOR', 'BUMP_SENSOR']
SEED: 100
SIMULATOR:
ACTION_SPACE_CONFIG: v1
AGENTS: ['AGENT_0']
AGENT_0:
ANGULAR_ACCELERATION: 12.56
ANGULAR_FRICTION: 1.0
COEFFICIENT_OF_RESTITUTION: 0.0
HEIGHT: 0.88
IS_SET_START_STATE: False
LINEAR_ACCELERATION: 20.0
LINEAR_FRICTION: 0.5
MASS: 32.0
RADIUS: 0.18
SENSORS: ['RGB_SENSOR']
START_POSITION: [0, 0, 0]
START_ROTATION: [0, 0, 0, 1]
DEFAULT_AGENT_ID: 0
DEPTH_SENSOR:
HEIGHT: 480
HFOV: 79
MAX_DEPTH: 5.0
MIN_DEPTH: 0.5
NORMALIZE_DEPTH: True
ORIENTATION: [0.0, 0.0, 0.0]
POSITION: [0, 0.88, 0]
TYPE: HabitatSimDepthSensor
WIDTH: 640
FORWARD_STEP_SIZE: 0.25
HABITAT_SIM_V0:
ALLOW_SLIDING: False
ENABLE_PHYSICS: False
GPU_DEVICE_ID: 0
GPU_GPU: False
PHYSICS_CONFIG_FILE: ./data/default.physics_config.json
RGB_SENSOR:
HEIGHT: 480
HFOV: 79
ORIENTATION: [0.0, 0.0, 0.0]
POSITION: [0, 0.88, 0]
TYPE: HabitatSimRGBSensor
WIDTH: 640
SCENE: data/scene_datasets/habitat-test-scenes/van-gogh-room.glb
SEED: 100
SEMANTIC_SENSOR:
HEIGHT: 480
HFOV: 79
ORIENTATION: [0.0, 0.0, 0.0]
POSITION: [0, 0.88, 0]
TYPE: HabitatSimSemanticSensor
WIDTH: 640
TILT_ANGLE: 30
TURN_ANGLE: 30
TYPE: Sim-v0
TASK:
ACTIONS:
ANSWER:
TYPE: AnswerAction
LOOK_DOWN:
TYPE: LookDownAction
LOOK_UP:
TYPE: LookUpAction
MOVE_FORWARD:
TYPE: MoveForwardAction
STOP:
TYPE: StopAction
TELEPORT:
TYPE: TeleportAction
TURN_LEFT:
TYPE: TurnLeftAction
TURN_RIGHT:
TYPE: TurnRightAction
ANSWER_ACCURACY:
TYPE: AnswerAccuracy
COLLISIONS:
TYPE: Collisions
COMPASS_SENSOR:
TYPE: CompassSensor
CORRECT_ANSWER:
TYPE: CorrectAnswer
DISTANCE_TO_GOAL:
DISTANCE_TO: VIEW_POINTS
TYPE: DistanceToGoal
EPISODE_INFO:
TYPE: EpisodeInfo
GOAL_SENSOR_UUID: objectgoal
GPS_SENSOR:
DIMENSIONALITY: 2
TYPE: GPSSensor
HEADING_SENSOR:
TYPE: HeadingSensor
IMAGEGOAL_SENSOR:
TYPE: ImageGoalSensor
INSTRUCTION_SENSOR:
TYPE: InstructionSensor
INSTRUCTION_SENSOR_UUID: instruction
MEASUREMENTS: ['DISTANCE_TO_GOAL', 'SUCCESS', 'SPL', 'SOFT_SPL']
OBJECTGOAL_SENSOR:
GOAL_SPEC: TASK_CATEGORY_ID
GOAL_SPEC_MAX_VAL: 50
TYPE: ObjectGoalSensor
POINTGOAL_SENSOR:
DIMENSIONALITY: 2
GOAL_FORMAT: POLAR
TYPE: PointGoalSensor
POINTGOAL_WITH_GPS_COMPASS_SENSOR:
DIMENSIONALITY: 2
GOAL_FORMAT: POLAR
TYPE: PointGoalWithGPSCompassSensor
POSSIBLE_ACTIONS: ['STOP', 'MOVE_FORWARD', 'TURN_LEFT', 'TURN_RIGHT', 'LOOK_UP', 'LOOK_DOWN']
PROXIMITY_SENSOR:
MAX_DETECTION_RADIUS: 2.0
TYPE: ProximitySensor
QUESTION_SENSOR:
TYPE: QuestionSensor
SENSORS: ['OBJECTGOAL_SENSOR', 'COMPASS_SENSOR', 'GPS_SENSOR']
SOFT_SPL:
TYPE: SoftSPL
SPL:
TYPE: SPL
SUCCESS:
SUCCESS_DISTANCE: 0.1
TYPE: Success
SUCCESS_DISTANCE: 0.1
TOP_DOWN_MAP:
DRAW_BORDER: True
DRAW_GOAL_AABBS: True
DRAW_GOAL_POSITIONS: True
DRAW_SHORTEST_PATH: True
DRAW_SOURCE: True
DRAW_VIEW_POINTS: True
FOG_OF_WAR:
DRAW: True
FOV: 90
VISIBILITY_DIST: 5.0
MAP_PADDING: 3
MAP_RESOLUTION: 1024
MAX_EPISODE_STEPS: 1000
TYPE: TopDownMap
TYPE: ObjectNav-v1
TENSORBOARD_DIR: logs/objectnav-rgb-clip/tb
TEST_EPISODE_COUNT: -1
TORCH_GPU_ID: 0
TOTAL_NUM_STEPS: 250000000.0
TRAINER_NAME: ddppo
VERBOSE: True
VIDEO_DIR: video_dir
VIDEO_OPTION: ['tensorboard']
2022-07-20 20:52:01,423 Initializing dataset ObjectNav-v1
.
.
.
..........  Some redundant logs are omitted here-------------
.
.
.
I0720 20:52:54.022358 31543 ResourceManager.cpp:234] ResourceManager::loadStage : Not loading semantic mesh
I0720 20:52:54.022378 31543 ResourceManager.cpp:262] ResourceManager::loadStage : start load render asset data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.glb.
I0720 20:52:54.022383 31543 ResourceManager.cpp:569] ResourceManager::loadStageInternal : Attempting to load stage data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.glb
I0720 20:52:54.022403 31543 ResourceManager.cpp:1119] Importing Basis files as BC7 for 1pXnuDYAj8r.glb
W0720 20:52:56.700733 31544 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:52:56.831446 31544 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/Uxmj2M2itWa/Uxmj2M2itWa.navmesh
I0720 20:52:56.837675 31544 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:52:56.841107 31544 PathFinder.cpp:382] Building navmesh with 614x1023 cells
I0720 20:52:57.021582 31544 PathFinder.cpp:650] Created navmesh with 1050 vertices 535 polygons
I0720 20:52:57.021657 31544 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:52:57,041 Initializing task ObjectNav-v1
W0720 20:52:59.574173 31543 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:52:59.747750 31543 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.navmesh
I0720 20:52:59.748421 31543 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:52:59.769464 31543 PathFinder.cpp:382] Building navmesh with 394x509 cells
I0720 20:53:00.013197 31543 PathFinder.cpp:650] Created navmesh with 1097 vertices 543 polygons
I0720 20:53:00.013267 31543 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:53:00,016 Initializing task ObjectNav-v1
W0720 20:53:01.302026 31542 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:53:01.478476 31542 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/VFuaQ6m2Qom/VFuaQ6m2Qom.navmesh
I0720 20:53:01.479099 31542 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:53:01.487617 31542 PathFinder.cpp:382] Building navmesh with 704x949 cells
I0720 20:53:01.907059 31542 PathFinder.cpp:650] Created navmesh with 2584 vertices 1279 polygons
I0720 20:53:01.907184 31542 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:53:01,910 Initializing task ObjectNav-v1
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 80, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 76, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 59, in execute_exp
    trainer.train()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 724, in train
    self._init_train()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 248, in _init_train
    self._init_envs()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 201, in _init_envs
    workers_ignore_signals=is_slurm_batch_job(),
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/utils/env_utils.py", line 107, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f708a102668>>
Traceback (most recent call last):
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 589, in __del__
    self.close()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 457, in close
    read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
apoorvkh commented 2 years ago

Is it possible that the problem is the copy of conda env "embclip-habitat"?

Yes, this is a possible problem since conda environment files can be machine dependent.

So, it would be best if you followed the instructions from scratch on your "machine C". Please try to replicate the baseline experiment on this machine without any changes, so we can determine if that works to begin with.

That means you should install the conda environment and clone the codebase directly on that machine, instead of copying these files from your other machines.
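
As one concrete example of how a copied environment can be machine dependent: scripts in a conda env's bin directory start with a shebang line that holds an absolute interpreter path. The sketch below looks for shebangs whose interpreter is missing on the current machine. `find_stale_shebangs` is a hypothetical helper, it covers only this one kind of breakage (it cannot detect compiled libraries built against a different driver or CUDA version), and nothing in this thread confirms that this is the cause here.

```python
import os


def find_stale_shebangs(env_prefix):
    """Return (script, interpreter) pairs in <env_prefix>/bin whose interpreter is missing."""
    stale = []
    bin_dir = os.path.join(env_prefix, "bin")
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            # Only the first line matters; cap the read so binaries stay cheap.
            first_line = handle.readline(256).decode("utf-8", errors="replace").strip()
        if not first_line.startswith("#!"):
            continue  # compiled binary or a file without a shebang
        parts = first_line[2:].split()
        interpreter = parts[0] if parts else ""
        if os.path.isabs(interpreter) and not os.path.exists(interpreter):
            stale.append((name, interpreter))
    return stale
```

Calling it with the env prefix seen in the logs (for example `find_stale_shebangs("/home/xuexidi/anaconda3/envs/embclip-habitat")`) would list any scripts that still point at an interpreter from the original machine.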

xuexidi commented 2 years ago

@apoorvkh Hey, thanks for your reply! I installed the whole environment on a completely new machine (with 2 P100 GPUs). When I run python habitat_baselines/run.py --exp-config habitat_baselines/config/objectnav/ddppo_objectnav_rgb_clip.yaml --run-type train, or run:

export NUM_GPUS=1
export TASK=objectnav # {objectnav,pointnav}
export MODEL=clip     # {clip,imagenet}
GLOG_minloglevel=2 MAGNUM_LOG=quiet \
python -u -m torch.distributed.launch \
    --use_env \
    --nproc_per_node $NUM_GPUS \
    habitat_baselines/run.py \
    --exp-config habitat_baselines/config/${TASK}/ddppo_${TASK}_rgb_${MODEL}.yaml \
    --run-type train

The training goes well.

But when I try to train on 2 GPUs (export NUM_GPUS=2):

export NUM_GPUS=2
export TASK=objectnav # {objectnav,pointnav}
export MODEL=clip     # {clip,imagenet}
GLOG_minloglevel=2 MAGNUM_LOG=quiet \
python -u -m torch.distributed.launch \
    --use_env \
    --nproc_per_node $NUM_GPUS \
    habitat_baselines/run.py \
    --exp-config habitat_baselines/config/${TASK}/ddppo_${TASK}_rgb_${MODEL}.yaml \
    --run-type train

Training fails with the same error as before. This error is very frustrating, because without multi-GPU training I cannot quickly verify the code or run experiments. Please help me, or teach me how to debug and find the root cause of this error.

..........  Some redundant logs are omitted here-------------
.
.
.
I0720 20:52:54.022358 31543 ResourceManager.cpp:234] ResourceManager::loadStage : Not loading semantic mesh
I0720 20:52:54.022378 31543 ResourceManager.cpp:262] ResourceManager::loadStage : start load render asset data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.glb.
I0720 20:52:54.022383 31543 ResourceManager.cpp:569] ResourceManager::loadStageInternal : Attempting to load stage data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.glb
I0720 20:52:54.022403 31543 ResourceManager.cpp:1119] Importing Basis files as BC7 for 1pXnuDYAj8r.glb
W0720 20:52:56.700733 31544 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:52:56.831446 31544 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/Uxmj2M2itWa/Uxmj2M2itWa.navmesh
I0720 20:52:56.837675 31544 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:52:56.841107 31544 PathFinder.cpp:382] Building navmesh with 614x1023 cells
I0720 20:52:57.021582 31544 PathFinder.cpp:650] Created navmesh with 1050 vertices 535 polygons
I0720 20:52:57.021657 31544 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:52:57,041 Initializing task ObjectNav-v1
W0720 20:52:59.574173 31543 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:52:59.747750 31543 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/1pXnuDYAj8r/1pXnuDYAj8r.navmesh
I0720 20:52:59.748421 31543 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:52:59.769464 31543 PathFinder.cpp:382] Building navmesh with 394x509 cells
I0720 20:53:00.013197 31543 PathFinder.cpp:650] Created navmesh with 1097 vertices 543 polygons
I0720 20:53:00.013267 31543 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:53:00,016 Initializing task ObjectNav-v1
W0720 20:53:01.302026 31542 Simulator.cpp:248] :
The active scene does not contain semantic annotations.
I0720 20:53:01.478476 31542 simulator.py:213] Loaded navmesh data/scene_datasets/mp3d/VFuaQ6m2Qom/VFuaQ6m2Qom.navmesh
I0720 20:53:01.479099 31542 simulator.py:225] Recomputing navmesh for agent's height 0.88 and radius 0.18.
I0720 20:53:01.487617 31542 PathFinder.cpp:382] Building navmesh with 704x949 cells
I0720 20:53:01.907059 31542 PathFinder.cpp:650] Created navmesh with 2584 vertices 1279 polygons
I0720 20:53:01.907184 31542 Simulator.cpp:710] reconstruct navmesh successful
2022-07-20 20:53:01,910 Initializing task ObjectNav-v1
Traceback (most recent call last):
  File "habitat_baselines/run.py", line 80, in <module>
    main()
  File "habitat_baselines/run.py", line 40, in main
    run_exp(**vars(args))
  File "habitat_baselines/run.py", line 76, in run_exp
    execute_exp(config, run_type)
  File "habitat_baselines/run.py", line 59, in execute_exp
    trainer.train()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 724, in train
    self._init_train()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 248, in _init_train
    self._init_envs()
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py", line 201, in _init_envs
    workers_ignore_signals=is_slurm_batch_job(),
  File "/home/xuexidi/xue/embclip-habitat/habitat_baselines/utils/env_utils.py", line 107, in construct_envs
    workers_ignore_signals=workers_ignore_signals,
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 195, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f708a102668>>
Traceback (most recent call last):
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 589, in __del__
    self.close()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 457, in close
    read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/xuexidi/xue/embclip-habitat/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xuexidi/anaconda3/envs/embclip-habitat/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:

I installed the environment as follows:

  1. pip install requirement.txt in habitat-sim
  2. conda install --use-local habitat-sim (1.7.0) headless withbullet ........ (my server can't connect to the conda server, so I installed it locally)
  3. pip install requirement.txt in habitat-lab (1.7.0)
  4. use python setup.py develop to install habitat-lab (1.7.0)
  5. use python setup.py install to install clip

With the setup above, I can train on a single GPU, but as soon as I enable multi-GPU training, the error described above appears. :(
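
One possible starting point for debugging a worker that vanishes: Python's multiprocessing records how each child process ended, and a negative `exitcode` of -N means the process was terminated by signal N (for example SIGKILL, which the Linux OOM killer sends, or SIGSEGV from a crash in native code). The sketch below is a generic illustration with hypothetical helper names (`describe_exitcode`, `kill_self`, `run_and_describe`). Applying it to the habitat worker processes would require access to their process handles, which is not shown here.

```python
import multiprocessing as mp
import os
import signal


def describe_exitcode(exitcode):
    """Translate a multiprocessing.Process.exitcode into a readable cause."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited cleanly"
    if exitcode < 0:
        # A negative value -N means the process was terminated by signal N.
        return "killed by " + signal.Signals(-exitcode).name
    return "exited with status %d" % exitcode


def kill_self():
    # Stand-in for an external kill such as the kernel OOM killer (SIGKILL).
    os.kill(os.getpid(), signal.SIGKILL)


def run_and_describe(target):
    ctx = mp.get_context("fork")  # assumption: Linux
    proc = ctx.Process(target=target)
    proc.start()
    proc.join()
    return describe_exitcode(proc.exitcode)
```

A result such as "killed by SIGKILL" would point toward the operating system (the kernel log may then say more), while "killed by SIGSEGV" would point toward native code such as the simulator or a GPU library.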

apoorvkh commented 2 years ago

There is no need for you to follow the original Habitat installation instructions. Please follow the installation instructions I've written here.

I've simplified it so that all appropriate libraries should install with just the following lines:

git clone -b habitat --single-branch https://github.com/allenai/embodied-clip.git embclip-habitat
cd embclip-habitat
conda env create --name embclip-habitat
conda activate embclip-habitat

Again, please refer to the instructions I've linked above for additional details. E.g. you might need to modify the appropriate cudatoolkit version.

I'm sorry to say that the error log you've provided is not very informative. You'll have to take that up with the habitat-lab developers.

xuexidi commented 2 years ago

@apoorvkh Hi, sorry for the late reply. I found that the error reported above may not be caused by a broken environment or dependencies.

On the server where the training code reports the above error, I tried yesterday to reduce the number of scenes in the mp3d_habitat.zip training dataset (the original training set contains 50+ scenes; I reduced it to about 15).

I was surprised to find that, although the training code still reported the same error, the program did not exit abnormally and instead went on to train normally.

So I carefully increased the number of scenes in the training set. At 18 scenes, the program reported the error and quit training automatically.

After reading the code, I found that NUM_ENVIRONMENTS only controls the number of processes that are started. No matter how many processes are started, the program distributes the scenes of the whole training set evenly across the processes during training initialization (embclip-habitat/habitat_baselines/rl/ppo/ppo_trainer.py ---> embclip-habitat/habitat_baselines/utils/env_utils.py, line 91).
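
The even distribution described above can be sketched as a round-robin split. This is an illustration of the idea only, not a copy of the code in env_utils.py; `split_scenes` and the scene names in the usage note are hypothetical.

```python
def split_scenes(scenes, num_environments):
    """Distribute scene names across worker processes in round-robin order."""
    if num_environments < 1:
        raise ValueError("num_environments must be at least 1")
    splits = [[] for _ in range(num_environments)]
    for index, scene in enumerate(scenes):
        splits[index % num_environments].append(scene)
    return splits
```

With 18 scenes and 10 processes, eight processes receive two scenes and two processes receive one. Every scene in the training set is assigned to some process regardless of the process count, so the total set of scenes handled stays the same when NUM_ENVIRONMENTS changes.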

So I naturally suspected that the error was caused by insufficient memory on this server. But my experiments gave the following results:

NUM_ENVIRONMENTS   Scenes in training set   Trains normally   Remaining RAM   Remaining video memory
2                  2                        yes               45G             20G
3                  3                        yes               43G             19G
...
17                 17                       yes               30G             10G
18                 18                       no                27G             9G
17                 18                       no                27G             9G
16                 18                       no                27G             9G
15                 18                       no                27G             9G
14                 18                       no                27G             9G
13                 18                       no                27G             9G
...
19                 19                       no                27G             9G
18                 19                       no                27G             9G
17                 19                       no                27G             9G
16                 19                       no                27G             9G
15                 19                       no                27G             9G
14                 19                       no                27G             9G
...
16                 17                       yes               30G             10G
15                 17                       yes               30G             10G
14                 17                       yes               30G             10G
13                 17                       yes               30G             10G
...
16                 16                       yes               31G             11G
15                 16                       yes               31G             11G
14                 16                       yes               31G             11G
13                 16                       yes               31G             11G

So I'm very confused. It doesn't seem to be a problem with environment dependencies or a lack of RAM.

By the way, I want to migrate zero-shot object navigation from RoboTHOR to Habitat. Do you think this can be achieved? Is there anything that needs extra attention?

xuexidi commented 2 years ago

@apoorvkh

UPDATE

[Information sharing] Following up on the problem above, I have recently made some new discoveries and want to share them to help others who hit the same issue.

Habitat Object Navigation's MP3D dataset contains about 55 training scenes. Some of these scenes are very large and others are comparatively small.

I found that the error I described above occurred when very large training scenes were loaded, even though GPU memory and RAM were sufficient.

Therefore, I removed all of the very large scenes from the training set. The error above is still reported, but the program no longer exits and runs normally.

Now, whether I set NUM_ENVIRONMENTS to 15 or 20, training runs normally, even with multiple GPUs!

Note that the scenes need to be selected carefully so that the program does not exit with an exception caused by loading scenes that are too large.
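One possible way to shortlist scenes is sketched below. It assumes each scene lives in its own directory under the dataset root (as in data/scene_datasets/mp3d/VFuaQ6m2Qom/ from the log) and uses total on-disk size as a rough proxy for "too large". The helper names and the size threshold are hypothetical, and on-disk size may not match whatever actually triggers the error, so treat this as a starting point only.

```python
import os
from typing import List


def scene_size_bytes(scene_dir: str) -> int:
    """Sum the sizes of all files under one scene directory."""
    total = 0
    for root, _dirs, files in os.walk(scene_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def select_small_scenes(dataset_root: str, max_bytes: int) -> List[str]:
    """Return names of scene directories whose size is at most max_bytes.

    Hypothetical helper: the threshold has to be found by experiment.
    """
    keep = []
    for name in sorted(os.listdir(dataset_root)):
        path = os.path.join(dataset_root, name)
        if os.path.isdir(path) and scene_size_bytes(path) <= max_bytes:
            keep.append(name)
    return keep


if __name__ == "__main__":
    root = "data/scene_datasets/mp3d"
    if os.path.isdir(root):
        # 200 MB is an arbitrary example threshold, not a tested value.
        print(select_small_scenes(root, 200 * 1024 * 1024))
```

The resulting list of scene names could then be used to restrict which scenes the training configuration loads, if your Habitat version offers an option for selecting scenes.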

apoorvkh commented 2 years ago

I am glad you were able to resolve this issue. It sounds like the resources on your server were limited for large scenes in Habitat.

Regarding zeroshot object navigation in Habitat, I recommend that you take a look at this paper: ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings.

xuexidi commented 2 years ago

> I am glad you were able to resolve this issue. It sounds like the resources on your server were limited for large scenes in Habitat.
>
> Regarding zeroshot object navigation in Habitat, I recommend that you take a look at this paper: ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings.

Hey, thanks for your suggestion. I have read the ZSON paper. I found that most current research on object navigation lacks the ability to adapt: if the target object appears in an unusual place (that is, its position changes), most algorithms will fail. Therefore, I think it would have greater research and commercial significance to study how an agent can quickly adapt to a new environment and to changes in object positions. Do you have any recommended papers or ideas on this topic?

apoorvkh commented 2 years ago

You might be interested in ProcTHOR for large-scale procedurally generated environments!

xuexidi commented 2 years ago

> You might be interested in ProcTHOR for large-scale procedurally generated environments!

Thank you for sharing. I'll go and study ProcTHOR~