devendrachaplot / Neural-SLAM

PyTorch code for the ICLR-20 paper "Learning to Explore using Active Neural SLAM"
http://www.cs.cmu.edu/~dchaplot/projects/neural-slam.html
MIT License

EOFError when evaluating #35

Closed: DEQDON closed this issue 3 years ago

DEQDON commented 3 years ago

Hi, I was running an evaluation with the pre-trained models and encountered an EOFError after the code had been running normally for some time.

This is my script for evaluation: python main.py --split val_mt_small --eval 1 --auto_gpu_config 0 --num_processes 10 --num_episodes 71 --num_processes_per_gpu 5 --train_global 0 --train_local 0 --train_slam 0 --load_global pretrained_models/model_best.global --load_local pretrained_models/model_best.local --load_slam pretrained_models/model_best.slam --print_images 1 -d results/ --exp_name exp_pre -v 1 > log0421.txt 2>&1

Here is the log from my terminal (repetitive iterations in the middle are omitted). (As a side question, why is there no reward in the first several minutes?):


Dumping at results//models/exp_pre/
Namespace(alpha=0.99, auto_gpu_config=0, camera_height=1.25, clip_param=0.2, collision_threshold=0.2, cuda=True, du_scale=2, dump_location='results/', entropy_coef=0.001, env_frame_height=256, env_frame_width=256, eps=1e-05, eval=1, exp_loss_coeff=1.0, exp_name='exp_pre', frame_height=128, frame_width=128, gamma=0.99, global_downscaling=2, global_hidden_size=256, global_lr=2.5e-05, goals_size=2, hfov=90.0, load_global='pretrained_models/model_best.global', load_local='pretrained_models/model_best.local', load_slam='pretrained_models/model_best.slam', local_hidden_size=512, local_optimizer='adam,lr=0.0001', local_policy_update_freq=5, log_interval=10, map_pred_threshold=0.5, map_resolution=5, map_size_cm=2400, max_episode_length=1000, max_grad_norm=0.5, no_cuda=False, noise_level=1.0, noisy_actions=1, noisy_odometry=1, num_episodes=71, num_global_steps=40, num_local_steps=25, num_mini_batch=5, num_processes=10, num_processes_on_first_gpu=0, num_processes_per_gpu=5, obs_threshold=1, obstacle_boundary=5, pose_loss_coeff=10000.0, ppo_epoch=4, pretrained_resnet=1, print_images=1, proj_loss_coeff=1.0, randomize_env_every=1000, save_interval=1, save_periodic=100000, save_trajectory_data='0', seed=1, short_goal_dist=1, sim_gpu_id=0, slam_batch_size=72, slam_iterations=10, slam_memory_size=500000, slam_optimizer='adam,lr=0.0001', split='val_mt_small', task_config='tasks/pointnav_gibson.yaml', tau=0.95, total_num_scenes='auto', train_global=0, train_local=0, train_slam=0, use_deterministic_local=0, use_gae=False, use_pose_estimation=2, use_recurrent_global=0, use_recurrent_local=1, value_loss_coef=0.5, vis_type=1, vision_range=64, visualize=1)
...
I0423 05:09:22.963959 2397 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Elmira.navmesh
2021-04-23 05:09:22,965 initializing task Nav-v0
I0423 05:09:23.003241 2404 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Pablo.navmesh
2021-04-23 05:09:23,004 initializing task Nav-v0
I0423 05:09:23.131814 2403 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Greigsville.navmesh
2021-04-23 05:09:23,132 initializing task Nav-v0
I0423 05:09:23.141511 2392 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Edgemere.navmesh
2021-04-23 05:09:23,142 initializing task Nav-v0
I0423 05:09:23.175663 2402 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Eudora.navmesh
2021-04-23 05:09:23,176 initializing task Nav-v0
I0423 05:09:26.240513 2407 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Sisters.navmesh
2021-04-23 05:09:26,241 initializing task Nav-v0
I0423 05:09:26.261519 2391 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Denmark.navmesh
2021-04-23 05:09:26,263 initializing task Nav-v0
I0423 05:09:26.421341 2405 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Ribera.navmesh
2021-04-23 05:09:26,422 initializing task Nav-v0
I0423 05:09:26.574383 2406 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Sands.navmesh
2021-04-23 05:09:26,575 initializing task Nav-v0
I0423 05:09:28.335183 2408 simulator.py:80] Loaded navmesh data/scene_datasets/gibson/Swormville.navmesh
2021-04-23 05:09:28,336 initializing task Nav-v0
2021-04-23 05:09:28,364 Computing map for data/scene_datasets/gibson/Denmark.glb
2021-04-23 05:09:28,367 Computing map for data/scene_datasets/gibson/Edgemere.glb
2021-04-23 05:09:28,367 Computing map for data/scene_datasets/gibson/Elmira.glb
2021-04-23 05:09:28,368 Computing map for data/scene_datasets/gibson/Eudora.glb
2021-04-23 05:09:28,369 Computing map for data/scene_datasets/gibson/Greigsville.glb
2021-04-23 05:09:28,373 Computing map for data/scene_datasets/gibson/Ribera.glb
2021-04-23 05:09:28,372 Computing map for data/scene_datasets/gibson/Sisters.glb
2021-04-23 05:09:28,374 Computing map for data/scene_datasets/gibson/Swormville.glb
2021-04-23 05:09:28,377 Computing map for data/scene_datasets/gibson/Pablo.glb
2021-04-23 05:09:28,377 Computing map for data/scene_datasets/gibson/Sands.glb
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Helvetica'] not found. Falling back to DejaVu Sans.
2021-04-23 05:19:31,135 Computing map for data/scene_datasets/gibson/Eudora.glb
2021-04-23 05:19:31,145 Computing map for data/scene_datasets/gibson/Edgemere.glb
2021-04-23 05:19:31,159 Computing map for data/scene_datasets/gibson/Sands.glb
2021-04-23 05:19:31,159 Computing map for data/scene_datasets/gibson/Ribera.glb
2021-04-23 05:19:31,159 Computing map for data/scene_datasets/gibson/Pablo.glb
2021-04-23 05:19:31,161 Computing map for data/scene_datasets/gibson/Greigsville.glb
2021-04-23 05:19:31,161 Computing map for data/scene_datasets/gibson/Elmira.glb
2021-04-23 05:19:31,166 Computing map for data/scene_datasets/gibson/Denmark.glb
2021-04-23 05:19:31,170 Computing map for data/scene_datasets/gibson/Sisters.glb
2021-04-23 05:19:31,173 Computing map for data/scene_datasets/gibson/Swormville.glb
Loading slam pretrained_models/model_best.slam
Loading global pretrained_models/model_best.global
Loading local pretrained_models/model_best.local
Time: 00d 00h 00m 00s, num timesteps 0, FPS 0,
    Rewards:
    Losses:
Time: 00d 00h 00m 06s, num timesteps 100, FPS 15,
    Rewards:
    Losses:
Time: 00d 00h 00m 12s, num timesteps 200, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 00m 17s, num timesteps 300, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 00m 23s, num timesteps 400, FPS 17,
    Rewards:
    Losses:
Time: 00d 00h 00m 29s, num timesteps 500, FPS 17,
    Rewards:
    Losses:
Time: 00d 00h 00m 35s, num timesteps 600, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 00m 41s, num timesteps 700, FPS 17,
    Rewards:
    Losses:
Time: 00d 00h 00m 46s, num timesteps 800, FPS 17,
    Rewards:
    Losses:
...
Time: 00d 00h 09m 12s, num timesteps 9300, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 18s, num timesteps 9400, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 24s, num timesteps 9500, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 30s, num timesteps 9600, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 36s, num timesteps 9700, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 42s, num timesteps 9800, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 09m 48s, num timesteps 9900, FPS 16,
    Rewards:
    Losses:
Time: 00d 00h 10m 00s, num timesteps 10000, FPS 16,
    Rewards: Global step mean/med rew: 0.6008/0.0242,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 06s, num timesteps 10100, FPS 16,
    Rewards: Global step mean/med rew: 0.6008/0.0242,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 12s, num timesteps 10200, FPS 16,
    Rewards: Global step mean/med rew: 0.6008/0.0242,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 17s, num timesteps 10300, FPS 16,
    Rewards: Global step mean/med rew: 0.7254/0.0267,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 23s, num timesteps 10400, FPS 16,
    Rewards: Global step mean/med rew: 0.7254/0.0267,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 29s, num timesteps 10500, FPS 16,
    Rewards: Global step mean/med rew: 0.8288/0.0292,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 35s, num timesteps 10600, FPS 16,
    Rewards: Global step mean/med rew: 0.8288/0.0292,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 41s, num timesteps 10700, FPS 16,
    Rewards: Global step mean/med rew: 0.8288/0.0292,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 46s, num timesteps 10800, FPS 16,
    Rewards: Global step mean/med rew: 0.8729/0.0318,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:
Time: 00d 00h 10m 52s, num timesteps 10900, FPS 16,
    Rewards: Global step mean/med rew: 0.8729/0.0318,  Global eps mean/med/min/max eps rew: 24.032/22.616/16.667/33.158,
    Losses:

...

Time: 00d 00h 23m 39s, num timesteps 23800, FPS 16,
    Rewards: Global step mean/med rew: 0.7613/0.0318,  Global eps mean/med/min/max eps rew: 24.121/22.616/16.280/35.575,
    Losses:
Time: 00d 00h 23m 45s, num timesteps 23900, FPS 16,
    Rewards: Global step mean/med rew: 0.7613/0.0318,  Global eps mean/med/min/max eps rew: 24.121/22.616/16.280/35.575,
    Losses:
Time: 00d 00h 23m 51s, num timesteps 24000, FPS 16,
    Rewards: Global step mean/med rew: 0.7543/0.0401,  Global eps mean/med/min/max eps rew: 24.121/22.616/16.280/35.575,
    Losses:
Traceback (most recent call last):
  File "main.py", line 769, in <module>
    main()
  File "main.py", line 534, in main
    output = envs.get_short_term_goal(planner_inputs)
  File "/home/xxx/Neural-SLAM/env/__init__.py", line 50, in get_short_term_goal
    stg = self.venv.get_short_term_goal(inputs)
  File "/home/xxx/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 475, in get_short_term_goal
    results.append(read_fn())
  File "/home/xxx/anaconda3/envs/ns/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/xxx/anaconda3/envs/ns/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/xxx/anaconda3/envs/ns/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

Following the tests mentioned in issue #29, I verified the habitat-api installation with python examples/benchmark.py and the habitat-sim installation with python examples/example.py, as specified in their respective docs. No errors occurred in these steps.

Any help and advice would be appreciated!

devendrachaplot commented 3 years ago

Hi,

The error just indicates that the simulator crashed in at least one thread. I am not sure what went wrong; I would suggest trying the same command on another system to see whether the experiment crashes at the same step.

The rewards are printed only after at least one episode has completed, which in your case means after num_processes (10) x max_episode_length (1000) = 10000 steps. You can modify the reward logging here: https://github.com/devendrachaplot/Neural-SLAM/blob/master/main.py#L644

DEQDON commented 3 years ago

Thanks for your advice! I still haven't fixed this issue, but running with --auto_gpu_config 1 seems to work, albeit slowly. I suspect it is something deep in the operating system or the environment.