MarSaKi / VLN-BEVBert

[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"

An EOF error on R2R-CE #13

Open Bowen-sdu opened 3 months ago

Bowen-sdu commented 3 months ago

When I fine-tuned on R2R-CE, the program raised an EOFError, and I don't know why. The command I ran was:

CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_r2r/main.bash train 2333

My installed environment matches environment.txt.

###### train mode ######
2024-03-27 16:30:19,659 Initializing dataset VLN-CE-v1
2024-03-27 16:30:20,222 SPLTI: train, NUMBER OF SCENES: 61
2024-03-27 16:30:23,015 Initializing dataset VLN-CE-v1
2024-03-27 16:30:23,565 initializing sim Sim-v1
2024-03-27 16:30:36,184 Initializing task VLN-v0
2024-03-27 16:30:36,372 LOCAL RANK: 0, ENV NUM: 1, DATASET LEN: 10819
2024-03-27 16:30:43,860 Agent parameters: 337.67 MB. Trainable: 180.98 MB.
2024-03-27 16:30:43,860 Finished setting up policy.
2024-03-27 16:30:43,863 Traning Starts... GOOD LUCK!
Traceback (most recent call last):
  File "run.py", line 114, in <module>
    main()
  File "run.py", line 50, in main
    run_exp(**vars(args))
  File "run.py", line 107, in run_exp
    trainer.train()
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 666, in train
    logs = self._train_interval(interval, self.config.IL.ml_weight, sample_ratio)  # (200, 1.0, 0.75)
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 698, in _train_interval
    self.rollout('train', ml_weight, sample_ratio)
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 1095, in rollout
    teacher_actions = self._teacher_action_new(nav_inputs['gmap_vp_ids'], no_vp_left)
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 322, in _teacher_action_new
    curr_dis_to_goal = self.envs.call_at(i, "current_dist_to_goal")
  File "/home/huangbw/navigation/habitat-lab/habitat/core/vector_env.py", line 515, in call_at
    result = self._connection_read_fns[index]()
  File "/home/huangbw/navigation/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/huangbw/navigation/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f6d72a15cc0>>
Traceback (most recent call last):
  File "/home/huangbw/navigation/habitat-lab/habitat/core/vector_env.py", line 589, in __del__
    self.close()
  File "/home/huangbw/navigation/habitat-lab/habitat/core/vector_env.py", line 456, in close
    read_fn()
  File "/home/huangbw/navigation/habitat-lab/habitat/core/vector_env.py", line 97, in __call__
    res = self.read_fn()
  File "/home/huangbw/navigation/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:

MarSaKi commented 3 months ago

Hi there, frankly I don't know why this happens. It may be a bug inherited from Habitat; I have run into the same problem before. My workaround was to restart the machine or switch to another one.

Bowen-sdu commented 3 months ago

Thank you very much for your reply. I will try to do so.

Bowen-sdu commented 3 months ago

Hello, I ran the script on another server, but the error still occurred. I then upgraded habitat-sim to 0.2.0. That resolved the EOF error, but a different error appeared:

###### train mode ######
2024-03-30 11:16:24,452 Initializing dataset VLN-CE-v1
2024-03-30 11:16:25,056 SPLTI: train, NUMBER OF SCENES: 61
2024-03-30 11:16:27,929 Initializing dataset VLN-CE-v1
2024-03-30 11:16:28,513 initializing sim Sim-v1
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0330 11:16:28.628017 90056 SemanticScene.h:155] ::loadSemanticSceneDescriptor: File data/scene_datasets/mp3d/e9zR4mvMWw7/e9zR4mvMWw7.scn does not exist. Aborting load.
2024-03-30 11:16:41,681 Initializing task VLN-v0
2024-03-30 11:16:41,859 LOCAL RANK: 0, ENV NUM: 1, DATASET LEN: 10819
2024-03-30 11:16:49,562 Agent parameters: 337.67 MB. Trainable: 180.98 MB.
2024-03-30 11:16:49,562 Finished setting up policy.
2024-03-30 11:16:49,568 Traning Starts... GOOD LUCK!
Traceback (most recent call last):
  File "run.py", line 114, in <module>
    main()
  File "run.py", line 50, in main
    run_exp(**vars(args))
  File "run.py", line 107, in run_exp
    trainer.train()
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 666, in train
    logs = self._train_interval(interval, self.config.IL.ml_weight, sample_ratio)  # (200, 1.0, 0.75)
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 698, in _train_interval
    self.rollout('train', ml_weight, sample_ratio)
  File "/home/huangbw/navigation/BEVBert/bevbert_ce/vlnce_baselines/ss_trainer_BEV.py", line 973, in rollout
    batch = batch_obs(observations, self.device)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/contextlib.py", line 52, in inner
    return func(*args, **kwds)
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat_baselines/utils/common.py", line 171, in batch_obs
    reverse=True,
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat_baselines/utils/common.py", line 170, in <lambda>
    else np.prod(obs[name].shape),
AttributeError: 'list' object has no attribute 'shape'
Exception ignored in: <bound method VectorEnv.__del__ of <habitat.core.vector_env.VectorEnv object at 0x7f5e2ed90390>>
Traceback (most recent call last):
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat/core/vector_env.py", line 588, in __del__
    self.close()
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat/core/vector_env.py", line 459, in close
    write_fn((CLOSE_COMMAND, None))
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat/core/vector_env.py", line 118, in __call__
    self.write_fn(data)
  File "/home/huangbw/navigation/habitat-lab-0.2.0/habitat/utils/pickle5_multiprocessing.py", line 63, in send
    self.send_bytes(buf.getvalue())
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/huangbw/miniconda3/envs/python36/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

I couldn't find what the data/scene_datasets/mp3d/e9zR4mvMWw7/e9zR4mvMWw7.scn file is. Is it from the MP3D dataset or do I need to extract it myself? I am looking forward to your reply.

MarSaKi commented 3 months ago

No, this repo doesn't support Habitat 0.2.0, and using it will cause strange bugs. I suggest commenting out the lines "export GLOG_minloglevel=2" and "export MAGNUM_LOG=quiet" in the bash script, since they suppress Habitat's log output, so that you can see the detailed error logs.

Bowen-sdu commented 3 months ago

Thank you very much for your suggestions and help. The reason for this EOF error is that I did not put the .navmesh file from MP3D into the scene_datasets folder. Now I have solved this problem.
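
For anyone reading the first traceback: the EOFError itself only says that the process on the other end of the pipe went away before replying, which is why the message gives no hint about the missing scene file. The sketch below reproduces the same exception with the Python standard library alone. It is not Habitat code; the function name is made up for illustration, and closing the second end of the pipe merely stands in for a simulator worker that has exited.

```python
from multiprocessing import Pipe


def read_from_dead_worker():
    """Return the name of the exception raised when the peer end is gone."""
    parent_end, worker_end = Pipe()
    # Stand-in for a crashed worker: its end of the pipe is closed
    # without ever sending a reply.
    worker_end.close()
    try:
        parent_end.recv()
    except EOFError as exc:
        return type(exc).__name__
    finally:
        parent_end.close()
    return None


if __name__ == "__main__":
    print(read_from_dead_worker())
```

So when this error shows up during training, the useful information is usually in the worker's own output rather than in the parent's traceback.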

dongxinfeng1 commented 1 month ago

Hi, I've run into this problem too. Could you describe your solution in detail? Thanks a lot!

Bowen-sdu commented 1 month ago

Of course. I ran into this issue because I had only placed the .glb file in the scene_datasets folder. Make sure you place all the relevant files for each scan in the scene_datasets folder, including those with suffixes such as .glb, .house, .navmesh, and .ply.
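
If it helps, here is a rough sketch of a script that checks this before starting a training run. It assumes the scans live under data/scene_datasets/mp3d, one folder per scan, as in the path mentioned earlier in this thread. The function name and the suffix list are only my own illustration, not part of the repo, so adjust them to your setup.

```python
from pathlib import Path

# Suffixes named in this thread; adjust if your setup needs others.
REQUIRED_SUFFIXES = (".glb", ".house", ".navmesh", ".ply")


def find_incomplete_scans(mp3d_root, required=REQUIRED_SUFFIXES):
    """Map each scan folder name to the list of suffixes it is missing."""
    missing = {}
    for scan_dir in sorted(Path(mp3d_root).iterdir()):
        if not scan_dir.is_dir():
            continue
        # Collect the suffixes of all regular files in this scan folder.
        present = {p.suffix for p in scan_dir.iterdir() if p.is_file()}
        absent = [s for s in required if s not in present]
        if absent:
            missing[scan_dir.name] = absent
    return missing


if __name__ == "__main__":
    # Assumed location of the MP3D scans (hypothetical for your machine).
    root = Path("data/scene_datasets/mp3d")
    if root.is_dir():
        for scan, absent in find_incomplete_scans(root).items():
            print(f"{scan}: missing {', '.join(absent)}")
    else:
        print(f"{root} not found")
```

Any scan it lists is missing at least one of those file types.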