Status: Closed (SmartAndCleverRobot closed this issue 2 years ago)
I found that the error occurs when the program reaches line 1433 of
embclip-zeroshot/allenact/algorithms/onpolicy_sync/engine.py:
num_done = int(self.num_workers_done.get("done"))
self.num_workers_done is defined as follows:
self.num_workers_done = torch.distributed.PrefixStore( # type:ignore
"num_workers_done", self.store
)
What could be causing the above error?
Solved! I replaced Python 3.10 with Python 3.8 and torch 1.11 with torch 1.8.1.
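For anyone hitting the same EOFError, a quick check of the local environment against the combination reported to work in this thread (Python 3.8, torch 1.8.1) might look like the sketch below. The helper names (parse_version, environment_report, installed_torch_version) are hypothetical and not part of AllenAct, and the pinned versions are only what worked for me here, not an official requirement.

```python
import re
import sys
from importlib import metadata

# Combination reported to work in this thread; an assumption, not an
# official requirement of the project.
WORKING_PYTHON = (3, 8)
WORKING_TORCH = (1, 8, 1)


def parse_version(text):
    """Turn a string such as '1.8.1+cu111' into (1, 8, 1)."""
    core = text.split("+")[0]  # drop a local build suffix like '+cu111'
    numbers = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def environment_report(python_version, torch_version):
    """Return a list of mismatch messages (empty if everything matches)."""
    problems = []
    if tuple(python_version[:2]) != WORKING_PYTHON:
        problems.append(
            "Python %d.%d found, thread reports 3.8 working"
            % (python_version[0], python_version[1])
        )
    if torch_version is None:
        problems.append("torch is not installed")
    elif parse_version(torch_version) != WORKING_TORCH:
        problems.append(
            "torch %s found, thread reports 1.8.1 working" % torch_version
        )
    return problems


def installed_torch_version():
    """Look up the installed torch version without importing torch."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    issues = environment_report(sys.version_info, installed_torch_version())
    for line in issues or ["Environment matches the combination reported to work"]:
        print(line)
```

Running this before starting training only tells you whether your versions differ from mine; a mismatch does not prove that it is the cause of the EOFError.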
I am trying to train the RoboTHOR ObjectNav DDPPO baseline in the usual way on my Ubuntu 20.04 server:
PYTHONPATH=. python allenact/main.py -o storage/objectnav-robothor-rgb-clip-rn50 -b projects/objectnav_baselines/experiments/robothor/clip objectnav_robothor_rgb_clipresnet50gru_ddppo
After I set up the environment and started training according to the instructions, an EOFError was raised. I have searched for a long time and could not find the cause. Could the authors give me some help? Thank you very much.