When I ran the Room Rearrangement task experiment, EOFError appeared

twb1235 commented 2 years ago

Problem

When I run the following command, the error shown in the screenshot appears:

allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py

The error occurred after the program had been running for some time.

Screenshots

Please add the following information:

OS: Ubuntu 9.3.0-17ubuntu1~20.04
Allenact: 0.4.0
Allenact-plugins: 0.4.0

Lucaweihs commented 2 years ago

Hi @twb1235,

Apologies for the delay, I have been away for the holidays. Usually the error you see there is due to running out of RAM, GPU memory, or available threads. Can you let me know:

How much RAM do you have?
How many GPUs are you using, and what are their details?
How many processors does your machine have?
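One way to collect the details asked for above is a small script like the following. This is only a sketch using the Python standard library: the helper names (total_ram_gib, gpu_summary, system_report) are made up for illustration, the RAM lookup assumes a POSIX system such as Linux, and the GPU listing assumes nvidia-smi is installed and on the PATH (it returns an empty list otherwise).

```python
import os
import shutil
import subprocess


def total_ram_gib():
    """Total physical RAM in GiB via POSIX sysconf; None if it cannot be read."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (ValueError, OSError, AttributeError):
        # sysconf or these names are unavailable on this platform.
        return None


def gpu_summary():
    """One 'name, total memory' string per GPU, or [] if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def system_report():
    """Gather logical CPU count, RAM, and GPU details in one dictionary."""
    return {
        "cpu_count": os.cpu_count(),
        "ram_gib": total_ram_gib(),
        "gpus": gpu_summary(),
    }


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Pasting the printed output into a reply should cover the three questions in one go.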

One thing you might want to try: open baseline_configs/rearrange_base.py and find all instances of if torch.cuda.is_available() else and change them to if False else. Upon rerunning

allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py

you should then be running with only a single process. If that works, you can try increasing the number of processes until errors occur; see the line:

nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1

in baseline_configs/rearrange_base.py.
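To illustrate the effect of that change, here is a minimal, self-contained sketch of the same conditional. It is not the actual code in baseline_configs/rearrange_base.py: the class SketchConfig, the FORCE_SINGLE_PROCESS flag, the value 8, and the cuda_available parameter (standing in for torch.cuda.is_available()) are all assumptions made so the example runs without torch.

```python
class SketchConfig:
    """Hypothetical stand-in for the experiment config, for illustration only."""

    # Assumption: plays the role of replacing the CUDA check with False.
    FORCE_SINGLE_PROCESS = True
    # Assumption: an arbitrary process count; the real value may differ.
    MAX_TRAIN_PROCESSES = 8

    @classmethod
    def num_train_processes(cls):
        return cls.MAX_TRAIN_PROCESSES

    @classmethod
    def resolve_nprocesses(cls, cuda_available):
        # Same shape as the line quoted above: many processes only when the
        # condition holds, otherwise fall back to a single process.
        use_many = cuda_available and not cls.FORCE_SINGLE_PROCESS
        return cls.num_train_processes() if use_many else 1


if __name__ == "__main__":
    # With the override on, a single process is used even if CUDA is available.
    print(SketchConfig.resolve_nprocesses(cuda_available=True))
```

Once the single-process run is stable, raising the process count step by step (for example 2, then 4) should show roughly where memory or thread limits are hit.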

Lucaweihs commented 2 years ago

I'm going to close this issue. Please feel free to reopen if you're still having trouble.

allenai / allenact

When I ran the Room Rearrangement task experiment, EOFError appeared #324
