Closed YYDS-cc closed 1 year ago
Additionally, when I run python main.py object_nav_ithor_ppo_one_object -b projects/tutorials -s 12345, the monitor goes black momentarily. I know this is to open the search window, but after the monitor comes back, the terminal output is no longer updated. I have also run sudo python scripts/startx.py &, but it doesn't do anything.
Hi @YDDS-cc,
Given your setup, I think it would be worth it to try using THOR in headless mode. For that, you need to pass a gpu_device instead of an x_display (using the CloudRendering platform). You can see an example here:
Let us know if this unblocked you!
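For illustration, the switch between the two modes can be sketched as a small helper that builds the display-related keyword arguments for the THOR controller. This is only a sketch: the helper name thor_display_kwargs is hypothetical and is not part of AllenAct or AI2-THOR; it assumes the Controller accepts platform, gpu_device, and x_display as keyword arguments, as used elsewhere in this thread.

```python
from typing import Any, Dict, Optional


def thor_display_kwargs(
    headless: bool,
    gpu_device: Optional[int] = 0,
    x_display: Optional[str] = None,
) -> Dict[str, Any]:
    """Build display-related kwargs for ai2thor's Controller (hypothetical helper)."""
    if headless:
        # Imported lazily so the non-headless branch works without ai2thor installed.
        from ai2thor.platform import CloudRendering

        # Headless mode: render on a GPU through CloudRendering, no X server needed.
        return {"platform": CloudRendering, "gpu_device": gpu_device}
    # Non-headless mode: render through an X display such as ":0.0".
    return {"x_display": x_display}
```

With this, a headless controller would be created as Controller(**thor_display_kwargs(headless=True, gpu_device=0)).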
Hi @jordis-ai2, I tried changing headless to True, but it doesn't work. https://github.com/allenai/allenact/blob/9772eeeb7eacc1f9a83c90d1cf549a3f7e783c12/projects/objectnav_baselines/experiments/objectnav_thor_base.py#L75
I also tried commenting out this code, and it's still not working. https://github.com/allenai/allenact/blob/9772eeeb7eacc1f9a83c90d1cf549a3f7e783c12/projects/objectnav_baselines/experiments/objectnav_thor_base.py#L236
Did I change the code in the wrong place?
I think I need to see the output you get when using headless mode. Can you copy it here?
[09/01 17:24:13 INFO:] Running with args Namespace(approx_ckpt_step_interval=None, ... ,[main.py: 452]
[09/01 17:24:18 INFO:] Git diff saved to experiment_output/used_configs/ObjectNavThorPPO/2023-09-01_17-24-15 [runner.py: 890]
[09/01 17:24:18 INFO:] Config files saved to experiment_output/used_configs/ObjectNavThorPPO/2023-09-01_17-24-15 [runner.py: 935]
[09/01 17:24:18 INFO:] Using 1 train workers on devices (device(type='cuda', index=0),) [runner.py: 317]
[09/01 17:24:19 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:19 INFO:] Using local worker ids [0] (total 1 workers in machine 0) [runner.py: 326]
[09/01 17:24:19 INFO:] Started 1 train processes [runner.py: 595]
[09/01 17:24:19 INFO:] Using 1 valid workers on devices (device(type='cuda', index=1),) [runner.py: 317]
[09/01 17:24:19 INFO:] Started 1 valid processes [runner.py: 622]
[09/01 17:24:21 INFO:] valid 0 args [...][runner.py: 433]
[09/01 17:24:21 INFO:] train 0 args [...] [runner.py: 416]
[09/01 17:24:22 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:22 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:29 INFO:] Starting 0-th VectorSampledTask worker with args [...]
[09/01 17:24:31 INFO:] Starting 0-th SingleProcessVectorSampledTasks generator with args [...]
[09/01 17:24:31 INFO:] Starting 1-th VectorSampledTask worker with args [...]
[09/01 17:24:33 INFO:] Starting 0-th SingleProcessVectorSampledTasks generator with args [...]
[09/01 17:29:33 ERROR:] [train worker 0 ] Encountered TimeoutError , exiting. [engine.py: 1858]
  File "/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
    raise TimeoutError(
TimeoutError: Did not receive output from 'VectorSampledTask' worker for 300 seconds. [engine.py: 1861]
[09/01 17:29:34 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
[09/01 17:29:34 ERROR:] Traceback (most recent call last):
  File "/allenact/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated [runner.py: 1468]
Traceback (most recent call last):
  File "/allenact/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
    raise Exception(
Exception: Train worker 0 abnormally terminated
[09/01 17:29:34 INFO:] Terminating train 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Terminating valid 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Termination signal sent to worker Train-0. Worker Train-0 is already closed, exiting. [runner.py: 348]
[09/01 17:29:34 INFO:] Joining train 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Termination signal sent to worker Valid-0. Forcing worker Valid-0 to close and exiting. [runner.py: 353]
[09/01 17:29:35 INFO:] Closed train 0 [runner.py: 1543]
[09/01 17:29:35 INFO:] Joining valid 0 [runner.py: 1543]
[09/01 17:29:35 INFO:] Closed valid 0 [runner.py: 1543]
If you do export ALLENACT_DEBUG_VST_TIMEOUT=1000 before calling the command you are currently using to start your experiment, does it also fail (just after a longer period of waiting)?
Changing the waiting time doesn't work.
Actually, export ALLENACT_DEBUG_VST_TIMEOUT=1000 doesn't change the waiting time; it is still 300 seconds.
So I made the change in https://github.com/allenai/allenact/blob/9772eeeb7eacc1f9a83c90d1cf549a3f7e783c12/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py#L237 and I still get the same error; only the waiting time has changed.
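When an exported variable seems to have no effect, one thing worth ruling out is that it never reaches the Python process, for example because the command is launched through sudo or from a different shell. The sketch below only checks visibility of the variable; the helper name read_timeout_override is hypothetical, and this is not how AllenAct itself reads the setting.

```python
import os
from typing import Mapping, Optional


def read_timeout_override(
    env: Mapping[str, str] = os.environ,
    name: str = "ALLENACT_DEBUG_VST_TIMEOUT",
) -> Optional[int]:
    """Return the integer value of `name` if it is set in `env`, else None."""
    raw = env.get(name)
    if raw is None:
        # The variable did not reach this process.
        return None
    return int(raw)
```

Printing read_timeout_override() from the same shell, and in the same way the experiment is launched, shows whether the value 1000 is actually visible to Python.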
I assume at this point you must have already tried starting a standalone THOR controller to ensure everything is correctly installed, but just in case you haven't, can you try to run a script like:
from ai2thor.platform import CloudRendering
from ai2thor.controller import Controller
import cv2
c = Controller(platform=CloudRendering, gpu_device=0)
cv2.imwrite("/path/to/debug_output_image.png", c.last_event.frame[:,:,::-1])
c.stop()
?
Running the new code installed the thor-CloudRendering platform, and then a new issue came up. I ran into the same issue before when running the PointNav task with the command PYTHONPATH=. python allenact/main.py training_a_pointnav_model -o storage/robothor-pointnav-rgb-resnet-resnet -b projects/tutorials.
Issue: RuntimeError: vulkaninfo failed to run, please ask your administrator to install vulkaninfo (e.g. on Ubuntu systems this requires running sudo apt install vulkan-tools).
But when I run sudo apt install vulkan-tools, the server can't locate the package vulkan-tools. After running sudo apt-get update, it still doesn't work.
I installed the same environment on my PC (Ubuntu 18.04) following the tutorial, and both the PointNav and ObjectNav tasks run there without problems.
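A quick way to compare the two machines is to check whether a vulkaninfo executable is on the PATH and runs successfully. This is a minimal sketch using only the Python standard library; the function name vulkaninfo_available is hypothetical, and it is not the check that AI2-THOR performs internally.

```python
import shutil
import subprocess


def vulkaninfo_available() -> bool:
    """Return True if `vulkaninfo` is on PATH and exits with status 0."""
    exe = shutil.which("vulkaninfo")
    if exe is None:
        # Not installed, or not on PATH for this user.
        return False
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

If this returns False on the server but True on the PC, the difference is in the system's Vulkan tooling rather than in the Python environment.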
https://packages.ubuntu.com/search?keywords=vulkan-tools has a list of packages for different Ubuntu versions. It's possible that third parties provide vulkan-tools for other/older versions.
It sounds like this is out-of-scope for AllenAct, so I'm closing the issue.
When I run the command PYTHONPATH=. python allenact/main.py training_a_pointnav_model -o storage/robothor-pointnav-rgb-resnet-resnet -b projects/tutorials on a remote server with an attached display, I get the error
Exception: The following builds were found, but had missing dependencies. Only one valid platform is required to run AI2-THOR. Platform Linux64 failed validation with the following errors: Invalid display: :0.0. Failed to connect Can't connect to display ":0.0": b'No protocol specified\n'
Linux64 requires a X11 server to be running with GLX. The following valid displays were found :13.0
How can I solve this issue? Please help me, thanks!
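The error above reports that display :0.0 refused the connection while :13.0 was found to be valid. "No protocol specified" usually points to an X authorization problem rather than a missing server. As a rough sketch, the X server sockets present on a Linux machine can be listed as follows; the helper name list_x_displays is hypothetical, and it assumes the conventional socket directory /tmp/.X11-unix. A listed socket only means a server exists, not that the current user is allowed to connect to it.

```python
import os
import re
from typing import List


def list_x_displays(socket_dir: str = "/tmp/.X11-unix") -> List[str]:
    """List display names (e.g. ':13') for X server sockets found in socket_dir."""
    if not os.path.isdir(socket_dir):
        return []
    numbers = []
    for entry in os.listdir(socket_dir):
        # X server sockets are conventionally named X<display number>.
        match = re.fullmatch(r"X(\d+)", entry)
        if match:
            numbers.append(int(match.group(1)))
    return [f":{n}" for n in sorted(numbers)]
```

A display that is both listed and accessible, such as the :13.0 reported in the error, could then be tried as the x_display for the experiment.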