Package 'setuptools' requires a different Python: 3.5.2 not in '>=3.6'

TopCoder2K commented 2 years ago

I'm trying to set up Alfred using Docker. Steps to reproduce:

 git clone https://github.com/askforalfred/alfred.git alfred
 export ALFRED_ROOT=$(pwd)/alfred
 cd $ALFRED_ROOT
 python scripts/docker_build.py

The last command fails with

ERROR: Package 'setuptools' requires a different Python: 3.5.2 not in '>=3.6'
WARNING: You are using pip version 19.3.1; however, version 20.3.4 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
The command '/bin/sh -c pip install -U setuptools' returned a non-zero code: 1
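To illustrate what the error message is reporting, here is a minimal sketch of a `Requires-Python` check for a single `>=` bound. It is not pip's actual implementation (pip does full PEP 440 specifier matching), and `satisfies_min_python` is a hypothetical helper name used only for this example.

```python
def parse_version(text):
    """Turn '3.5.2' into (3, 5, 2) so versions compare as tuples."""
    return tuple(int(part) for part in text.split("."))


def satisfies_min_python(interpreter, specifier):
    """Check an interpreter version against a single '>=X.Y' specifier.

    Simplified on purpose: only the '>=' operator is handled.
    """
    if not specifier.startswith(">="):
        raise ValueError("this sketch only handles '>=' specifiers")
    return parse_version(interpreter) >= parse_version(specifier[2:])


print(satisfies_min_python("3.5.2", ">=3.6"))  # False: the case from the log
print(satisfies_min_python("3.6.9", ">=3.6"))  # True
```

With Python 3.5.2 in the image, any package release that declares `>=3.6` fails this kind of check, which is what the `ERROR:` line states.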

TopCoder2K commented 2 years ago

Would replacing pip install --upgrade pip==19.3.1 with pip install --upgrade pip==20.3.4 in the Dockerfile be a good solution for this? At least the image was built successfully :)

TopCoder2K commented 2 years ago

Maybe I should open another issue, but since I'm now using pip==20.3.4, I'll post it here. After building the image I ran the following commands:

python scripts/docker_run.py --headless
tmux new -s startx
sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024
sudo python ~/alfred/scripts/startx.py 0

And sudo python ~/alfred/scripts/startx.py 0 fails with

(EE) 
Fatal server error:
(EE) no screens found(EE) 
(EE) 
Please consult the The X.Org Foundation support 
         at http://wiki.x.org
 for help. 
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE) 
(EE) Server terminated with error (1). Closing log file

Why is this? (Xorg logs attached.) I've been searching online, and the reported causes vary widely. It might be worth installing the nvidia modules that Xorg is trying to load. Xorg.0.log
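When reading a long Xorg log, it can help to pull out only the lines marked `(EE)`, which is the marker Xorg uses for errors. The following is a small sketch for that; `xorg_error_lines` is a hypothetical helper, and the sample text is taken from the output above rather than from the attached log.

```python
def xorg_error_lines(log_text):
    """Return the '(EE)' lines that carry a message, stripped of whitespace."""
    errors = []
    for line in log_text.splitlines():
        stripped = line.strip()
        # Skip bare '(EE)' separator lines that contain no message.
        if "(EE)" in stripped and stripped.replace("(EE)", "").strip():
            errors.append(stripped)
    return errors


sample = """(EE)
Fatal server error:
(EE) no screens found(EE)
(EE) Server terminated with error (1). Closing log file
"""
for line in xorg_error_lines(sample):
    print(line)
```

To run it on a real log, read the file named in the error message (`/var/log/Xorg.0.log`) and pass its contents to the function.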

MohitShridhar commented 2 years ago

@TopCoder2K thanks for the tip on setuptools. I'll look into this.

Regarding (EE) no screens found(EE), I am not sure what is happening here. Have you tried startx.py 1 or >1? You might have to install extra packages depending on your hardware setup.

TopCoder2K commented 2 years ago

@MohitShridhar, thank you for responding quickly!

Have you tried startx.py 1 or >1?

Yes, I tried startx.py 1 and startx.py 2, the error remains.

You might have to install extra packages depending on your hardware setup.

Hmm, interesting... Have you encountered any specific cases?

MohitShridhar commented 2 years ago

Actually, I am not sure because the default docker setup works on my machine.

Maybe this might help: https://askubuntu.com/questions/1213538/ee-no-screens-found-when-startx ?

TopCoder2K commented 2 years ago

Maybe this might help: https://askubuntu.com/questions/1213538/ee-no-screens-found-when-startx ?

Thank you for the link! Yeah, I've come across this, that's why I mentioned installing the modules that Xorg is trying to load (by the way, module "fbdev" was loaded successfully). And I'll try to do it, but first I want to ask what is the output of sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024 on your machine? I have this, and it's strange:

pchelintsev@neurosymbolic-panov3:~$ sudo nvidia-xconfig -a --use-display-device=None --virtual=1280x1024
[sudo] password for pchelintsev: 

WARNING: Unable to locate/open X configuration file.

Package xorg-server was not found in the pkg-config search path.
Perhaps you should add the directory containing `xorg-server.pc'
to the PKG_CONFIG_PATH environment variable
No package 'xorg-server' found
New X configuration file written to '/etc/X11/xorg.conf'

It was mentioned here that "It will complain if you have not had an Xorg config file" but "No package 'xorg-server' found" looks more like an error than a complaint.

UPD 1.

According to https://www.x.org/releases/current/doc/man/man5/xorg.conf.5.xhtml and the contents of /etc/X11/xorg.conf, Xorg is trying to load nvidia drivers for my 2 GPUs on which 2 screens will run. I came across an interesting fact here:

Xorg searches for installed drivers automatically:

 -  If it cannot find the specific driver installed for the hardware (listed below), it first searches for fbdev ([xf86-video-fbdev](https://archlinux.org/packages/?name=xf86-video-fbdev)).
 -  If that is not found, it searches for vesa ([xf86-video-vesa](https://archlinux.org/packages/?name=xf86-video-vesa)), the generic driver, which handles a large number of chipsets but does not include any 2D or 3D acceleration.
 -  If vesa is not found, Xorg will fall back to [kernel mode setting](https://wiki.archlinux.org/title/Kernel_mode_setting), which includes GLAMOR acceleration (see [modesetting(4)](https://man.archlinux.org/man/modesetting.4)).

Although Ubuntu 16.04 is used inside the container, I don't think the behaviour is much different. The fact also partially explains "Failed to load module 'vesa'" error. Buuut I've checked that I have nvidia driver installed:

pchelintsev@neurosymbolic-panov3:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.63.01  Tue Aug  3 20:44:16 UTC 2021
GCC version:

So, why doesn't Xorg server see my nvidia drivers?(

UPD 2.

I also thought that my 470-driver is too new for cuda 9.0, which is inside the container, and tried to install this one:

pchelintsev@neurosymbolic-panov3:~$ sudo apt-get install nvidia-driver-384     
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package nvidia-driver-384

but I ended up with a strange error...

TopCoder2K commented 2 years ago

I also decided to try setting up Alfred on a local computer with a physical display. When I ran sudo python3 scripts/docker_build.py, I got:

Step 8/28 : RUN useradd -ms /bin/bash $USER_NAME
 ---> Running in 8c2872dcd725
useradd: user 'root' already exists
The command '/bin/sh -c useradd -ms /bin/bash $USER_NAME' returned a non-zero code: 9

So, I just commented out that line in the Dockerfile (it seems the problem occurred because I have only the root user on my computer). I also changed pip==19.3.1 to pip==20.3.4. After this the image was built successfully.

Then I tried to run the docker container. I got

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Executed with code  32000
non-network local connections being removed from access control list

but solved it with proper nvidia-container installation. After this I ran source /home/root/alfred_env/bin/activate, cd /home/$ALFRED_ROOT and python scripts/check_thor.py. The latter produced:

libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
Found path: /root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 'HP 24m 24"': 1920x1080 (primary device).
Logging to /root/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/home/root/alfred_env/lib/python3.5/site-packages/ai2thor/controller.py", line 697, in _start_unity_thread
    raise Exception("command: %s exited with %s" % (command, returncode))
Exception: command: ['/root/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64', '-screen-fullscreen', '0', '-screen-quality', '7', '-screen-width', '300', '-screen-height', '300'] exited with 1

which I haven't been able to handle yet. I saw https://github.com/askforalfred/alfred/issues/49 but since I don't use startx.py and I have libGL error, the discussions there are not really relevant, are they?

UPD from April 18th

I've created docker group, so I don't need to use sudo anymore. So, I was able to completely repeat all the commands with almost the original Dockerfile (except pip=20.3.4, of course), but the error remains. Also, I looked into the logs:

Desktop is 1920 x 1080 @ 60 Hz
Unable to find a supported OpenGL core profile
Failed to create valid graphics context: please ensure you meet the minimum requirements
E.g. OpenGL core profile 3.2 or later for OpenGL Core renderer
Vulkan detection: 0
No supported renderers found, exiting

(Filename:  Line: 634)

So, there are some problems with the renderers, but all the libs should have been installed in the image, right? Why is this? Also, I checked that OpenGL is installed on my local computer:

svyatoslav@svyatoslav-desktop ~/I/E/alfred (master)> glxinfo | grep "version"
server glx version string: 1.4
client glx version string: 1.4
GLX version: 1.4
OpenGL core profile version string: 4.6.0 NVIDIA 470.103.01
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL version string: 4.6.0 NVIDIA 470.103.01
OpenGL shading language version string: 4.60 NVIDIA
OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 470.103.01
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
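The Player.log above asks for an OpenGL core profile of 3.2 or later. As a rough sketch, the check can be automated by parsing `glxinfo` output; the helper names below are hypothetical, and running `glxinfo` inside the container (rather than on the host) is an assumption about how one would use this to diagnose the container case.

```python
import re


def core_profile_version(glxinfo_text):
    """Extract (major, minor) from the 'OpenGL core profile version string' line."""
    match = re.search(
        r"OpenGL core profile version string:\s*(\d+)\.(\d+)", glxinfo_text
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_unity_minimum(glxinfo_text, minimum=(3, 2)):
    """True if a core profile is reported and it is at least `minimum`."""
    version = core_profile_version(glxinfo_text)
    return version is not None and version >= minimum


host_output = """GLX version: 1.4
OpenGL core profile version string: 4.6.0 NVIDIA 470.103.01
OpenGL version string: 4.6.0 NVIDIA 470.103.01
"""
print(core_profile_version(host_output))          # (4, 6)
print(meets_unity_minimum(host_output))           # True
print(meets_unity_minimum("GLX version: 1.4\n"))  # False: no core profile line
```

The host passes this check, so the failure reported by Unity points at what is visible inside the container, not at the host driver.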

TopCoder2K commented 2 years ago

@MohitShridhar, I've updated the previous comments. I still can't solve (EE) no screens found(EE), probably it's connected with my nvidia drivers. And do you have any ideas why OpenGL isn't found inside a docker container running on my local computer with a monitor? I've googled something, but I don't have /usr/lib/nvidia-[version_number] folder

UPD 1

Here it's suggested to use another base container. So, I tried to use FROM nvidia/cudagl:9.0-devel-ubuntu16.04 and install libcudnn (I attached my Dockerfile) and it worked, although I got some errors... Are they serious? It looks like they are connected to the audio card...

(alfred_env) svyatoslav@svyatoslav-desktop:~/alfred$ python scripts/check_thor.py
Found path: /home/svyatoslav/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64
Mono path[0] = '/home/svyatoslav/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Managed'
Mono config path = '/home/svyatoslav/.ai2thor/releases/thor-201909061227-Linux64/thor-201909061227-Linux64_Data/Mono/etc'
Preloaded 'ScreenSelector.so'
Display 0 'HP 24m 24"': 1920x1080 (primary device).
Logging to /home/svyatoslav/.config/unity3d/Allen Institute for Artificial Intelligence/AI2-Thor/Player.log
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
(300, 300, 3)
Everything works!!!

Dockerfile(copy).txt

MohitShridhar commented 2 years ago

@TopCoder2K, this looks like good progress. If you are seeing

(300, 300, 3)
Everything works!!!

that means it's good to go. Try evaluating a pre-trained model. It should work, I think.

And thanks for the Docker fixes. I'll look into these when I get some time.

TopCoder2K commented 2 years ago

@MohitShridhar, I tried to run the evaluation, but got AttributeError: 'Namespace' object has no attribute 'use_templated_goals'. It seems that it's not related to the changes I committed to the Dockerfile.

Steps to reproduce:

Run the ALFRED docker container (I used the image with OpenGL: python scripts/docker_run.py --image svyatoslav-alfred-gl --container alfred-gl)
source ~/alfred_env/bin/activate
cd $ALFRED_ROOT
python models/eval/eval_seq2seq.py --model_path data/checkpoints/seq2seq_pm_chkpt/model\:seq2seq_im_mask\,name\:base30_pm010_sg010_01/best_seen.pth --eval_split valid_seen --data data/json_feat_2.1.0 --model models.model.seq2seq_im_mask --gpu --num_threads 1 --preprocess (I downloaded the pretrained model and also added --preprocess flag and set --num_threads 1 as noted here).

The result is:

{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Loading:  data/checkpoints/seq2seq_pm_chkpt/model:seq2seq_im_mask,name:base30_pm010_sg010_01/best_seen.pth

Preprocessing dataset and saving to pp folders ... This is will take a while. Do this once as required:
Preprocessing valid_seen
  0% (0 of 820) |                                        | Elapsed Time: 0:00:00 ETA:  --:--:--
Traceback (most recent call last):
  File "models/eval/eval_seq2seq.py", line 55, in <module>
    eval = EvalTask(args, manager)
  File "/home/svyatoslav/alfred/models/eval/eval.py", line 45, in __init__
    dataset.preprocess_splits(self.splits)
  File "/home/svyatoslav/alfred/data/preprocess.py", line 67, in preprocess_splits
    use_templated_goals = self.args.use_templated_goals and train_mode # templated goals are not available for the test set
AttributeError: 'Namespace' object has no attribute 'use_templated_goals'
100% (820 of 820) |#################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00

Moreover,

python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:seq2seq_im_mask,name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 1 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --preprocess

successfully starts the training (preprocessing finishes without errors, and I see the training progress). Why is this?

UPD 1

It seems that the pretrained model doesn't have that flag inside save["args"]. We can see that Dataset is created using self.model.args which are taken from the checkpoint.
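One possible workaround, shown here only as a sketch, is to fill in attributes that an older checkpoint's saved arguments lack before they are used. This is not the maintainers' fix; `fill_missing_args` and `MISSING_ARG_DEFAULTS` are hypothetical names, and defaulting `use_templated_goals` to `False` is an assumption.

```python
from argparse import Namespace

# Hypothetical defaults for flags that an older checkpoint may not carry.
MISSING_ARG_DEFAULTS = {"use_templated_goals": False}


def fill_missing_args(args, defaults=MISSING_ARG_DEFAULTS):
    """Add any absent attributes to `args` without overwriting existing ones."""
    for name, value in defaults.items():
        if not hasattr(args, name):
            setattr(args, name, value)
    return args


# Stand-in for save["args"] loaded from a checkpoint that predates the flag.
old_args = fill_missing_args(Namespace(data="data/json_feat_2.1.0"))
print(old_args.use_templated_goals)  # False
```

Applied to the arguments restored from the checkpoint, this would avoid the `AttributeError` while leaving values from newer checkpoints untouched.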

MohitShridhar commented 2 years ago

@TopCoder2K, as the code comment "# templated goals are not available for the test set" says, templated goals are not available for the test set. The ALFRED challenge involves grounding human-annotated goals, but you are free to use templated goals at training time, e.g. for augmentation.

TopCoder2K commented 2 years ago

@MohitShridhar, does this also apply to the valid_seen which I'm using? I didn't use test set above. If templated goals are not available for the valid split, then how should I run evaluation with --preprocess? (https://github.com/askforalfred/alfred/tree/master/models#task-evaluation)

MohitShridhar commented 2 years ago

@TopCoder2K, yes. See this for an extended discussion: https://github.com/askforalfred/alfred/issues/71#issuecomment-807887118

TopCoder2K commented 2 years ago

@MohitShridhar, should "Note: If you are training and evaluating on different machines or if you just downloaded a checkpoint, you need to run eval with --preprocess once with the appropriate dataset path. Also, after a fresh-install, run with --num_threads 1 to allow the script to download the THOR binary" be fixed then? It's impossible to run evaluation with --preprocess flag.

TopCoder2K commented 2 years ago

@MohitShridhar, after weeks of desperate forum reading and endless attempts to find solutions, I've given up running with docker. The problem was the X-server couldn't find the nvidia module. Moreover, I wasn't able to find nvidia_drv.so manually in the docker container filesystem, so solutions like the third answer from here didn't help.
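For anyone repeating the search for the Xorg driver module, a small sketch that walks a directory tree looking for a file by name is given below. `find_files` is a hypothetical helper, and `/usr/lib` is only an assumed starting point; the same script can be run on the host and inside the container to compare results.

```python
import os


def find_files(root, filename):
    """Yield full paths of every file called `filename` under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            yield os.path.join(dirpath, filename)


if __name__ == "__main__":
    # '/usr/lib' is an assumed starting point; widen it if nothing is found.
    hits = list(find_files("/usr/lib", "nvidia_drv.so"))
    print(hits if hits else "nvidia_drv.so not found under /usr/lib")
```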

However, I noticed that nvidia_drv.so is present on the server filesystem in /usr/lib/x86_64-linux-gnu/nvidia/xorg/, so I decided to try running without a docker container. And it worked!!! By the way, I also noticed a drop in performance. On valid_seen I got

SR: 8/820 = 0.010
GC: 141/2109 = 0.067
PLW SR: 0.003
PLW GC: 0.038

vs

SR: 0.037
GC: 0.10

in the article.
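As a quick sanity check on the arithmetic, the reported fractions can be recomputed from the raw counts. This sketch assumes SR is the task success rate and GC the goal-condition success rate; the ratios at the end compare against the paper figures quoted above.

```python
def rate(successes, total):
    """Fraction of successes, as reported for SR and GC."""
    return successes / total


sr = rate(8, 820)      # tasks fully completed
gc = rate(141, 2109)   # goal conditions satisfied
print("SR: {:.3f}".format(sr))  # SR: 0.010
print("GC: {:.3f}".format(gc))  # GC: 0.067

# Ratios against the paper figures quoted above (0.037 and 0.10).
print("SR ratio: {:.2f}".format(sr / 0.037))
print("GC ratio: {:.2f}".format(gc / 0.10))
```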

I'm ready to close the issue as soon as we:

solve problems with pip: https://github.com/askforalfred/alfred/issues/111#issuecomment-1098149462
decide whether it's reasonable to change the base image: https://github.com/askforalfred/alfred/issues/111#issuecomment-1101358341
fix documentation: https://github.com/askforalfred/alfred/issues/111#issuecomment-1118438127

gautierdag commented 1 year ago

@TopCoder2K Did you ever manage to run Alfred on a headless server?

I've been trying to get ai2thor==2.1.0 running on a SLURM node on which I don't have root privileges.

Xorg is out of the question because I don't have the privileges, but I have been able to use xvfb-run, though this crashes the UnityEngine somewhere.

The Unity error (in the Player.log):

SocketException: The socket has been shut down
  at System.Net.Sockets.Socket.Send (System.Byte[] buf) [0x00000] in <filename unknown>:0
  at AgentManager+<EmitFrame>c__Iterator3.MoveNext () [0x00000] in <filename unknown>:0
  at UnityEngine.SetupCoroutine.InvokeMoveNext (IEnumerator enumerator, IntPtr returnValueAddress) [0x00000] in <filename unknown>:0

(Filename:  Line: -1)

I haven't tried Docker yet, but would appreciate pointers if you have a working setup for a headless node or any recommendations.

Also, apologies for hijacking this old issue.

TopCoder2K commented 1 year ago

@gautierdag, I'm really sorry for my late reply. I've had tough weeks, I'm usually faster at answering... I hope you solved the problem or at least had a nice time :)

Did you ever manage to run Alfred on a headless server?

Yes, I did! Sorry, I haven't worked with SLURM yet, I've set up ALFRED on an ordinary server to which I connect via ssh.

The Unity error (in the Player.log):

Hmm, I don't remember hitting that error or using xvfb-run. At least, my current setup doesn't use it.

but would appreciate pointers if you have a working setup for a headless node or any recommendations

I've written a small guide in my native language and can translate it for you, but I used root privileges to install and start Xorg. If you already have Xorg running, then it looks like the root privileges are not required. Let me know if the guide is needed.

gautierdag commented 1 year ago

Yes please, I would love it if you could link your guide! I can try to google translate my way around.

I have still not been able to get the headless setup to run, and even tried docker as well but also got stuck in a similar problem (cluster environment restricts Docker usage).

TopCoder2K commented 1 year ago

cluster environment restricts Docker usage

I'm afraid that it's not possible to launch the AI2THOR simulator without docker and sudo... But I could be wrong.

I can try to google translate my way around

The phrases are often too short and too specific, so I decided to translate it myself (commands are in italics): https://disk.yandex.ru/i/A1aX_I9O7r5Zpw Hope this helps, and sorry for the rough formatting.)) I hope you don't need a Yandex account to see this (let me know if you do, and I'll paste the commands here).

gautierdag commented 1 year ago

@TopCoder2K Thank you - I can access it! I'll check it out and update if it works :)
