AnarManafov / PoD

http://pod.gsi.de
GNU General Public License v2.0

Cannot setup PROOF via ssh #4

Closed by ahazi137 8 years ago

ahazi137 commented 8 years ago

Hi,

I am trying to set up a PROOF environment on our group's local work server. I followed the instructions, but I still get a connection error and have no idea why. At the moment I have one master and one worker node for testing purposes: grid201 (master) and grid202 (worker). After starting the PoD server I get this output:

pod-server start

Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /home/proof_user/.PoD/wrk/PoDWorker.sh

XPROOFD [23486] port: 21002
PoD agent [23508] port: 22001
PROOF connection string: proof_user@grid201.kfki.hu:21002

2015-06-26 16:33:04.553 INF 0 [LOG singleton:thread-23885] LOG singleton has been initialized.
2015-06-26 16:33:04.553 INF 0 [PROOFAgent:thread-23885] pod-agent v.3.16
2015-06-26 16:33:04.553 INF 0 [CORE:thread-23885] Bringing >>> AgentServer <<< to life...
2015-06-26 16:33:04.553 INF 0 [CORE:thread-23885] Bringing >>> ThreadPool <<< to life...
2015-06-26 16:33:04.553 INF 0 [ThreadPool:thread-23887] starting a thread worker.
2015-06-26 16:33:04.553 INF 0 [ThreadPool:thread-23888] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23889] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23890] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [ThreadPool:thread-23891] starting a thread worker.
2015-06-26 16:33:04.554 INF 0 [AgentServer:thread-23885] Detected xpd [23863] on port 21002
2015-06-26 16:33:04.554 INF 0 [AgentServer:thread-23885] starting a monitor
2015-06-26 16:33:04.557 INF 0 [AgentServer:thread-23885] Entering into the main 'select' loop...
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Accepting the connetion from PoD UI: grid201.kfki.hu:43627
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Client requests a list of available workers.
2015-06-26 16:34:30.740 INF 0 [AgentServer:thread-23885] Client grid201.kfki.hu:43627 has just dropped the connection
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Accepting the connetion from PoD UI: grid201.kfki.hu:43628
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Client requests a list of available workers.
2015-06-26 16:34:35.193 INF 0 [AgentServer:thread-23885] Client grid201.kfki.hu:43628 has just dropped the connection

Do you have any idea what could be the issue here?

Best regards, Andras Hazi

AnarManafov commented 8 years ago

There are two common mistakes users make when using the ssh plug-in:

1. The ssh plug-in requires passwordless access to the worker nodes, using an ssh public key, an ssh agent, or any other method.
2. PoD intentionally doesn't create the root working directory on worker nodes. PoD will only create subdirectories there.
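
Both prerequisites can be checked from the master before submitting. The following is a minimal sketch and not part of PoD: `check_worker` and the `SSH` override are hypothetical names introduced here, and the host and directory in the usage line are the ones from this thread. It relies only on standard OpenSSH options (`BatchMode=yes` makes ssh fail instead of prompting for a password).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of PoD).
# SSH can be overridden, e.g. for testing; it defaults to the real ssh client.
SSH=${SSH:-ssh}

# check_worker HOST DIR
# Returns 0 if HOST accepts a non-interactive login and DIR exists and is
# writable there, 1 if the login fails, 2 if the directory check fails.
check_worker() {
    local host=$1 dir=$2
    # BatchMode=yes: never prompt for a password, fail instead.
    if ! "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "FAIL: no passwordless ssh access to $host" >&2
        return 1
    fi
    # PoD creates only subdirectories, so the root working directory
    # must already exist on the worker node.
    if ! "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$host" "test -d '$dir' && test -w '$dir'"; then
        echo "FAIL: $dir is missing or not writable on $host" >&2
        return 2
    fi
    echo "OK: $host:$dir"
}
```

For the setup discussed here, the call would be `check_worker grid202.kfki.hu /home/proof_user`.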

ahazi137 commented 8 years ago

Hello,

Thank you for your response. Yes, I am aware of the mistakes you mentioned. I have passwordless ssh set up for proof_user between grid201 (master) and grid202 (worker), and I have also created the working directory. Still, there is something wrong with this setup. My detailed working process is below:

Master: grid201
Worker: grid202
User: proof_user

Environment aliases:
alias rootenv='source /opt/ROOT/root/bin/thisroot.sh'
alias podenv='source /opt/PoD/3.16/PoD_env.sh'

My cfg:
[proof_user@grid201 ~]$ cat pod_ssh.cfg
@bash_begin@
# set environment
source /opt/ROOT/root/bin/thisroot.sh
@bash_end@

pw1, grid202.kfki.hu, , /home/proof_user, 1

[proof_user@grid201 ~]$ rootenv
[proof_user@grid201 ~]$ podenv
[proof_user@grid201 ~]$ pod-server start
Starting PoD server...
updating xproofd configuration file...
starting xproofd...
starting PoD agent...
preparing PoD worker package...
selecting pre-compiled bins to be added to worker package...
PoD worker package: /home/proof_user/.PoD/wrk/PoDWorker.sh

XPROOFD [12577] port: 21002
PoD agent [12599] port: 22001
PROOF connection string: proof_user@grid201.kfki.hu:21002

[proof_user@grid201 ~]$ pod-ssh -c pod_ssh.cfg submit --debug
* [Mon, 31 Aug 2015 11:46:59 +0200] preparing PoD worker package...
* [Mon, 31 Aug 2015 11:46:59 +0200] selecting pre-compiled bins to be added to worker package...
* [Mon, 31 Aug 2015 11:46:59 +0200] PoD worker package: /home/proof_user/.PoD/wrk/PoDWorker.sh
* [Mon, 31 Aug 2015 11:46:59 +0200] pod-ssh config contains an inline shell script. It will be injected it into wrk. package
* [Mon, 31 Aug 2015 11:46:59 +0200] preparing PoD worker package...
* [Mon, 31 Aug 2015 11:46:59 +0200] inline shell script is found and will be added to the package...
* [Mon, 31 Aug 2015 11:46:59 +0200] selecting pre-compiled bins to be added to worker package...
* [Mon, 31 Aug 2015 11:46:59 +0200] PoD worker package: /home/proof_user/.PoD/wrk/PoDWorker.sh
* [Mon, 31 Aug 2015 11:46:59 +0200] There are 5 threads in the tread-pool.
* [Mon, 31 Aug 2015 11:46:59 +0200] Number of PoD workers: 1
* [Mon, 31 Aug 2015 11:46:59 +0200] Number of PROOF workers: 1
* [Mon, 31 Aug 2015 11:46:59 +0200] Workers list:
* [Mon, 31 Aug 2015 11:46:59 +0200] [pw1] with 1 workers at grid202.kfki.hu:/home/proof_user/pw1
pw1 [Mon, 31 Aug 2015 11:46:59 +0200] pod-ssh-submit-worker is started for grid202.kfki.hu (dir: /home/proof_user/pw1, nworkers: 1, sshopt: )
* [Mon, 31 Aug 2015 11:47:00 +0200]

Successfully processed tasks: 1
Failed tasks: 0


[proof_user@grid201 ~]$ pod-info -n
0

root [0] TProofBench pb(gSystem->GetFromPipe("pod-info -c"));
Error in TGClient::TGClient: can't open display "grid202.kfki.hu:0.0", switching to batch mode...
In case you run from a remote ssh session, reconnect with ssh -Y
Starting master: opening connection ...
Starting master: OK
no resource currently available for this session: please retry later
Error in TProof::StartSlaves: no resources available or problems setting up workers (check logs)
Error in TProof::Open: new session could not be created
Error in TProofBench::TProofBench: could not open a valid PROOF session - cannot continue

root [0] TProofBench pb(gSystem->GetFromPipe("pod-info -c"));
sh: pod-info: command not found
Error in TUnixSystem::GetFromPipe: command "pod-info -c" returned 32512
+++ Starting PROOF-Lite with 8 workers +++
Opening connections to workers: OK (8 workers)
Setting up worker servers: OK (8 workers)
PROOF set to parallel mode (8 workers)
Run description: PROOF at , 8 workers
Info in TProofBench::SetOutFile: using default output file: 'proofbench-grid202.kfki.hu-lite-8w-20150831-1150.root'

[proof_user@grid202 ~]$ cat /tmp/test/pw1/ssh_worker.log
./PoDWorker.sh: line 320: lockfile: command not found
LC_PAPER=hu_HU.UTF-8
LC_ADDRESS=hu_HU.UTF-8
LC_MONETARY=hu_HU.UTF-8
SHELL=/bin/bash
SSH_CLIENT=148.6.8.201 37421 22
LC_NUMERIC=hu_HU.UTF-8
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
USER=proof_user
LC_TELEPHONE=hu_HU.UTF-8
MAIL=/var/mail/proof_user
PATH=/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin
LC_IDENTIFICATION=hu_HU.UTF-8
PWD=/tmp/test/pw1
LANG=en_US.UTF-8
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
LOADEDMODULES=
LC_MEASUREMENT=hu_HU.UTF-8
SHLVL=3
HOME=/home/proof_user
LOGNAME=proof_user
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=148.6.8.201 37421 148.6.8.202 22
MODULESHOME=/usr/share/Modules
LESSOPEN=||/usr/bin/lesspipe.sh %s
LC_TIME=hu_HU.UTF-8
G_BROKEN_FILENAMES=1
LC_NAME=hu_HU.UTF-8
BASH_FUNCmodule()=() { eval /usr/bin/modulecmd bash $* } =/bin/env
* [h, 31 aug 2015 11:15:17 +0200] +++ PoD Worker START +++
* [h, 31 aug 2015 11:15:17 +0200] Current working directory: /tmp/test/pw1
* [h, 31 aug 2015 11:15:17 +0200] Untar payload...
xpd.cf
PoD.cfg
version
server_info.cfg
user_worker_env.sh
pod-wrk-bin-3.16-Darwin-universal.tar.gz
pod-wrk-bin-3.16-Linux-amd64.tar.gz
pod-wrk-bin-3.16-Linux-x86.tar.gz
* [h, 31 aug 2015 11:15:17 +0200] Sourcing a user defined environment script...
* [h, 31 aug 2015 11:15:18 +0200] Current environment:
LC_PAPER=hu_HU.UTF-8
MANPATH=/opt/ROOT/root:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man/en:/usr/share/man
LC_ADDRESS=hu_HU.UTF-8
LC_MONETARY=hu_HU.UTF-8
SHELL=/bin/bash
SSH_CLIENT=148.6.8.201 37421 22
LC_NUMERIC=hu_HU.UTF-8
QTDIR=/usr/lib64/qt-3.3
QTINC=/usr/lib64/qt-3.3/include
USER=proof_user
LD_LIBRARY_PATH=/opt/ROOT/root/lib
LC_TELEPHONE=hu_HU.UTF-8
LIBPATH=/opt/ROOT/root/lib
POD_LOCATION=/tmp/test/pw1
MAIL=/var/mail/proof_user
PATH=/opt/ROOT/root/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin
LC_IDENTIFICATION=hu_HU.UTF-8
PWD=/tmp/test/pw1
LANG=en_US.UTF-8
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
LOADEDMODULES=
LC_MEASUREMENT=hu_HU.UTF-8
ROOTSYS=/opt/ROOT/root
SHLVL=3
HOME=/home/proof_user
DYLD_LIBRARY_PATH=/opt/ROOT/root/lib
PYTHONPATH=/opt/ROOT/root/lib
LOGNAME=proof_user
QTLIB=/usr/lib64/qt-3.3/lib
CVS_RSH=ssh
SSH_CONNECTION=148.6.8.201 37421 148.6.8.202 22
MODULESHOME=/usr/share/Modules
LESSOPEN=||/usr/bin/lesspipe.sh %s
SHLIB_PATH=/opt/ROOT/root/lib
LC_TIME=hu_HU.UTF-8
G_BROKEN_FILENAMES=1
LC_NAME=hu_HU.UTF-8
BASH_FUNCmodule()=() { eval /usr/bin/modulecmd bash $* } =/bin/env
* [h, 31 aug 2015 11:15:18 +0200] host's CPU/instruction set: amd64
* [h, 31 aug 2015 11:15:18 +0200] PoD worker runs on Linux-x86_64
pod-agent v3.16
protocol: v6
Report bugs/comments to A.Manafov@gsi.de
* [h, 31 aug 2015 11:15:18 +0200] using ROOTSYS: /opt/ROOT/root

Usage: xproofd [-b] [-c ] [-d] [-k {n|sz}] [-l ] [-L] [-n name] [-p ] [-P ] [-s pidfile] [-S site] []
mktemp: failed to create directory via template `/PoDWorker_XXXXXXXXXX': Permission denied
mktemp: failed to create directory via template `/PoDWorker_XXXXXXXXXX': Permission denied
chmod: missing operand after `777'
Try `chmod --help' for more information.
* [h, 31 aug 2015 11:15:18 +0200] Attempt to start pod-agent (1 out of 3)
* [h, 31 aug 2015 11:15:18 +0200] Attempt to start and detect xproofd (1 out of 10)
* [h, 31 aug 2015 11:15:18 +0200] trying to use XPROOF port: 21001
* [h, 31 aug 2015 11:15:18 +0200] starting xproofd...
Error: can't start xproofd.
* [h, 31 aug 2015 11:15:18 +0200] Attempt to start and detect xproofd (2 out of 10)
* [h, 31 aug 2015 11:15:18 +0200] trying to use XPROOF port: 21001
* [h, 31 aug 2015 11:15:18 +0200] starting xproofd...

(...)

Error: can't start xproofd.
* [h, 31 aug 2015 11:15:19 +0200] starting pod-agent...
* [h, 31 aug 2015 11:15:19 +0200] pod-agent is done, exit code: 100
* [h, 31 aug 2015 11:15:19 +0200] looks like xproofd has gone or has crashed...
* [h, 31 aug 2015 11:15:19 +0200] --- DONE ---
* [h, 31 aug 2015 11:15:19 +0200] Starting the cleaning procedure...
* [h, 31 aug 2015 11:15:19 +0200] Gracefully shut down PoD worker process(es): 20137
* [h, 31 aug 2015 11:15:19 +0200] done cleaning up.

Remarks: I see these error messages in the log:

mktemp: failed to create directory via template `/PoDWorker_XXXXXXXXXX': Permission denied
mktemp: failed to create directory via template `/PoDWorker_XXXXXXXXXX': Permission denied
chmod: missing operand after `777'

but I have no idea what the permission problem could be here. I tried writing into the worker's working directory from the master, and it worked. I was so desperate that I also tried running PoD as the root user :) Here is the output:

[proof_user@grid201 ~]$ pod-info -n
1
[proof_user@grid201 ~]$ pod-info -l
worker root@grid202.kfki.hu:21001 (direct connection) startup: 434s ( 7 minutes 14 seconds )

root [0] TProofBench pb(gSystem->GetFromPipe("pod-info -c"));
Starting master: opening connection ...
Starting master: OK
Opening connections to workers: OK (1 workers)
Note: File "iostream" already loaded
150701 15:45:29 3470 Proofx-I: Conn::Login: grid202.kfki.hu: CheckUser: 'root' logins not accepted
150701 15:45:29 3470 Proofx-E: Conn::GetAccessToSrv: client could not login at [grid202.kfki.hu:21001]
15:45:29 3470 Mst-0 | Warning in TProof::AddWorkers: worker '0.0' is invalid
PROOF set to sequential mode
Error in TProofBench::TProofBench: wrong max number of workers ('0')

So what am I doing wrong?

Best regards, Andras

AnarManafov commented 8 years ago

You probably need to define $TMPDIR on the worker nodes (WNs). PoD used to use /tmp, but users complained and asked for a dynamic directory path. Since then, PoD uses TMPDIR to determine the location of the PoD worker on a WN.

If needed, you can use the user environment script to define additional environment variables: http://pod.gsi.de/doc/3.16/Configuration.html#users_env_script

Let me know whether this hint helps.
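
The `mktemp` errors in the worker log are consistent with this hint: the failing template is a `PoDWorker_XXXXXXXXXX` directory directly under `/`, which is what a path of the form `$TMPDIR/PoDWorker_XXXXXXXXXX` becomes when `TMPDIR` is empty, and an ordinary user cannot create directories in `/`. The sketch below illustrates that effect and one possible guard. It assumes the worker script builds its template this way, which is inferred from the log and not confirmed from the PoD source; `make_workdir` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Illustration only. Assumes the template has the form
# "$TMPDIR/PoDWorker_XXXXXXXXXX" (inferred from the error message in the
# log, not taken from PoDWorker.sh).

# make_workdir: create a private scratch directory and print its path.
make_workdir() {
    # Guard: fall back to /tmp when TMPDIR is unset or empty, so the
    # template never collapses to "/PoDWorker_XXXXXXXXXX".
    local base=${TMPDIR:-/tmp}
    mkdir -p "$base" || return 1
    mktemp -d "$base/PoDWorker_XXXXXXXXXX"
}

# Without such a guard, an empty TMPDIR puts the template in the root
# directory, which a non-root user normally cannot write to:
#   TMPDIR= ; mktemp -d "$TMPDIR/PoDWorker_XXXXXXXXXX"
```

Exporting a valid `TMPDIR` in the user environment script, as suggested above, avoids the problem without touching the worker script.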

ahazi137 commented 8 years ago

Hi,

It seems that setting a valid TMPDIR value solved the problem.

I added the TMPDIR variable to pod_ssh.cfg:

@bash_begin@
# set environment
source /opt/ROOT/root/bin/thisroot.sh
export TMPDIR=$HOME/pw1
@bash_end@

pw1, grid202.kfki.hu, , /home/proof_user, 1

Now the master sees the worker, and the ROOT tests work as well.

[proof_user@grid201 ~]$ pod-info -n
1
[proof_user@grid201 ~]$ pod-info -l
worker proof_user@grid202.kfki.hu:21001 (direct connection) startup: 1s
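
When scripting this check, it can be convenient to wait until the expected number of workers has registered. This is a minimal sketch, assuming only that `pod-info -n` prints the number of available PROOF workers, as in the session above; `wait_for_workers` and the `POD_INFO` override are hypothetical names introduced here and are not part of PoD.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PoD).
# POD_INFO can be overridden, e.g. for testing; it defaults to pod-info.
POD_INFO=${POD_INFO:-pod-info}

# wait_for_workers WANTED [TIMEOUT_SECONDS]
# Polls "pod-info -n" about once per second until at least WANTED workers
# are online. Returns 0 on success and 1 on timeout.
wait_for_workers() {
    local wanted=$1 timeout=${2:-60} elapsed=0 n
    while [ "$elapsed" -le "$timeout" ]; do
        n=$("$POD_INFO" -n 2>/dev/null)
        # Treat empty or non-numeric output as zero workers.
        case "$n" in ''|*[!0-9]*) n=0 ;; esac
        if [ "$n" -ge "$wanted" ]; then
            echo "$n worker(s) online"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "timed out waiting for $wanted worker(s)" >&2
    return 1
}
```

With the single worker configured in this thread, `wait_for_workers 1 120` would return as soon as `pod-info -n` reports at least one worker.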

Thank you very much for your help!

Best Regards, Andras

AnarManafov commented 8 years ago

No problem.

Ping me if you run into other issues.