Closed: amaltaro closed this issue 3 months ago
@amaltaro I confirm I recently saw this error message from the cmssrv810 node at FNAL. Maybe we can investigate whether it is feasible to ask the FNAL admins to apply the desired changes within a reasonable amount of time?
Note there was a new wlcg-voms-cms package version this month:
http://linuxsoft.cern.ch/wlcg/centos7/x86_64/
CERN and FNAL hosts are using the same VOMS server, while our containers based on cmsweb-base are using two other, older servers.
More info:
We seem to install this package at some point in cmsweb-base: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/cmsweb-base/Dockerfile#L11
We then copy the files from the package into dmwm-base, which is in turn inherited by others like wmagent: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/dmwm-base/Dockerfile#L10
The CMS VOMS server list is different from the one on standard CERN or FNAL hosts. On the hosts, both CERN and FNAL point to a single server (maybe due to the new update in the package above?).
Inside the WMAgent docker container, two other servers are used.
Note I am not sharing the output, in order not to expose the specific VOMS server endpoints.
E.g., inside the WMAgent container:
$ ls /etc/vomses/
On lxplus, the vomses directory:
[khurtado@lxplus946 ~]$ ls /etc/vomses/cms*
On an FNAL host (e.g. cmssrv810), a single vomses configuration file:
[cmsdataops@cmssrv810 ~]$ cat /etc/vomses | grep cms
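To make the comparison concrete without exposing real endpoints, here is a small sketch (all server names below are made up) of what a vomses entry looks like and how one can count the configured CMS endpoints without printing them:

```shell
#!/bin/sh
# Sketch with made-up endpoints: a vomses entry is a quoted 5-field line,
#   "alias" "host" "port" "server DN" "vo"
# Count the CMS entries without printing the endpoints themselves.
tmpdir=$(mktemp -d)
cat > "$tmpdir/vomses" <<'EOF'
"cms" "voms-a.example.org" "15002" "/DC=org/CN=voms-a.example.org" "cms"
"cms" "voms-b.example.org" "15002" "/DC=org/CN=voms-b.example.org" "cms"
EOF
count=$(grep -c '"cms"$' "$tmpdir/vomses")
echo "CMS VOMS endpoints configured: $count"
rm -rf "$tmpdir"
```

On a real host one would point the `grep` at `/etc/vomses` (a file at FNAL, a directory on lxplus), as in the listings above.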
@khurtado I created this issue in our wmcore-docs repository: https://gitlab.cern.ch/dmwm/wmcore-docs/-/issues/2 just so we can better discuss and log our investigation. Do you think you could look into this?
About the cmsweb-base image, indeed this is something we need to discuss with Aroosha in the coming days.
@amaltaro FNAL/CERN hosts use the voms-clients-java package, which provides voms-proxy-init3 (v3).
We have been using voms-proxy-init2 in WMAgent, which has some incompatibility with CERN IAM servers after their latest updates. Installing the v3 Java-based package fixes the issue with CERN VOMS servers (voms-proxy-init is automatically linked to that version, so nothing else needs to be done).
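As a sketch of that automatic linking (using fake binaries in a temporary directory, not the real package layout or install scripts), the generic `voms-proxy-init` name simply resolves to the v3 executable:

```shell
#!/bin/sh
# Sketch: simulate voms-clients-java shipping voms-proxy-init3, with the
# generic voms-proxy-init name symlinked to it. Paths here are a temp
# directory, not the real /usr/bin layout.
bin=$(mktemp -d)
printf '#!/bin/sh\necho "voms-proxy-init v3 (Java client)"\n' > "$bin/voms-proxy-init3"
chmod +x "$bin/voms-proxy-init3"
ln -s "$bin/voms-proxy-init3" "$bin/voms-proxy-init"
# Invoking the generic name runs the v3 client:
v3out=$("$bin/voms-proxy-init")
echo "$v3out"
rm -rf "$bin"
```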
Here is a PR that fixes the issue. In the GitLab docs issue, I am showing the output with exit code 0 from the VOMS server we had issues with:
https://github.com/dmwm/CMSKubernetes/pull/1497 https://gitlab.cern.ch/dmwm/wmcore-docs/-/issues/2
@amaltaro @khurtado I personally had a bad experience with a previous version of this Java package a few years ago while working with the CRAB3 machines. But honestly, I do not remember the details any more, so I cannot express a strong opinion on the matter.
About a workaround for the problem at hand: usually, upon retry with `manage renew-proxy`, the operation succeeds (maybe because it retries with the other server). So, if we are sure we want to avoid installing the Java package for the time being, we can do two things to ensure the containers start up with a freshly renewed proxy. (And I say freshly renewed at startup, because this error can only be observed when we try to renew an already existing proxy preserved on the host from a previous container; during the long-term run of the container, the cron job retries those operations and eventually succeeds.)

1. Wrap the `_renew_proxy` call at this line: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/wmagent/init.sh#L676 like that:

   `(_renew_proxy) || { err=$?; echo "ERROR: _renew_proxy"; exit $err ;}`

2. AND retry the `$myproxyCmd && $vomsproxyCmd` command in the `_renew_proxy` function here: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/wmagent/bin/manage-common.sh#L322 like that:

   `($myproxyCmd && $vomsproxyCmd) || ($myproxyCmd && $vomsproxyCmd)`
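The retry above can be sketched end to end like this; `myproxyCmd`/`vomsproxyCmd` are the variable names from manage-common.sh, while the stand-in commands (a `flaky_voms` function that fails only on its first call) are hypothetical:

```shell
#!/bin/sh
# Sketch of the proposed retry: run the renewal pipeline once and, on
# failure, retry it once before giving up. The real commands would be
# myproxy-logon and voms-proxy-init; flaky_voms mimics a VOMS server
# that fails on the first attempt only.
state=$(mktemp)
flaky_voms() {
    if [ -s "$state" ]; then
        echo "voms proxy renewed"
    else
        echo attempted > "$state"   # remember the failed first attempt
        return 1
    fi
}
myproxyCmd="true"
vomsproxyCmd="flaky_voms"

result=$( ($myproxyCmd && $vomsproxyCmd) || ($myproxyCmd && $vomsproxyCmd) ) \
    || { echo "ERROR: proxy renewal failed twice"; exit 1; }
echo "$result"
rm -f "$state"
```

The first subshell fails, the second one succeeds, so startup proceeds; only a double failure aborts.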
So the combination of these two measures will definitely fix the problem we currently have with this version of voms-proxy-init.
... Well, when it comes to the future of tokens etc., I cannot predict whether we are actually just delaying a problem here. Honestly, I was left with the impression that once we move to tokens we should no longer have to bother with voms-proxy and related machinery, but maybe I got that wrong.
@todor-ivanov Yes, to me this is a matter of whether we want to: A) upgrade and be up to date with the VOMS client that CERN/FNAL hosts use nowadays (version 3), but pay an increase in the WMAgent image size of 9-10%; or B) avoid increasing the image size, but work with a different VOMS client version than the one used on the hosts (version 2). Version 2 is updated frequently, but right now, for example, 2.1.0 has had a release candidate for a month; OSG updated to the release candidate two weeks ago due to the issues with CERN IAM servers, but Debian 11 may not pick it up until the final release. Migrating from Debian 11 to Alma9 would let us benefit from OSG's package handling in these matters, but I cannot compare what the image size would be (and all the changes needed) if we migrate the container to Alma9.
Hi @khurtado, option A) is fine by me. We may still think about implementing those two improvements I mentioned in my previous comment, though. They would have a positive effect, in parallel to the VOMS client version upgrade. But I'd let you and @amaltaro decide whether that is needed.
**Impact of the bug**
WMAgent
**Describe the bug**
NOTE that I intentionally do not provide stdout details here, so as not to expose the VOMS server endpoint.
While testing WMAgent tag `2.3.4rc5`, I noticed that running the container fails at the `_renew_proxy` step at Fermilab, while at CERN it works just fine. The error message itself is omitted here intentionally (see the note above). The apparent problem is that we are using different VOMS servers between the CERN and Fermilab agents: the server used at CERN returns exit code 0, while the one at FNAL returns exit code 1, hence stopping the WMAgent startup process.
**How to reproduce it**
Execute this command: https://github.com/dmwm/CMSKubernetes/blob/8bcd434/docker/pypi/wmagent/bin/manage-common.sh#L316
**Expected behavior**
I would say we could relax the `_renew_proxy` process during container startup and not quit if either of those two commands returns a non-zero exit code. Another alternative would be, for instance, to ask the FNAL admins to change the VOMS server to the same one used at CERN, if possible.
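A minimal sketch of that relaxation (the `_renew_proxy` stub here is hypothetical and always fails; the real function lives in manage-common.sh):

```shell
#!/bin/sh
# Sketch: log a warning instead of aborting container startup when proxy
# renewal fails; the periodic cron job is expected to retry later.
_renew_proxy() { return 1; }   # stub standing in for the real function

msg=$(_renew_proxy || echo "WARNING: _renew_proxy exited with code $?; continuing, cron will retry")
echo "$msg"
echo "WMAgent container startup continues"
```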
**Additional context and error message**
None