Different behavior between VOMS server failing WMAgent container run

amaltaro commented 3 months ago

Impact of the bug: WMAgent

Describe the bug: Note that I intentionally do not include the stdout details here, so as not to expose the VOMS server endpoint.

While testing WMAgent tag 2.3.4rc5, I noticed that running the container fails at the _renew_proxy step at Fermilab, while at CERN it works fine. The only error message is:

...
Warning: your certificate and proxy will expire Sat Jun  1 14:26:23 2024
Creating proxy  Done

which is within the requested lifetime of the proxy
ERROR: _renew_proxy
Start sleeping now ...zzz...

The apparent problem is that the CERN and Fermilab agents use different VOMS servers: at CERN the command returns exit code 0, while at FNAL it returns exit code 1, which stops the WMAgent startup process.

How to reproduce it: Execute this command: https://github.com/dmwm/CMSKubernetes/blob/8bcd434/docker/pypi/wmagent/bin/manage-common.sh#L316

Expected behavior: I would say we could relax the _renew_proxy step during container startup, and not quit if either of those 2 commands returns a non-zero exit code.

Another alternative would be, for instance, to ask the FNAL admins to switch to the same VOMS server used at CERN, if possible.

Additional context and error message: None

anpicci commented 3 months ago

@amaltaro I confirm I recently saw this error message on the cmssrv810 node at FNAL. Maybe we can investigate whether the FNAL admins could apply the desired changes within a time frame that works for us?

khurtado commented 3 months ago

Note there was a new wlcg-voms-cms package version this month:

http://linuxsoft.cern.ch/wlcg/centos7/x86_64/

CERN and FNAL hosts are using the same VOMS server, while our containers based on cmsweb-base are using 2 other older servers.

More info:

We seem to install this package at some point in cmsweb-base: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/cmsweb-base/Dockerfile#L11

And copy the files from the package in dmwm-base, which is then inherited by others like wmagent: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/dmwm-base/Dockerfile#L10

The VOMS cms server list is different from the one on standard CERN or FNAL hosts. On the hosts, both CERN and FNAL redirect to a single server (maybe due to the recent update of the package above?).

Inside WMAgent docker, 2 other servers are used.

Note I am not sharing the output in order not to expose the specific VOMS server endpoints.

E.g., inside the WMAgent container:

$ ls /etc/vomses/

On lxplus: VOMSES dir:

[khurtado@lxplus946 ~]$ ls /etc/vomses/cms*

On FNAL HOST (e.g.: cmssrv810), single vomses configuration file:

[cmsdataops@cmssrv810 ~]$ cat /etc/vomses| grep cms
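
One way to compare the three configurations without posting any endpoint is to exchange digests instead of the entries themselves. The following is only a sketch: `vomses_digest` is a hypothetical helper, not part of WMAgent, and it assumes the usual vomses layout of quoted fields per line, with the VO name quoted as "cms".

```shell
# Hypothetical helper (not part of WMAgent): print a SHA-256 digest of the
# sorted, de-duplicated cms entries found in a vomses file or directory.
vomses_digest() {
    # grep -r accepts a single file as well as a directory; -h omits file
    # names so that only the entries themselves contribute to the digest.
    # LC_ALL=C keeps the sort order identical across hosts and locales.
    grep -rh '"cms"' "$1" | LC_ALL=C sort -u | sha256sum | cut -d' ' -f1
}

# Run the same command in the container, on lxplus and on the FNAL host,
# then compare the printed digests.
if [ -e /etc/vomses ]; then
    vomses_digest /etc/vomses
fi
```

Equal digests mean the cms entries are byte-for-byte identical; any difference, including whitespace, changes the digest. A digest of a short entry can be guessed by someone who already has candidate server names, so this is a comparison aid rather than a way to keep the endpoints secret.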

amaltaro commented 3 months ago

@khurtado I created this issue in our wmcore-docs repository: https://gitlab.cern.ch/dmwm/wmcore-docs/-/issues/2 just so we can better discuss and log our investigation. Do you think you could look into this?

About the cmsweb-base image, indeed this is something we need to discuss with Aroosha in the coming days.

khurtado commented 3 months ago

@amaltaro FNAL/CERN hosts use the voms-clients-java package. This provides voms-proxy-init3 (v3).

We have been using voms-proxy-init2 in WMAgent, which has some incompatibility with the CERN IAM servers after their latest updates. Installing the Java-based v3 package fixes the issue with the CERN VOMS servers (voms-proxy-init is automatically linked to this version, so there is nothing else to do).
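
To confirm which client a given container or host actually runs, it can help to follow the `voms-proxy-init` link to its target. This is a minimal sketch, assuming `readlink -f` is available (GNU coreutils); `resolve_cmd` is a hypothetical helper name, not an existing WMAgent function.

```shell
# Hypothetical helper: print the real file that a command name resolves to,
# following any chain of symlinks.
resolve_cmd() {
    local path
    # command -v prints the path of the executable found in PATH
    path=$(command -v "$1") || { echo "$1: not found in PATH" >&2; return 1; }
    readlink -f "$path"
}

# Prints the link target if the client is installed; reports on stderr otherwise.
resolve_cmd voms-proxy-init || true
```

Running it on a host and inside the container shows whether both resolve to the same client binary.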

Here is a PR that fixes the issue. In the GitLab docs issue, I am showing the output, with exit code 0, from the VOMS server we had issues with.

https://github.com/dmwm/CMSKubernetes/pull/1497 https://gitlab.cern.ch/dmwm/wmcore-docs/-/issues/2

todor-ivanov commented 3 months ago

@amaltaro @khurtado I personally had a bad experience with a previous version of this Java package a few years ago while working with the CRAB3 machines. But honestly I do not remember the details any more, so I cannot express a strong opinion on the matter.

About a workaround for the problem at hand: usually, the operation succeeds upon a retry with manage renew-proxy (maybe because it retries with the other server). So if we are sure we want to avoid installing the Java package for the time being, we can do two things to ensure the containers start up with a freshly renewed proxy. (I say freshly renewed at startup because this error can be observed only when we try to renew an already existing proxy preserved on the host from a previous container; during the long-term run of the container, the cronjob will retry those operations and will eventually succeed.)

First, we can relax, as Alan suggests, the check of the exit code of the _renew_proxy call at this line: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/wmagent/init.sh#L676
```
  (_renew_proxy)               || { err=$?; echo "ERROR: _renew_proxy"; exit $err ;} 
```
And second, we can retry the operation directly in the _renew_proxy function here: https://github.com/dmwm/CMSKubernetes/blob/169a72709365ad6e9d7acb8ad789880018f03fcc/docker/pypi/wmagent/bin/manage-common.sh#L322
```
$myproxyCmd && $vomsproxyCmd
```
like this:
```
($myproxyCmd && $vomsproxyCmd) || ($myproxyCmd && $vomsproxyCmd)
```

So the combination of these two measures will definitely fix the problem we currently have with this version of voms-proxy-init. When it comes to the future with tokens, though, I cannot predict whether we are just delaying a problem here. Honestly, I was left with the impression that once we move to tokens we should no longer need to bother with voms-proxy at all, but maybe I got that wrong.
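
The two measures above can be sketched as follows. This is an illustration rather than the actual patch: `retry`, `_renew_proxy_once`, and `RETRY_DELAY` are hypothetical names, and `myproxyCmd` and `vomsproxyCmd` default to the placeholder `true` so that the snippet runs on its own, whereas the real command strings are built in manage-common.sh.

```shell
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts; returns 0 on the first success, else the last exit code.
retry() {
    local max_attempts delay attempt rc
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        rc=$?
        echo "WARNING: attempt $attempt/$max_attempts failed (exit code $rc)" >&2
        [ "$attempt" -ge "$max_attempts" ] && return "$rc"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Placeholders (assumption) so that the sketch is self-contained.
myproxyCmd=${myproxyCmd:-true}
vomsproxyCmd=${vomsproxyCmd:-true}

# One renewal pass, as in the existing function body.
_renew_proxy_once() { $myproxyCmd && $vomsproxyCmd; }

# Second measure: retry the renewal inside _renew_proxy.
_renew_proxy() { retry 2 "${RETRY_DELAY:-5}" _renew_proxy_once; }

# First measure: relaxed startup check that warns and continues instead of exiting.
(_renew_proxy) || echo "WARNING: _renew_proxy failed (exit code $?); continuing startup"
```

A bounded retry helper behaves like the duplicated `( ... ) || ( ... )` form when called with 2 attempts, but it also logs each failure and allows a pause between attempts, which may help if the failure is tied to one particular server.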

khurtado commented 3 months ago

@todor-ivanov Yes, to me this is a matter of whether we want to: A) upgrade and be up to date with the VOMS init client that CERN/FNAL hosts use nowadays (version 3), at the cost of a 9-10% increase in the WMAgent image size; or B) keep the image size as it is, but work with a different VOMS client version than the one used on the hosts (version 2). Version 2 is updated frequently, but right now, for example, 2.1.0 has had a release candidate for a month. OSG updated to that release candidate 2 weeks ago because of issues with the CERN IAM servers, but Debian 11 may not update until the final release. Migrating from Debian 11 to Alma9 would let us benefit from OSG's handling of these packages, but I can't say what the image size would be (or what other changes would be needed) if we migrated the container to Alma9.

todor-ivanov commented 3 months ago

Hi @khurtado, option A) is fine by me. We may still think about implementing the two improvements I mentioned in my previous comment, though. They would have a positive effect alongside the VOMS client version upgrade. But I'll let you and @amaltaro decide whether they are needed.

dmwm / WMCore

Different behavior between VOMS server failing WMAgent container run #11999