dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org

frontend - TAPE REST API: Transient `SSL handshake failed: sslv3 alert certificate unknown` errors #7597

Open vingar opened 2 weeks ago

vingar commented 2 weeks ago

Hello,

We observed some mass transient SSL errors when FTS queries the status of staging requests against the frontend servers:

STAGING [13] [Tape REST API] Stage pooling call failed: (Neon): SSL handshake failed: sslv3 alert certificate unknown

See https://fts.usatlas.bnl.gov:8449/fts3/ftsmon/#/job/fd576cd4-2c56-11ef-8623-00163e1051a4

It seems to correspond to an error when the server fails to validate the client certificate and its certification authority. These servers also host gPlazma, and we did not observe any authentication failures at that time. We are thinking that it might correspond to an issue when the CRLs are renewed and reloaded on the frontends.

This error message could be reproduced by having an empty /etc/grid-security/certificates directory on the frontend with the python code below.

Any help appreciated.

#!/usr/bin/env python3

import requests
id = "32485037-df6d-4c96-ab79-c409e0e2f238"
url = f'https://dcint-frontend001.sdcc.bnl.gov:3880/api/v1/tape/stage/{id}'
headers = {'Content-Type': 'application/json'}
cert_path = '/tmp/x509up_u0'
response = requests.get(url, headers=headers, cert=(cert_path, cert_path), verify='/etc/grid-security/certificates')
print(response.text)
print(response)

requests.exceptions.SSLError: HTTPSConnectionPool(host='dcint-frontend001.sdcc.bnl.gov', port=3880): Max retries exceeded with url: /api/v1/tape/stage/27ec6771-7d28-483d-97e6-99e2df30f959 (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)'),))

kofemann commented 2 weeks ago

do you run fetch-crl?

DmitryLitvintsev commented 2 weeks ago

Vincent:

$ cat rest.py
#!/usr/bin/env python3

import os
import requests

id = "32485037-df6d-4c96-ab79-c409e0e2f238"
url = f'https://cmsdcatape.fnal.gov:3880/api/v1/tape/stage/{id}'
headers = {'Content-Type': 'application/json'}
uid = os.getuid()
cert_path = f'/tmp/x509up_u{uid}'
response = requests.get(url, headers=headers, cert=(cert_path, cert_path), verify='/etc/grid-security/certificates')
print(response.text)
print(response)

Running without a VOMS proxy:

$ python3 rest.py
Traceback (most recent call last):
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/connection.py", line 424, in connect
    tls_in_tls=tls_in_tls,
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 450, in ssl_wrap_socket
    sock, context, tls_in_tls, server_hostname=server_hostname
  File "/home/litvinse/.local/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/usr/lib64/python3.6/ssl.py", line 365, in wrap_socket
    _context=self, _session=session)
  File "/usr/lib64/python3.6/ssl.py", line 776, in __init__
    self.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)

Running with a VOMS proxy:

$ voms-proxy-info 
subject   : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse/CN=4175574056
issuer    : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse
identity  : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Dmitry Litvintsev/CN=UID:litvinse
type      : RFC compliant proxy
strength  : 2048 bits
path      : /tmp/x509up_u8637
timeleft  : 119:58:03
$ python3 rest.py
{"detail":"request 32485037-df6d-4c96-ab79-c409e0e2f238 not found","title":"Not Found","status":"404"}
<Response [404]>

vingar commented 2 weeks ago

> do you run fetch-crl?

Every 6 hours.

vingar commented 2 weeks ago

@DmitryLitvintsev

> ssl.SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:877)

Good catch. The error might indicate a problem on the client side as well, which makes it harder to debug.
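As a rough aid for triage, the reason code in the message can be used to tell which end aborted the handshake. The sketch below is only an illustration: `classify_ssl_error` is a hypothetical helper, and it assumes the usual OpenSSL convention that a reason containing `_ALERT_` is an alert received from the peer, while `CERTIFICATE_VERIFY_FAILED` is raised by local verification. It cannot tell why the server sent the alert, so a missing proxy on the client and a trust-store problem on the frontend look identical here.

```python
import re

# Matches the reason mnemonic that Python's ssl module puts in brackets,
# e.g. "[SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN]".
_REASON_RE = re.compile(r"\[SSL: ([A-Z0-9_]+)\]")


def classify_ssl_error(message):
    """Guess which side rejected the handshake from an SSLError message.

    Returns "peer" when the text names a TLS alert (assumed to have been
    received from the remote end), "local" when local certificate
    verification failed, and "unknown" otherwise.
    """
    match = _REASON_RE.search(message)
    if match is None:
        return "unknown"
    reason = match.group(1)
    if "_ALERT_" in reason:
        return "peer"
    if reason == "CERTIFICATE_VERIFY_FAILED":
        return "local"
    return "unknown"


if __name__ == "__main__":
    sample = ("[SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] "
              "sslv3 alert certificate unknown (_ssl.c:877)")
    print(classify_ssl_error(sample))
```

For the message reported in this issue the helper returns "peer", which is consistent with the frontend rejecting the client certificate, whatever the underlying reason.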

vingar commented 2 weeks ago

What can be the corresponding error messages for such SSL errors in the dCache domain logs or access logs on the door and frontend?

kofemann commented 1 week ago

As we understand from the discussion at Tier-1 support meeting, the certificate directory temporarily becomes empty. Can you configure that?

DmitryLitvintsev commented 1 week ago

Yes, also, may I ask you how you update certificates? On our system we have never seen any issues.

# ls -al /etc/grid-security/
total 7616
drwxr-xr-x    5 root root    4096 Jun 20 12:14 .
drwxr-xr-x. 141 root root   12288 Jun 25 08:01 ..
lrwxrwxrwx    1 root root      21 Jun 20 11:44 certificates -> certificates-1.119NEW
drwxr-xr-x    2 root root   40960 Jun 25 11:45 certificates-1.119NEW
...

/etc/grid-security/certificates is a soft link to /etc/grid-security/certificates-1.119NEW. The CRLs are updated by cron:

10 * * * *    root    [ ! -f /var/lock/subsys/osg-update-certs-cron ] ||  /usr/sbin/osg-update-certs --random-sleep 2700 --called-from-cron > /dev/null 2>&1

This is provided by the osg-ca-scripts package. It works as follows: it creates a new directory, fills it, and then moves the symbolic link to point to it; finally it removes the old directory, which is no longer visible to applications. It has never failed to work with dCache.
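The update pattern described above (fill a fresh directory, then repoint the symlink) can be sketched as follows. This is not the osg-update-certs implementation; `publish_directory` and the directory names in the demo are hypothetical, and the sketch assumes a POSIX filesystem where renaming one symlink over another is atomic.

```python
import os
import tempfile


def publish_directory(new_dir, link_path):
    """Atomically repoint the symlink link_path at new_dir.

    A temporary symlink is created next to link_path and then renamed
    over it. On POSIX the rename is atomic, so a reader resolves either
    the old target or the new one, never a missing directory.
    """
    link_parent = os.path.dirname(os.path.abspath(link_path))
    tmp_link = os.path.join(
        link_parent, ".{}.tmp".format(os.path.basename(link_path)))
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        old = os.path.join(root, "certificates-old")
        new = os.path.join(root, "certificates-new")
        link = os.path.join(root, "certificates")
        os.mkdir(old)
        os.mkdir(new)
        # Fill the new directory completely before publishing it.
        open(os.path.join(new, "example.r0"), "w").close()
        os.symlink(old, link)
        publish_directory(new, link)
        print(os.listdir(link))
```

The swap only works if the published path is already a symlink; if it is a real directory, the rename fails instead of replacing it.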

vingar commented 1 week ago

We have some updates:

cfgamboa commented 1 week ago

> As we understand from the discussion at Tier-1 support meeting, the certificate directory temporarily becomes empty. Can you configure that?

To clarify, the certificate directory was not observed to be empty at the time of the issue or afterwards. In the meeting we were discussing possible scenarios in which CRL availability is compromised.
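One way to check for such a scenario is to sample the directory while the CRLs are being refreshed and look for moments when CRL files are missing or zero-length. The following is a minimal sketch under stated assumptions: `crl_snapshot` and `watch` are hypothetical helpers, CRL files are assumed to use the `.r0` suffix, and the sketch says nothing about when dCache itself re-reads the directory.

```python
import glob
import os
import time


def crl_snapshot(cert_dir):
    """Summarise the CRL files (*.r0) currently visible in cert_dir."""
    sizes = []
    mtimes = []
    for path in glob.glob(os.path.join(cert_dir, "*.r0")):
        try:
            info = os.stat(path)
        except FileNotFoundError:
            # The file vanished between listing and stat; skip it.
            continue
        sizes.append(info.st_size)
        mtimes.append(info.st_mtime)
    return {
        "crl_count": len(sizes),
        "empty_crls": sum(1 for size in sizes if size == 0),
        "newest_age_s": (time.time() - max(mtimes)) if mtimes else None,
    }


def watch(cert_dir, interval_s=1.0, samples=10):
    """Print a timestamped snapshot every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), crl_snapshot(cert_dir))
        time.sleep(interval_s)
```

Running `watch("/etc/grid-security/certificates", interval_s=1.0, samples=3600)` across a scheduled CRL refresh and comparing the timestamps with the FTS failures would show whether the two coincide.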

vingar commented 1 week ago

> Yes, also, may I ask you how you update certificates? On our system we have never seen any issues.
>
> # ls -al /etc/grid-security/
> total 7616
> drwxr-xr-x    5 root root    4096 Jun 20 12:14 .
> drwxr-xr-x. 141 root root   12288 Jun 25 08:01 ..
> lrwxrwxrwx    1 root root      21 Jun 20 11:44 certificates -> certificates-1.119NEW
> drwxr-xr-x    2 root root   40960 Jun 25 11:45 certificates-1.119NEW
> ...
>
> /etc/grid-security/certificates is a soft link to /etc/grid-security/certificates-1.119NEW. The CRLs are updated by cron:
>
> 10 * * * *    root    [ ! -f /var/lock/subsys/osg-update-certs-cron ] ||  /usr/sbin/osg-update-certs --random-sleep 2700 --called-from-cron > /dev/null 2>&1
>
> This is provided by the osg-ca-scripts package. It works as follows: it creates a new directory, fills it, and then moves the symbolic link to point to it; finally it removes the old directory, which is no longer visible to applications. It has never failed to work with dCache.

We do not use a symlink for /etc/grid-security/certificates; fetch-crl runs directly against that directory. Thanks for sharing your configuration.