dCache / dcache

dCache - a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods
https://dcache.org
284 stars 135 forks source link

doors timeout with OCSP #2353

Open calestyo opened 8 years ago

calestyo commented 8 years ago

As discussed just before at the workshop, the CERN OCSP server apparently broke, which lead recent enough dcache version to eventually fail their (at least) transfers, with the default of dcache.authn.ocsp-mode=IGNORE.

We saw these:

016-04-12 07:35:39+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:35:51+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:35:56+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:36:03+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:36:52+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:37:02+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:37:05+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:37:07+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:37:35+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:37:39+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:38:18+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:38:19+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:38:24+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 07:38:25+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Response has expired on: Tue Apr 12 06:35:23 CEST 2016
2016-04-12 08:54:09+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Server returned HTTP response code: 503 for URL: http://ocsp.cern.ch/ocsp/MEswSTBHMEUwQzAJB
2016-04-12 08:54:12+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Server returned HTTP response code: 503 for URL: http://ocsp.cern.ch/ocsp/MEswSTBHMEUwQzAJB
2016-04-12 08:54:12+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Server returned HTTP response code: 503 for URL: http://ocsp.cern.ch/ocsp/MEswSTBHMEUwQzAJB
2016-04-12 08:54:15+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Server returned HTTP response code: 503 for URL: http://ocsp.cern.ch/ocsp/MEswSTBHMEUwQzAJB
2016-04-12 08:54:18+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: Server returned HTTP response code: 503 for URL: http://ocsp.cern.ch/ocsp/MEswSTBHMEUwQzAJB

and many more of these:

2016-04-12 08:55:42+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:55:53+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:56:03+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:56:13+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:56:23+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:56:33+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2016-04-12 08:56:43+02:00 (srm_lcg-lrz-srm) [] Problem loading OCSP from http://ocsp.cern.ch/ocsp: connect timed out
2

Maybe it's not because of something with the ignoring doesn't work per se, but as suspected by Gerd, rather because of the SRM overloads with threads trying to process the OCSP requests. That seems to be confirmed by NNN threads being killed when I just restarted our SRM:

2016-04-12 10:18:30+02:00 (System) [] Thread: jetty-srm-83181 [A--] (5) BLOCKED
2016-04-12 10:18:30+02:00 (System) []     eu.emi.security.authn.x509.helpers.pkipath.AbstractValidator.getTrustedIssuers(AbstractValidator.java:169)
2016-04-12 10:18:30+02:00 (System) []     org.dcache.util.CachingCertificateValidator.getTrustedIssuers(CachingCertificateValidator.java:138)
2016-04-12 10:18:30+02:00 (System) []     eu.emi.security.authn.x509.helpers.ssl.SSLTrustManager.getAcceptedIssuers(SSLTrustManager.java:84)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.AbstractTrustManagerWrapper.getAcceptedIssuers(SSLContextImpl.java:909)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.ServerHandshaker.clientHello(ServerHandshaker.java:961)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:221)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
2016-04-12 10:18:30+02:00 (System) []     java.security.AccessController.doPrivileged(Native Method)
2016-04-12 10:18:30+02:00 (System) []     sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.fill(SslConnection.java:630)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.server.HttpConnection.fillRequestBuffer(HttpConnection.java:313)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:223)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:197)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
2016-04-12 10:18:30+02:00 (System) []     org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
2016-04-12 10:18:30+02:00 (System) []     java.lang.Thread.run(Thread.java:745)

Setting the option to IGNORE works around the issue.
gbehrmann commented 8 years ago

You probably meant dcache.authn.ocsp-mode defaults to IF_AVAILABLE.

As far as I can see in the code, the error is actually cached, so the repeated errors being logged will mostly be a cached eu-emi/canl-java error. The cache lifetime uses the internal default of 1 hour.

The reason why the SRM fails anyway is probably that initially and whenever the cache expires, all requests will block on the OCSP lookup until the TCP connect expires. Depending on the default, enough requests may pile up to cause queues to overflow and clients to time out. This is of course just a theory - I need to look at a heap dump to verify if the error is actually cached.

calestyo commented 8 years ago

You probably meant dcache.authn.ocsp-mode defaults to IF_AVAILABLE.

oopps... yes.. sure..

gbehrmann commented 8 years ago

Default changed in 9771c6fc3d6fc616b2eeaa3a2831580fad3cd5c9, but I will leave this issue open until we have submitted a request to caNl to implement background refresh.

calestyo commented 7 years ago

Anything new here? :-)