Open calestyo opened 8 years ago
You probably meant dcache.authn.ocsp-mode defaults to IF_AVAILABLE.
As far as I can see in the code, the error is actually cached, so the repeated errors being logged will mostly be a cached eu-emi/canl-java error. The cache lifetime uses the internal default of 1 hour.
The reason why the SRM fails anyway is probably that, initially and whenever the cache entry expires, all requests block on the OCSP lookup until the TCP connect times out. Depending on the default timeout, enough requests may pile up to cause queues to overflow and clients to time out. This is of course just a theory; I need to look at a heap dump to verify whether the error is actually cached.
You probably meant dcache.authn.ocsp-mode defaults to IF_AVAILABLE.
oopps... yes.. sure..
Default changed in 9771c6fc3d6fc616b2eeaa3a2831580fad3cd5c9, but I will leave this issue open until we have submitted a request to caNl to implement background refresh.
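For illustration, here is a minimal sketch of what "background refresh" could look like: once a value has been loaded, callers always get the cached (possibly stale) result immediately, and at most one refresh runs on a separate executor. The class `BackgroundRefreshCache` and its parameters are hypothetical and not part of caNl or dCache; the very first lookup would still block, and a failed refresh simply keeps the stale value.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical helper for illustration only; not an API of caNl or dCache. */
final class BackgroundRefreshCache<V> {
    private final Supplier<V> loader;      // the slow lookup, e.g. an OCSP query
    private final long ttlNanos;           // age after which a refresh is triggered
    private final Executor executor;       // where refreshes run
    private final AtomicBoolean refreshing = new AtomicBoolean(false);

    private volatile V value;
    private volatile long loadedAt;
    private volatile boolean loaded;

    BackgroundRefreshCache(Supplier<V> loader, long ttlMillis, Executor executor) {
        this.loader = loader;
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
        this.executor = executor;
    }

    V get() {
        if (!loaded) {
            // Only the very first call blocks on the loader.
            synchronized (this) {
                if (!loaded) {
                    store(loader.get());
                }
            }
            return value;
        }
        boolean expired = System.nanoTime() - loadedAt >= ttlNanos;
        // Schedule at most one refresh at a time; never block the caller.
        if (expired && refreshing.compareAndSet(false, true)) {
            CompletableFuture.runAsync(() -> {
                try {
                    store(loader.get());
                } finally {
                    // If the loader throws, the stale value is kept.
                    refreshing.set(false);
                }
            }, executor);
        }
        return value; // possibly stale, but returned immediately
    }

    private void store(V v) {
        value = v;
        loadedAt = System.nanoTime();
        loaded = true;
    }
}
```

With something along these lines, an unreachable OCSP responder would tie up one refresh thread rather than every request thread.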
Anything new here? :-)
As discussed just before at the workshop, the CERN OCSP server apparently broke, which led recent enough dCache versions to eventually fail (at least) their transfers, even with the default of dcache.authn.ocsp-mode=IGNORE.
We saw these:
and many more of these:
Maybe it's not that the ignoring doesn't work per se but rather, as suspected by Gerd, that the SRM gets overloaded with threads trying to process the OCSP requests. That seems to be confirmed by NNN threads being killed when I just restarted our SRM: