OxalisCommunity / oxalis

Oxalis - PEPPOL Access Point open source implementation - Core component
Other
129 stars 91 forks source link

Lookup fails with "OCSP server is currently too busy" #460

Closed HarshaSuranjith closed 4 years ago

HarshaSuranjith commented 4 years ago

Number of requests are currently failing both in test and production due to following error.

no.difi.oxalis.api.lang.OxalisTransmissionException: OCSP server is currently too busy.
    at no.difi.oxalis.outbound.lookup.CachedLookupService.lookup(CachedLookupService.java:73)
    at no.difi.oxalis.api.lookup.LookupService.lookup(LookupService.java:57)
    at no.difi.oxalis.outbound.transmission.TransmissionRequestBuilder.build(TransmissionRequestBuilder.java:208)
    at com.compello.oxalisout.service.OxalisOutboundMessageDispatcher.dispatch(OxalisOutboundMessageDispatcher.java:67)
    at com.compello.oxalisout.OxalisOut.processFile(OxalisOut.java:103)
    at com.compello.oxalisout.OxalisOut.main(OxalisOut.java:52)
Caused by: no.difi.vefa.peppol.security.lang.PeppolSecurityException: OCSP server is currently too busy.
    at no.difi.vefa.peppol.security.util.DifiCertificateValidator.validate(DifiCertificateValidator.java:64)
    at no.difi.oxalis.commons.mode.OxalisCertificateValidator.perform(OxalisCertificateValidator.java:48)
    at no.difi.oxalis.commons.mode.OxalisCertificateValidator.validate(OxalisCertificateValidator.java:33)
    at no.difi.vefa.peppol.lookup.LookupClient.getEndpoint(LookupClient.java:105)
    at no.difi.vefa.peppol.lookup.LookupClient.getEndpoint(LookupClient.java:115)
    at no.difi.oxalis.outbound.lookup.CachedLookupService.load(CachedLookupService.java:79)
    at no.difi.oxalis.outbound.lookup.CachedLookupService.load(CachedLookupService.java:46)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
    at no.difi.oxalis.outbound.lookup.CachedLookupService.lookup(CachedLookupService.java:71)
    ... 5 common frames omitted
Caused by: no.difi.certvalidator.api.FailedValidationException: OCSP server is currently too busy.
    at no.difi.certvalidator.rule.OCSPRule.validate(OCSPRule.java:41)
    at no.difi.certvalidator.rule.AbstractRule.validate(AbstractRule.java:24)
    at no.difi.certvalidator.rule.HandleErrorRule.validate(HandleErrorRule.java:44)
    at no.difi.certvalidator.rule.AbstractRule.validate(AbstractRule.java:17)
    at no.difi.certvalidator.structure.AndJunction.validate(AndJunction.java:29)
    at no.difi.certvalidator.structure.AbstractJunction.validate(AbstractJunction.java:36)
    at no.difi.certvalidator.util.CachedValidatorRule.load(CachedValidatorRule.java:43)
    at no.difi.certvalidator.util.CachedValidatorRule.load(CachedValidatorRule.java:13)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
    at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4966)
    at no.difi.certvalidator.util.CachedValidatorRule.validate(CachedValidatorRule.java:30)
    at no.difi.certvalidator.util.CachedValidatorRule.validate(CachedValidatorRule.java:35)
    at no.difi.certvalidator.structure.AndJunction.validate(AndJunction.java:29)
    at no.difi.certvalidator.structure.AndJunction.validate(AndJunction.java:29)
    at no.difi.certvalidator.structure.AbstractJunction.validate(AbstractJunction.java:36)
    at no.difi.certvalidator.util.CachedValidatorRule.load(CachedValidatorRule.java:43)
    at no.difi.certvalidator.util.CachedValidatorRule.load(CachedValidatorRule.java:13)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
    at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4966)
    at no.difi.certvalidator.util.CachedValidatorRule.validate(CachedValidatorRule.java:30)
    at no.difi.certvalidator.util.CachedValidatorRule.validate(CachedValidatorRule.java:35)
    at no.difi.certvalidator.ValidatorGroup.validate(ValidatorGroup.java:79)
    at no.difi.certvalidator.ValidatorGroup.validate(ValidatorGroup.java:70)
    at no.difi.vefa.peppol.security.util.DifiCertificateValidator.validate(DifiCertificateValidator.java:62)
    ... 19 common frames omitted
Caused by: net.klakegg.pkix.ocsp.OcspException: OCSP server is currently too busy.
    at net.klakegg.pkix.ocsp.OcspResponse.verifyResponse(OcspResponse.java:36)
    at net.klakegg.pkix.ocsp.OcspClient.verify(OcspClient.java:65)
    at net.klakegg.pkix.ocsp.OcspClient.verify(OcspClient.java:44)
    at no.difi.certvalidator.rule.OCSPRule.validate(OCSPRule.java:34)
    ... 54 common frames omitted
senikk commented 4 years ago

The same issue is also reported at https://github.com/difi/Oxalis-AS4/issues/102

this issue is most likely related to DigiCerts OCSP server experiencing capacity problems. We have reached out to DigiCert to inform us about any problems on their side.

I did some manual OCSP querying and the service seems to be back online now, can you verify that you are able to send? I would suggest you inform your software vendor about the problems, maybe they could add a fallback mechanism for CRL bases revocation check when OCSP is down.

Its by the way the same line of code where its logged lots of problems with "not registered in SML"

OysteinLq commented 4 years ago

Further feedback from OpenPEPPOL:

We are in contact with DIGICERT who until now still confirms there is no problem with the OSCP Other APs are doing OSCP without problems Again Oxalis needs to create a fallback using CRL based lookups if OSCP is unavailable. This should be suggested to them and we will do that.

For the time being, we're not doing anything with these problems, except we've given some example timestamps of when it's failed for us.

kimlaurio commented 4 years ago

Well, obviously, it would be nice to have some kind of workaround/mitigation for this, perhaps with some configurable TTL, but the security implications are beyond me. Maybe all we can do is to wait for the OCSP servers to become better?

In the mean time, some reading for the curious ones: https://news.ycombinator.com/item?id=14375960 and the http://www.ccs.neu.edu/home/cbw/static/pdf/larisch-oakland17.pdf

senikk commented 4 years ago

Failures so far today on our accesspoint against all sent with latest version of Oxalis: "OCSP server is currently too busy" (CachedLookupService.java:73): 2.81% "not registered in SML" (CachedLookupService.java:73): 1.83% Other errors: 0.78%

kimlaurio commented 4 years ago

"OCSP server is currently too busy": 7% other: 0.08%

phax commented 4 years ago

Hi guys (sorry for cross-posting),

on behalf of OpenPEPPOL we've looked at this. We came to the conclusion that this must be a pure Oxalis issue and NOT a Digicert issue. Our assumption is, that Oxalis performs a remote query to the OCSP server for every transmission, and if many documents are sent in parallel that may lead to an overloaded OCSP server.

Here are some suggestions on what you can do to ease the pain:

BR, Philip

senikk commented 4 years ago

Looking at older logs I found similar problems in the period between 13.4.2019-15.4.2019 but nothing before and I can't see these errors before 29.2.2020 this year.

phax commented 4 years ago

Well, we assume that the OCSP addition is quite new to Oxalis, so maybe old AS2 solutions were simply not using this?

And this is the first real "beginning of the month" with AS4, where the majority of providers use the latest Oxalis version, so - at least to us - it is understandable why the issue didn't occur previously.

artjomsk commented 4 years ago

We've seen similar issues before (~ half year ago). https://github.com/difi/certvalidator is used which uses some @klakegg private repo

<!-- OCSP -->
<dependency>
     <groupId>net.klakegg.pkix</groupId>
     <artifactId>pkix-ocsp</artifactId>
     <version>0.9.0</version>
</dependency>
klakegg commented 4 years ago

@phax The certificate validation has been part of Oxalis since verison 4.0.0 RC2. I guess "quite new" is rather relative...

@artjomsk I guess public repos provided by @klakegg are just as private as those public repos provided by @phax.

phax commented 4 years ago

@klakegg do you mind to comment what you changed?

klakegg commented 4 years ago

@phax Upgraded to use VEFA Peppol 1.1.4.

phax commented 4 years ago

@klakegg what does that mean?

HarshaSuranjith commented 4 years ago

@klakegg how does upgrading dependencies solve the issue ?
does the new version implements the chaching suggested as a solution ?

kimlaurio commented 4 years ago

https://github.com/difi/vefa-peppol/releases/tag/1.1.4

HarshaSuranjith commented 4 years ago

@kimlaurio Thanks

senikk commented 4 years ago

I wonder what could be the real cause of this, it last for three days just like the last time in april last year. Have not seen any of these errors today.

aksamit commented 4 years ago

Dear all,

It seems that DigiCerts initial response to this matter was inaccurate, they have updated their investigation with the following:

It was discovered that a server within the cluster was acting inappropriately when receiving a verification check. It was removed from rotation, and the service is now working at normal capacity with no errors.

This explains why only a percentage of the requests were not able to be validated and why we had problems recreating the scenario in other environments.

We will follow up with DigiCert to ensure they have proper monitoring in place to easier identify situations like this if they would ever occur again.

Nevertheless, I think the following recommendations should be considered to make the implementation less sensitive for intermittent outages in the OCSP service. Please keep in mind that from a security perspective there is no actual impact on caching the results for a limited amount of time. Neither CRL or OCSP is updated instantly the moment a certificate is revoked. The process can take up to 12 - 24 hours from the moment a certificate is revoked until CRL or OCSP is updated with this information *(this has been communicated to me from DigiCert officials).

Here are some suggestions on what you can do to ease the pain:

  • Please cache OCSP results per X509 certificate. Check results SHOULD be cached up to 6 hours
  • Implement a fallback to CRL. CRL check results SHOULD be cached up to 24 hours.
  • If both fail, retry internally with a certain delay between the tries
  • Try sending less documents in parallel using Oxalis

BR, Philip

I hope this will shed some light on why this problem suddenly arose and as quietly disappeared without any real explanation.

Best regards,

Mikael Aksamit 2nd line support, OpenPEPPOL Operations