eu-emi / canl-java

Common authentication library (caNl), Java version
Other
14 stars 5 forks source link

Lock contention on OpensslCertChainValidator#validate #80

Closed gbehrmann closed 8 years ago

gbehrmann commented 8 years ago

We recently ported dCache to CANL and the first deployments are happening now. After NDGF upgraded we now see lock contention on OpensslCertChainValidator#validate:

"jetty-srm-8778" #8778 prio=5 os_prio=0 tid=0x00007fbe9c20b800 nid=0x4249 waiting for monitor entry [0x00007fbe0e3a1000]
java.lang.Thread.State: BLOCKED (on object monitor)
    at eu.emi.security.authn.x509.impl.OpensslCertChainValidator.validate(OpensslCertChainValidator.java:219)
    - waiting to lock <0x00000006c115d388> (a eu.emi.security.authn.x509.impl.OpensslCertChainValidator)
    at eu.emi.security.authn.x509.helpers.ssl.SSLTrustManager.checkIfTrusted(SSLTrustManager.java:66)
    at eu.emi.security.authn.x509.helpers.ssl.SSLTrustManager.checkClientTrusted(SSLTrustManager.java:51)
    at sun.security.ssl.AbstractTrustManagerWrapper.checkClientTrusted(SSLContextImpl.java:929)
    at sun.security.ssl.ServerHandshaker.clientCertificate(ServerHandshaker.java:1869)
    at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:230)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:919)
    at sun.security.ssl.Handshaker$1.run(Handshaker.java:916)
    at java.security.AccessController.doPrivileged(Native Method)
    at sun.security.ssl.Handshaker$DelegatedTask.run(Handshaker.java:1369)
    - locked <0x00000007ba473230> (a sun.security.ssl.SSLEngineImpl)
    at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.fill(SslConnection.java:613)
    - locked <0x00000007ba474fd0> (a org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint)
    at org.eclipse.jetty.server.HttpConnection.fillRequestBuffer(HttpConnection.java:315)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:223)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:192)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:745)

We follow the advice of the manual: "Try to use as few instances as possible, typically one shared instance is enough. Of course all validators are thread safe.". Thus we only have one OpensslCertChainValidator per service and this becomes a contention point. We don't want to resort to one validator per connection as the overhead of reloading certificates all the time is bad too.

Is BCCertPathValidator not thread safe or why does OpensslCertChainValidator need to use such coarse grained locking? This looks to be a major show stopper for us.

golbi commented 8 years ago

No, the BCCertPathValidator should be thread safe as it is stateless. AFAIR the locking was done to have the correct behaviour in case of validator configuration changes. As nobody asked for improvements this was not optimized.

I think it shouldn't be that hard to introduce some RW locking. I'll try to look into at after finding few free cycles.

gbehrmann commented 8 years ago

We would very much appreciate that. Thank you.

golbi commented 8 years ago

Should be much better now. Is it possible that you test the current HEAD in your environment before an official release?

gbehrmann commented 8 years ago

Thanks! Unfortunately I cannot currently test it in any production like environment as we are limitted to CANL 1.3 due to BC dependencies - we are awaiting a release of opensciencegrid/privilege-xacml#9.

golbi commented 8 years ago

OK, as this is the most important thing to be included I think the release can wait. Please let me know after testing if you are happy with the current implementation.

andreaceccanti commented 8 years ago

Hi, Is this going to be backported to CANL 1.3 as well?

golbi commented 8 years ago

Hi,

No, pre 2 canl is in pure maintenance mode. However if you want to backport some patches - go on.

andreaceccanti commented 8 years ago

And what about 2.1.x?
EPEL 7 is still bc 1.50. Is this change already ported? If not, is in the plans?

Thanks

gbehrmann commented 8 years ago

For 2.1.x a backport would be useful for us (dCache) too, precisely due to the BC dependency and compatibility with the VOMS libraries.

golbi commented 8 years ago

Well, BC is an issue as its releases are not backwards compatible (neither ABI or API wise). So using it from system RPM is a big problem on its own: assuming that everything in a distro is updated is unrealistic and so old versions (insecure...) are still used and installed.

That said I can't offer releases for arbitrary version of BC. Next week I'll try to find some time to release 2.2.0. I'll check if backport to 2.1/BC1.50 is low effort and I can release this version if it is low hanging fruit. In each case - in Maven central. 1.x is no go; I'll rather work on 1.54 BC integration.

golbi commented 8 years ago

OK, the version 2.1.2 is being deployed. The backporting was not very conflicting however please treat it as an exception rather then a rule. Depending on a system BC is very problematic, I think it is needless to write why.

Development of canl 2.3.0 based on BC 1.54 is started.

andreaceccanti commented 8 years ago

Thanks!

gbehrmann commented 8 years ago

I absolutely agree with you on packaging of BC. In fact, I think packaging Java libraries as individual distro packages (DEB or RPM) is fundamentally flawed as most Java libraries do not keep the API sufficiently stable for that to work. For this reason we ship all dependencies as part of dCache - and for this reason dCache isn't available in e.g. Debian as they would force us to stop doing this.

However, as long as other libraries strive for inclusion in particular distros, we are bound to the version of CANL and BC they use. Eg. if the VOMS and Argus libraries use BC 1.50 and CANL 2.1, then we are sort of forced to do the same in dCache. Even worse, for other libraries also depending on BC we have to use old versions too, e.g. Apache SSHD. I don't see a better solution (short of maintaining our own fork of libraries).

golbi commented 8 years ago

@gbehrmann Well, if you use VOMS&Argus and system CANL through them, then of course there is no much freedom on your side. However push @andreaceccanti and co to go your way ;-) In UNICORE canl is included and we are free of this mess at all.

gbehrmann commented 8 years ago

We don't use system CANL or system BC (they are included directly in dCache), but the lack of VOMS and Argus versions for CANL 2.2 makes things difficult. 2.1 -> 2.2 and 1.50 -> 1.52 may be small enough that we can just force the newer versions through Maven, but that's not a general solution.

Just getting away from JGlobus, CANL 1.3 and BC 1.46 will already be a huge step forward, so that's our current goal :-)

andreaceccanti commented 8 years ago

@gbehrmann up to now we just had, as a requirement, to support the latest BC in EPEL 6 and 7. No problem in targeting also CANL 2.2, we'll have a 3.2 branch targeting that version.

I agree with you and @golbi that getting the latest BC version and embedding dependencies is the wiser approach, but we also have the clients distributed through EPEL to support...

Thanks

gbehrmann commented 8 years ago

@andreaceccanti, that would be excellent. Thanks.