clarin-eric / linkchecker

GNU General Public License v3.0

Investigation on SSLHandshakeException #52

Closed wowasa closed 1 year ago

wowasa commented 1 year ago

The status table shows SSLHandshakeExceptions that do not occur when the same URLs are requested in a browser (e.g. for https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/bitstream/handle/20.500.11752/OPEN-531/derivational_db.zip?sequence=1 or https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/bitstream/handle/20.500.11752/OPEN-530/IT-TB_PML_analytical-tectogrammatical.zip?sequence=1).

Hence the question is how we can avoid these exceptions and get behavior similar to the browser's.

twagoo commented 1 year ago

If I'm not mistaken, SSLHandshakeExceptions are usually caused by a missing certificate. As far as I know, Java uses its own keystore for the certificates so it all depends on how up to date that is. I'm not aware of any method to keep it up to date with 'the browser' - keeping in mind that all browsers won't necessarily behave the same way either. I think that there are two things we can do:

  1. Monitor the occurrence of such errors and make sure that we get up to date with the latest Java package available to Alpine if they seem to go up.
  2. Do not classify (all?) cases of SSLHandshakeException as a broken link. Something to discuss in an upcoming meeting, I would say.
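
To judge how up to date the Java trust store in a given image is (point 1 above), one option is to list what the JVM trusts by default. This is only a minimal sketch using standard JDK classes; the class and method names (DefaultTrustStoreInfo, defaultTrustedIssuers) are made up for illustration and are not part of the link checker.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.security.cert.X509Certificate;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class DefaultTrustStoreInfo {

    /** Returns the CA certificates the running JVM trusts by default. */
    static X509Certificate[] defaultTrustedIssuers() {
        try {
            TrustManagerFactory tmf =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // Passing null makes the factory load the JVM's default trust store.
            tmf.init((KeyStore) null);
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return ((X509TrustManager) tm).getAcceptedIssuers();
                }
            }
            return new X509Certificate[0];
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not load default trust store", e);
        }
    }

    public static void main(String[] args) {
        X509Certificate[] issuers = defaultTrustedIssuers();
        System.out.println("Java version: " + System.getProperty("java.version"));
        System.out.println("Trusted CA certificates: " + issuers.length);
        for (X509Certificate cert : issuers) {
            System.out.println("  " + cert.getSubjectX500Principal().getName()
                    + " (expires " + cert.getNotAfter() + ")");
        }
    }
}
```

Running this inside the container before and after a Java package update would show whether the set of trusted CAs actually changed.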

twagoo commented 1 year ago

Note: in general it's worth doing a manual check via https://www.ssllabs.com/ssltest if such an issue is encountered. In this case (report) there are chain issues with the certificate so it's not bad to highlight that somehow.

It also comes down to the fundamental question of what our 'benchmark' is. Is a link 'not broken' IFF it works in a browser? Would be good to make this explicit to ourselves and the users.

twagoo commented 1 year ago

One more comment: for documentation purposes maybe you could attach a few examples of exceptions with some level of detail. Usually a cause is described in the exception, which would help understand the underlying issue.
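
A minimal sketch of how such detail could be collected, assuming we simply walk the cause chain of the caught exception. The class name CauseChain and the hand-built exception in main are illustrative only; in the link checker the exception would come from a real failed request.

```java
import java.security.cert.CertPathBuilderException;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLHandshakeException;

public class CauseChain {

    /** Returns one "ClassName: message" entry per throwable, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> lines = new ArrayList<>();
        // Cap the depth to guard against cyclic cause chains.
        for (int depth = 0; t != null && depth < 20; depth++) {
            lines.add(t.getClass().getName() + ": " + t.getMessage());
            t = t.getCause();
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hand-built exception standing in for a real failed request.
        SSLHandshakeException e = new SSLHandshakeException("PKIX path building failed");
        e.initCause(new CertPathBuilderException(
                "unable to find valid certification path to requested target"));
        describe(e).forEach(System.out::println);
    }
}
```

Storing the innermost entry next to the link status would already say whether the problem is a certification path, an expired certificate, or something else.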

wowasa commented 1 year ago

Storm Crawler sets the property http.trust.everything=true by default. As I understand it, we shouldn't see this kind of exception.

twagoo commented 1 year ago

> Storm Crawler has a property http.trust.everything=true by default. As I understand it we shouldn't see this kind of exception

Interesting. Can you post a full stack trace?

wowasa commented 1 year ago

PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

twagoo commented 1 year ago

That confirms the certification path issue. A full stack trace would help us understand where the problem originates.
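
If we do decide not to treat every SSLHandshakeException as a broken link, the certificate chain cases could be separated out roughly like this. It is only a sketch: the categories in Kind and the class name HandshakeFailureKind are hypothetical and not the link checker's actual status model. It relies on the JDK's certificate exceptions (SunCertPathBuilderException is a subclass of java.security.cert.CertPathBuilderException) and falls back to the "PKIX path" message text.

```java
import java.security.cert.CertPathBuilderException;
import java.security.cert.CertPathValidatorException;
import java.security.cert.CertificateException;
import javax.net.ssl.SSLHandshakeException;

public class HandshakeFailureKind {

    /** Hypothetical categories, for illustration only. */
    enum Kind { CERTIFICATE_CHAIN, OTHER_HANDSHAKE, NOT_HANDSHAKE }

    static Kind classify(Throwable t) {
        if (!(t instanceof SSLHandshakeException)) {
            return Kind.NOT_HANDSHAKE;
        }
        // Look for a certificate-related exception anywhere in the cause chain.
        Throwable c = t.getCause();
        for (int depth = 0; c != null && depth < 20; depth++) {
            if (c instanceof CertificateException
                    || c instanceof CertPathBuilderException
                    || c instanceof CertPathValidatorException) {
                return Kind.CERTIFICATE_CHAIN;
            }
            c = c.getCause();
        }
        // Fallback: some layers only keep the message, not the cause.
        String msg = t.getMessage();
        if (msg != null && msg.contains("PKIX path")) {
            return Kind.CERTIFICATE_CHAIN;
        }
        return Kind.OTHER_HANDSHAKE;
    }

    public static void main(String[] args) {
        SSLHandshakeException e = new SSLHandshakeException(
                "PKIX path building failed: unable to find valid certification path to requested target");
        System.out.println(classify(e));
    }
}
```

A link in the CERTIFICATE_CHAIN category could then be reported as reachable but with a certificate problem, rather than as broken.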

wowasa commented 1 year ago

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:371)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:314)
    at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:309)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473)
    at java.base/sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369)
    at java.base/sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:396)
    at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:480)
    at java.base/sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:458)
    at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:201)
    at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
    at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1505)
    at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1420)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:455)
    at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:426)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:221)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:165)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:140)
    at com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol.getProtocolOutput(HttpProtocol.java:245)
    at eu.clarin.linkchecker.bolt.MetricsFetcherBolt$FetcherThread.run(MetricsFetcherBolt.java:579)

twagoo commented 1 year ago

Looks like support for the http.trust.everything property is implemented for OkHttp:

https://github.com/DigitalPebble/storm-crawler/blob/e473dea993e6d3885deac60c7d12fbadc4b2992e/core/src/main/java/com/digitalpebble/stormcrawler/protocol/okhttp/HttpProtocol.java

But not for Apache HTTP client.

See search results
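
For reference, "trust everything" generally boils down to a trust manager that accepts any certificate chain. The following is a plain JDK sketch of that idea, not the Storm Crawler code; TrustAllSketch and trustAllContext are names made up here. Note that it switches off server authentication entirely, so it only makes sense for something like link checking, where we care about reachability rather than confidentiality.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSketch {

    /** Trust manager that accepts every certificate chain without checking it. */
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: never rejects.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: never rejects, so no PKIX path building happens.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    /** Builds an SSLContext whose only trust manager is TRUST_ALL. */
    static SSLContext trustAllContext() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, new TrustManager[] { TRUST_ALL }, null);
            return ctx;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create trust-all SSLContext", e);
        }
    }

    public static void main(String[] args) {
        SSLContext ctx = trustAllContext();
        System.out.println("Created context for protocol: " + ctx.getProtocol());
    }
}
```

With Apache HttpClient 4.x, a context like this would have to be wired into the SSLConnectionSocketFactory that shows up in the stack trace, typically together with a permissive hostname verifier.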

twagoo commented 1 year ago

Decision (from minutes doc):

wowasa commented 1 year ago

One addition: we decided to test com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol, which ignores SSL issues. If it doesn't have any side effects on link checking (e.g. on returned metadata), it is only a configuration issue.

wowasa commented 1 year ago

Implemented in v. 3.0.4.