cognitect-labs / aws-api

AWS, data driven
Apache License 2.0

Intermittent `SSLHandshakeException` when invoking the `DeleteMessage` SQS api #213

Closed dannyfreeman closed 2 years ago

dannyfreeman commented 2 years ago

Dependencies

{:deps {com.cognitect.aws/api       {:mvn/version "0.8.539"}
        com.cognitect.aws/endpoints {:mvn/version "1.1.12.181"}
        com.cognitect.aws/sqs        {:mvn/version "814.2.1053.0"}}}

Description with failing test case

When calling the DeleteMessage API in one of our deployed services, we occasionally get a :cognitect.anomalies/fault error. It does not happen every time, maybe once in every couple hundred requests.

This is the code; there is not much to it:

            (aws/invoke sqs {:op      :DeleteMessage
                             :request {:QueueUrl      queue-url
                                       :ReceiptHandle receipt}})

After I posted this issue in the #aws channel on the Clojurians Slack and got some advice from Ghadi, we started calling invoke with this :retriable? argument:

            ;; `retry` is an alias for the cognitect.aws.retry namespace
            (aws/invoke sqs {:op         :DeleteMessage
                             :request    {:QueueUrl      queue-url
                                          :ReceiptHandle receipt}
                             :retriable? (fn [{:cognitect.anomalies/keys [category message] :as response}]
                                           (or (retry/default-retriable? response)
                                               (and (= category :cognitect.anomalies/fault)
                                                    (= message "Abruptly closed by peer"))))})

This seems to have solved our issue, but we don't really know what the root cause is. We've never seen it happen with other AWS endpoints, just :DeleteMessage. If it's a common issue, it would be nice if the Cognitect aws-api could somehow categorize this type of exception as retriable.
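The predicate passed as :retriable? is an ordinary function of the response map, so it can be exercised at a REPL without touching AWS. A minimal sketch follows; `abrupt-close?` and the two sample maps are hypothetical names made up for illustration, and only the category and message values come from the report above.

```clojure
(defn abrupt-close?
  "True when the response is the fault anomaly reported in this issue."
  [{:cognitect.anomalies/keys [category message]}]
  (and (= category :cognitect.anomalies/fault)
       (= message "Abruptly closed by peer")))

;; Hand-built sample responses (hypothetical, not captured from AWS).
(def handshake-failure
  {:cognitect.anomalies/category :cognitect.anomalies/fault
   :cognitect.anomalies/message  "Abruptly closed by peer"})

(def other-fault
  {:cognitect.anomalies/category :cognitect.anomalies/fault
   :cognitect.anomalies/message  "Something else went wrong"})

(abrupt-close? handshake-failure) ;; => true
(abrupt-close? other-fault)       ;; => false
(abrupt-close? {})                ;; => false
```

Matching on the exact message keeps the check narrow, so unrelated faults still surface to the caller instead of being retried.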

Stack traces

 {:cognitect.anomalies/category :cognitect.anomalies/fault,
  :cognitect.anomalies/message  "Abruptly closed by peer",
  :cognitect.http-client/throwable #error {:cause "Abruptly closed by peer"
  :via
 [{:type javax.net.ssl.SSLHandshakeException
   :message "Abruptly closed by peer"
   :at [org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint fill "SslConnection.java" 769]}]
   :trace
   [[org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint fill "SslConnection.java" 769]
    [org.eclipse.jetty.client.http.HttpReceiverOverHTTP process "HttpReceiverOverHTTP.java" 164]
    [org.eclipse.jetty.client.http.HttpReceiverOverHTTP receive "HttpReceiverOverHTTP.java" 79]
    [org.eclipse.jetty.client.http.HttpChannelOverHTTP receive "HttpChannelOverHTTP.java" 131]
    [org.eclipse.jetty.client.http.HttpConnectionOverHTTP onFillable "HttpConnectionOverHTTP.java" 172]
    [org.eclipse.jetty.io.AbstractConnection$ReadCallback succeeded "AbstractConnection.java" 311]
    [org.eclipse.jetty.io.FillInterest fillable "FillInterest.java" 105]
    [org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint onFillable "SslConnection.java" 555]
    [org.eclipse.jetty.io.ssl.SslConnection onFillable "SslConnection.java" 410]
    [org.eclipse.jetty.io.ssl.SslConnection$2 succeeded "SslConnection.java" 164]
    [org.eclipse.jetty.io.FillInterest fillable "FillInterest.java" 105]
    [org.eclipse.jetty.io.ChannelEndPoint$1 run "ChannelEndPoint.java" 104]
    [org.eclipse.jetty.util.thread.strategy.EatWhatYouKill runTask "EatWhatYouKill.java" 338]
    [org.eclipse.jetty.util.thread.strategy.EatWhatYouKill doProduce "EatWhatYouKill.java" 315]
    [org.eclipse.jetty.util.thread.strategy.EatWhatYouKill tryProduce "EatWhatYouKill.java" 173]
    [org.eclipse.jetty.util.thread.strategy.EatWhatYouKill run "EatWhatYouKill.java" 131]
    [org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread run "ReservedThreadExecutor.java" 409]
    [org.eclipse.jetty.util.thread.QueuedThreadPool runJob "QueuedThreadPool.java" 883]
    [org.eclipse.jetty.util.thread.QueuedThreadPool$Runner run "QueuedThreadPool.java" 1034]
    [java.lang.Thread run nil -1]]}}}

dchelimsky commented 2 years ago

Thanks for the input. This is complicated by the fact that the interpretation of SSLHandshakeException as a fault happens outside aws-api, in cognitect/http-client. We're also planning to support "bring your own http client" at some point in the future, which takes control over how exceptions are interpreted as anomalies even further out of our hands.

You are using the right escape hatch in the way it is intended. At the very least, we should update the README to explain how this works and suggest using a custom retriable? function when you run into scenarios like this.
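To make the interplay between a retriable? predicate and a backoff function concrete, here is a simplified, synchronous model of a retry loop. It is an illustrative sketch only, not aws-api's implementation: `invoke-with-retry`, `fixed-backoff`, `flaky-request`, and `fault?` are hypothetical names, and the assumed contract is that the backoff function takes the number of retries so far and returns the milliseconds to wait, or nil to give up.

```clojure
(defn invoke-with-retry
  "Calls request-fn until the response is not retriable or backoff returns nil."
  [request-fn retriable? backoff]
  (loop [retries 0]
    (let [response (request-fn)
          ;; Only consult backoff when the predicate says the response is retriable.
          wait-ms  (when (retriable? response) (backoff retries))]
      (if wait-ms
        (do (Thread/sleep (long wait-ms))
            (recur (inc retries)))
        response))))

(defn fixed-backoff
  "Returns a backoff fn that waits wait-ms and gives up after max-retries."
  [wait-ms max-retries]
  (fn [retries]
    (when (< retries max-retries) wait-ms)))

(defn flaky-request
  "Returns a request fn that yields a fault n-failures times, then succeeds."
  [n-failures]
  (let [calls (atom 0)]
    (fn []
      (if (<= (swap! calls inc) n-failures)
        {:cognitect.anomalies/category :cognitect.anomalies/fault
         :cognitect.anomalies/message  "Abruptly closed by peer"}
        {:ok true}))))

(defn fault?
  "Hypothetical predicate: treat every fault as retriable."
  [response]
  (= :cognitect.anomalies/fault (:cognitect.anomalies/category response)))

;; Two failures followed by a success, with up to three retries allowed.
(invoke-with-retry (flaky-request 2) fault? (fixed-backoff 10 3))
;; => {:ok true}
```

In this model a predicate that never returns true hands the first fault straight back to the caller, which mirrors the situation described in this issue.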

dannyfreeman commented 2 years ago

Thanks for the response! The retriable? workaround we have right now is a fine solution for us. I'm sure other people would appreciate having something about it in the README. The docstrings in the library were very helpful and pointed us in the right direction in that regard.

If you think it's worthwhile for me to keep chasing this down, is there a way I could raise this issue with the cognitect/http-client repository? I have no idea where it is hosted.

dchelimsky commented 2 years ago

Thanks for offering to help, but the cognitect/http-client is not hosted in a public repo. It's open source in that you can look at the source, but it is not open for contribution.

dchelimsky commented 2 years ago

Hey @dannyfreeman, I added a "retriable errors" section to the README. I'm going to close this issue, but feel free to add comments here if you have any. We can always reopen it if it's useful.

dannyfreeman commented 2 years ago

@dchelimsky that extra info in the README looks great, thanks for updating it.

As an update, we started seeing this error when calling other AWS APIs. Some specific ones are CloudWatch :PutMetricData and SNS :PublishBatch.

I think we may be seeing something related to this ticket here: https://github.com/cognitect-labs/aws-api/issues/127

Once we saw that it wasn't isolated to SQS and the DeleteMessage endpoint, we overrode the default retriable? for all of our clients to check for this specific exception:

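;; Make sure the library's retry namespace is loaded before referring to it.
(require '[cognitect.aws.retry])

;; True for the anomaly in this issue: a fault with the message
;; "Abruptly closed by peer" whose throwable is an SSLHandshakeException.
(defn- aws-ssl-ca-error?
  [{:cognitect.anomalies/keys [category message] :as resp}]
  (and (= category :cognitect.anomalies/fault)
       (= message "Abruptly closed by peer")
       (instance? javax.net.ssl.SSLHandshakeException
                  (:cognitect.http-client/throwable resp))))

;; The library's default check, extended with the SSL check above.
(defn default-retriable?
  [response]
  (or (cognitect.aws.retry/default-retriable? response)
      (aws-ssl-ca-error? response)))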

Then we pass our version of default-retriable? when creating clients, and use it in place of the library's default-retriable? when we want to override the behavior for a specific operation. Hopefully this helps anyone else who runs into the problem.
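A sketch of that wiring, assuming the dependencies listed at the top of this issue are on the classpath. `ssl-abrupt-close?`, `retriable-with-ssl?`, `sqs-client`, and `delete-message!` are hypothetical names for illustration; the :retriable? key on the client config and on the op map is the one discussed in this thread.

```clojure
(require '[cognitect.aws.client.api :as aws]
         '[cognitect.aws.retry :as retry])

(defn ssl-abrupt-close?
  "True for the SSLHandshakeException fault described in this issue."
  [{:cognitect.anomalies/keys [category message] :as resp}]
  (and (= category :cognitect.anomalies/fault)
       (= message "Abruptly closed by peer")
       (instance? javax.net.ssl.SSLHandshakeException
                  (:cognitect.http-client/throwable resp))))

(defn retriable-with-ssl?
  "The library default, extended with the SSL check."
  [response]
  (or (retry/default-retriable? response)
      (ssl-abrupt-close? response)))

;; Client-wide: every invoke on this client uses the custom predicate.
(defn sqs-client []
  (aws/client {:api        :sqs
               :retriable? retriable-with-ssl?}))

;; Per operation: supply the predicate on a single call instead.
(defn delete-message! [client queue-url receipt]
  (aws/invoke client {:op         :DeleteMessage
                      :request    {:QueueUrl      queue-url
                                   :ReceiptHandle receipt}
                      :retriable? retriable-with-ssl?}))
```

Wrapping client creation in a function keeps the namespace free of side effects at load time; whether to set the predicate per client or per call is a judgment call that depends on how widely the error shows up.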

If I can find the time I will try to dive into the issue more and see if I can reliably reproduce it, but that has proven difficult so far.

bowbahdoe commented 2 years ago

@dchelimsky we are also running into this error when publishing to EventBridge.