cognitect-labs / aws-api

AWS, data driven
Apache License 2.0
724 stars 100 forks source link

Creds fail to load (from env) when Java11 snap-start enabled #225

Closed stevebuik closed 1 year ago

stevebuik commented 1 year ago

Thank you for your interest in helping to improve Cognitect's aws-api!

Dependencies

Be sure to list the precise libs and versions you are using ("the latest" might change by the time we're looking at your issue).

e.g.

{:deps {com.cognitect.aws/api                      {:mvn/version "0.8.630"}
           com.cognitect.aws/endpoints                {:mvn/version "1.1.12.358"}
           com.cognitect.aws/lambda                   {:mvn/version "825.2.1263.0"}
           com.cognitect.aws/apigatewaymanagementapi  {:mvn/version "821.2.1107.0"}
.. others elided
}}

Description with failing test case

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]
  (println (sort (keys (System/getenv))))
  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (reify CredentialsProvider
                                            (fetch [_]
                                              {:aws/access-key-id     (u/getenv "AWS_ACCESS_KEY_ID")
                                               :aws/secret-access-key (u/getenv "AWS_SECRET_ACCESS_KEY")
                                               :aws/session-token     (u/getenv "AWS_CONTAINER_AUTHORIZATION_TOKEN")}))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))

Stack traces

{:message "The security token included in the request is invalid.", :cognitect.anomalies/category :cognitect.anomalies/forbidden}

stevebuik commented 1 year ago

works fine when snapstart is disabled (albeit with slow cold start) and AWS_SESSION_TOKEN used instead

stevebuik commented 1 year ago

further tests using container creds led to this variation of the fn

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]

  (println
    "creds response"
    (u/getenv ec2/container-credentials-full-uri-env-var)
    (pr-str
      (a/<!!
        (http/submit
          (cognitect.http-client/create {})
          (#'mu/request-map (URI. (u/getenv ec2/container-credentials-full-uri-env-var)))))))

  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (credentials/container-credentials-provider (shared/http-client))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))

which generates this log response...

creds response http://127.0.0.1:9001/2021-04-23/credentials {:status 404, :headers {"x-content-type-options" "nosniff", "content-length" "46", "date" "Thu, 15 Dec 2022 03:51:06 GMT", "content-type" "text/plain; charset=UTF-8"}, :body #object[java.nio.HeapByteBuffer 0x3d526ad9 "java.nio.HeapByteBuffer[pos=0 lim=46 cap=46]"]}

which I presume means that the ec2 credentials request process does not work in the lambda snapstart environment

stevebuik commented 1 year ago

this works!!

(defn snapstart-credentials-provider
  "return a creds provider that uses the 2 Snapstart environment variables to get current keys/tokens for API requests"
  [http-client]
  (reify CredentialsProvider
    (fetch [_]
      (let [request (-> (u/getenv mu/container-credentials-full-uri-env-var)
                        (URI.)
                        (#'mu/request-map)                  ; TODO has https bug but doesn't affect this provider which uses http/80
                        (assoc-in [:headers "Authorization"] (u/getenv "AWS_CONTAINER_AUTHORIZATION_TOKEN")))
            response (->> request
                          (http/submit http-client)
                          (a/<!!))]
        (when (= 200 (:status response))
          (let [creds (some-> (:body response) (u/bbuf->str) (json/read-str :key-fn keyword))]
            (credentials/valid-credentials
              {:aws/access-key-id     (:AccessKeyId creds)
               :aws/secret-access-key (:SecretAccessKey creds)
               :aws/session-token     (:Token creds)
               ::credentials/ttl      (credentials/calculate-ttl creds)}
              "snapstart container")))))))

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]
  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (snapstart-credentials-provider (http/create {}))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))
stevebuik commented 1 year ago

with that impl, the cold start first request is a bit slow... REPORT RequestId: 53dd2192-1c24-456d-b226-f3e675810ced Duration: 7296.40 ms Billed Duration: 7458 ms Memory Size: 1024 MB Max Memory Used: 174 MB Restore Duration: 372.46 ms

but subsequent requests are as quick as you would expect... REPORT RequestId: b6cabdbe-d556-4b89-9313-6ebfc609814c Duration: 187.31 ms Billed Duration: 188 ms Memory Size: 1024 MB Max Memory Used: 175 MB

not sure why the slow first request but it does work. I'll move forward with this until an official release (that might improve cold start perf as well?)

stevebuik commented 1 year ago

ok, that slow first request has nothing to do with creds.

it's due to the :PostToConnection call which takes 5-7 secs on first call and 200ms in subsequent calls. I'd guess this is some kind of keep-alive in the client (can we control this?) or aws network init delay (which we can't control)

Either way, this looks like a good creds solution. just needs the official port of this impl. let me know if you need any other info

dchelimsky commented 1 year ago

Thanks for digging this up. I think we can fix this in the ec2-metadata-utils (add the header there if the env var is present). That's what the java sdk v2 is doing. I'm surprised this hasn't come up until now!

dchelimsky commented 1 year ago

I pushed a speculative fix to the dc/container-credentials-auth-header branch. It might be a few days before I'll have time to test this properly, so please feel free to build and test it locally.

stevebuik commented 1 year ago

Thanks. I'll try it today. I have a clarification question. With my solution, I had trouble making it work with (shared/http-client) so that's why I used (http/create {})

there are 2 http client protocols available in this project: cognitect.http-client/IClient (submittable) cognitect.aws.http/HttpClient (stoppable)

The container creds fn needs the submittable type. When it's part of the chain it appears to be passed the stoppable. Is this a bug or am I thinking about this wrong?

I ask because I've experienced the client memory leak in Ions in the past so I want to make sure I use this correctly (even though Lambdas are less prone to leaks i.e. cattle not pets)

dchelimsky commented 1 year ago

cognitect.http-client/IClient is part of a dependency, not part of aws-api.

cognitect.aws.http/HttpClient has -submit and -stop, so it's both submittable and stoppable. Structurally, the cognitect http-client is always wrapped in a reification of cognitect.aws.http/HttpClient, created by calling cognitect.aws.http.cognitect/create.

So either gives you instances of the same thing, though (shared/http-client) will always return the same object, so it's possible that that's the source of any trouble you saw since the aws-client's http-client is waiting for the credential provider's http-client to return before it can complete its work. That shouldn't be an issue, but bugs happen, so let me know if you still see the same problem.

stevebuik commented 1 year ago

thanks for clarifying. I just tested your fix. don't fully understand the error but using

:credentials-provider (ec2/container-credentials (shared/http-client))

produces...

:message "No implementation of method: :fetch of protocol: #'cognitect.aws.credentials/CredentialsProvider found for class: clojure.lang.PersistentArrayMap"

when I switch back to my custom provider using the same commit, it works fine.

Since my feedback loop via CI is slow, I'm gonna pause testing your code because I need to focus on my deliverable. So I'll wait till you have time to setup a test harness - that'll be more effiecient for both of us.

Happy to provide any other info in the meantime or when you are setting up.

No hurry as I have a working solution till then.

dchelimsky commented 1 year ago

Actually, with my fix you shouldn't need to provide a provider at all. If you have time, give that a shot. If not, it can wait.

stevebuik commented 1 year ago

That was going to be my last test :) I just ran it and it works! I'll stick with your commit now and will report any other issues that arise.

stevebuik commented 1 year ago

this is now working for multiple lambdas and multiple AWS apis. I would suggest this can be closed but....

I did uncover another bug that is related to this. It's that the TTL on the container creds doesn't appear to be refreshing after a snapstart resume. I'll attach my ADR doc to this comment. It contains details on the problem and my (verified) workaround.

Since I have a workaround, it's not urgent, but I figure you want to know if a TTL isn't working properly i.e. I suspect this is broken outside of the snapstart use-case as well. adr-0002-cold-starts.md

dchelimsky commented 1 year ago

Fix released in aws-api-0.8.635

dchelimsky commented 1 year ago

Fix released in aws-api-0.8.641