CovertLab / sisyphus

Eternally execute tasks
0 stars 0 forks source link

Using gce metadata to configure rabbit queue #21

Closed prismofeverything closed 4 years ago

prismofeverything commented 4 years ago

This PR enables Sisyphus to look to its own GCE instance metadata to configure which rabbit exchange, queue and routing key to use for consuming task messages.

Currently this works in theory, but I am not getting results back from log/gce-metadata even though those metadata fields are set. Further investigation is necessary, so this PR is not yet ready to merge.

One thing to point out is that the values for rabbit configuration are able to come from four different places, so I had to make a decision about the priority order for setting these values. Currently:

  1. Command line options (won't be used on servers, but useful for testing)
  2. GCE metadata fields (currently this is the request that is not working)
  3. The config file at resources/config/sisyphus.clj
  4. The default values defined in the sisyphus.rabbit/default-config ns.
1fish2 commented 4 years ago

(My previous msg got stuck in the outbox while at the gym.)

The http request for metadata is a feature of gce VMs. Anyway gce metadata is set when launching a gce vm so their isn't any when running off gce. I like your debug config for that. It can try to get metadata and fallback to the config file.

Jerry

On Thu, Nov 21, 2019, 12:36 PM Ryan Spangler notifications@github.com wrote:

@prismofeverything commented on this pull request.

In src/sisyphus/core.clj https://github.com/CovertLab/sisyphus/pull/21#discussion_r349305458:

@@ -134,10 +135,20 @@ (update state :kafka merge consumer)))

(defn start!

  • "Start the system by making all the required connections and returning the state map."
  • "Start the system by making all the required connections and returning the state map.
  • The priority of values for rabbit config are
    1. command line options
    1. gce metadata fields
    1. values from file resources/config/sisyphus.clj
    1. defaults from ns rabbit/default-config"

and then sisyphus/rabbit code derives the 3 rabbit queuing parameters from that workflow name?

Ah I like this, this is the way to go. Yeah let's just supply one parameter and sisyphus can work out the details.

Is gce-metadata not getting results due to not running on a GCE VM? I tested it on a GCE worker node.

Ah yeah, maybe doesn't work from a remote machine? I am leaning to doing this with the compute service from the previous PR instead of an HTTP call where you have to worry about passing in credentials etc if you are not in the network. I'll look into it.

— You are receiving this because your review was requested. Reply to this email directly, view it on GitHub https://github.com/CovertLab/sisyphus/pull/21?email_source=notifications&email_token=AAH6VMFXX635OD2WGQFEKVLQU3WN5A5CNFSM4JP3PYJ2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCMSK2OY#discussion_r349305458, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAH6VMBYERWM4FYFR53QJCTQU3WN5ANCNFSM4JP3PYJQ .

prismofeverything commented 4 years ago

Okay @1fish2, latest update has:

1fish2 commented 4 years ago

Most excellent!

I agree about the remaining cases that catch any Exception since it logs the exception and punts on that chunk of work.

One detail: In log/gce-metadata, http/get throws UnknownHostException if it's not running on GCE and thus can't access "http://metadata.google.internal", but it throws a different exception if the requested metadata field like "attributes/workflow" wasn't set. Missing attributes will happen when we change the software on one server at a time.

I can't tell if "slingshot" rethrows the exception or if it's the fallback catch clause, but this will catch the exception:

(try (clj-http/get "http://metadata.google.internal/computeMetadata/v1/instance/xxx"
        {:headers {:metadata-flavor "Google"}})
 (catch clojure.lang.ExceptionInfo e ...))

Since clojure.lang.ExceptionInfo seems to be a general purpose exception like Exception, the code should distinguish the expected case from unexpected cases. (.getMessage e) returns "clj-http: status 404" and (.data e) returns

{:cached nil, :request-time 8, :repeatable? false,
 :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true,
 :http-client #object[org.apache.http.impl.client.InternalHttpClient 0x4cc49356 "org.apache.http.impl.client.InternalHttpClient@4cc49356"],
 :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase "Not Found",
 :headers {"Metadata-Flavor" "Google", "Date" "Thu, 05 Dec 2019 00:48:08 GMT", "Content-Type" "text/html; charset=UTF-8", "Server" "Metadata Server for VM", "Connection" "Close", "Content-Length" "1592", "X-XSS-Protection" "0", "X-Frame-Options" "SAMEORIGIN"},
 :orig-content-encoding nil, :status 404, :length 1592,
 :body "<!DOCTYPE html>\n<html lang=en>\n  <meta charset=utf-8>\n  <meta name=viewport content=\"initial-scale=1, minimum-scale=1, width=device-width\">\n  <title>Error 404 (Not Found)!!1</title>\n  <style>\n    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}\n  </style>\n  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>\n  <p><b>404.</b> <ins>That’s an error.</ins>\n  <p>The requested URL <code>/computeMetadata/v1/instance/xxx</code> was not found on this server.  <ins>That’s all we know.</ins>\n",
 :trace-redirects []}
1fish2 commented 4 years ago

That experimental code snippet followed (require '[clj-http.client :as clj-http]) since gaia.core requires aleph.http :as http.

Is it OK in this case to just handle clojure.lang.ExceptionInfo as an unset metadata field? It's not obvious to me what expanse of code raises ExceptionInfo.

Hmm, ex-data?

prismofeverything commented 4 years ago

Hey @1fish2, I'm not sure about all this. I thought ExceptionInfo was a class that describes whatever exception got thrown, not an exception in its own right. It is a way to wrap the exception and associate a map of data along with it. The real exception should be in that data structure, though you are right this may be the only way to catch it. More investigation is required.

prismofeverything commented 4 years ago

Okay @1fish2, looking into this further it turns out clj-http is throwing exceptions on status failures by default, and has an option to turn this off (:throw-exceptions false, see https://github.com/CovertLab/sisyphus/pull/21/commits/0d3ddbc42c85ca14130563f960b29f4f613e8250)

So we can handle that case directly and not have to do it through exception handling. I set up gce-metadata to always return either the requested field or nil, so callers can find the value another way. Let me know what you think.

1fish2 commented 4 years ago

Great! I'd probably explain in the comment that the expected failure modes are UnknownHostException when not running on GCE and not-OK when the requested field wasn't set. If we ever need to treat those two differently, we won't have to reconstruct the backstory.

clj-http says "batteries included" but I don't see where it defines symbolic status codes. I'm fine with 200 here but I'm going one step further in a Gaia fix to the error handling. I'll just return status 400 if there was any handler exception rather than distinguishing client errors from server errors, but it's getting closer to wanting symbolic codes.

prismofeverything commented 4 years ago

It turns out we get a 404 if the metadata field does not exist, which translates to "not found". I added a comment in there. I figure if people want to know if they are on GCE or not they can use a method for that specifically, this function doesn't need to act as dual GCE detector/metadata fetcher.

clj-http says "batteries included" but I don't see where it defines symbolic status codes.

What do you mean exactly by "symbolic status codes"? Currently it is using standard HTTP response codes, and I actually appreciate the integers for this purpose as they map exactly on to the HTTP standard.

1fish2 commented 4 years ago

Yesterday I was thinking that we might later want different behavior for debugging off of GCE vs. the ordinary case of reading an attribute that wasn't set. But we can use a different config file for debugging.

A few HTTP status codes like 404 are as well known as their names, but for most of them I'd recognize the names but not the numbers.