Support Apache Aurora

SEJeff commented 9 years ago

Aurora uses ZooKeeper to publish service discovery data in a format called "server sets". It would be great if ochopod could read and use this data.
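For illustration, here is a minimal sketch of what reading one server set member might look like. It assumes each member znode holds a JSON payload with `serviceEndpoint`, `additionalEndpoints`, `status` and `shard` fields, which is the layout commonly described for server sets; the `parse_member` helper, the `Endpoint` type and the sample payload are hypothetical, and fetching the payload from ZooKeeper is left out.

```python
import json
from typing import Dict, NamedTuple


class Endpoint(NamedTuple):
    host: str
    port: int


def parse_member(payload: bytes) -> Dict[str, Endpoint]:
    """Decode one server set member payload into its named endpoints.

    Assumed layout (not verified against a live Aurora cluster):
    {"serviceEndpoint": {...}, "additionalEndpoints": {...},
     "status": "ALIVE", "shard": 0}
    Members that are not ALIVE yield an empty mapping.
    """
    data = json.loads(payload.decode("utf-8"))
    if data.get("status") != "ALIVE":
        return {}
    endpoints = {}
    primary = data.get("serviceEndpoint")
    if primary:
        endpoints["serviceEndpoint"] = Endpoint(primary["host"], int(primary["port"]))
    for name, ep in data.get("additionalEndpoints", {}).items():
        endpoints[name] = Endpoint(ep["host"], int(ep["port"]))
    return endpoints


# Hypothetical payload, shaped like the assumed member format.
sample = (
    b'{"serviceEndpoint": {"host": "10.0.0.5", "port": 31234}, '
    b'"additionalEndpoints": {"http": {"host": "10.0.0.5", "port": 31234}}, '
    b'"status": "ALIVE", "shard": 0}'
)
print(parse_member(sample))
```

In a real integration the payloads would come from the children of the job's znode, read with whatever ZooKeeper client ochopod already uses.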

opaugam commented 9 years ago

Hey Jeff,

I agree. I have to confess I haven't used Aurora yet and will first need to get familiar with its API. Just wondering: if I asked you how Aurora differs from (or intersects with) Marathon, what would you say?

SEJeff commented 9 years ago

Aurora is a superset of Marathon designed to be bulletproof for operations and easy to use for developers. It is a bit harder to set up than Marathon, but I find the command-line tools for day-to-day use and the admin client to be far superior to the Marathon CLI gem or the curl-based interface.

Some of it is little stuff, like the Thermos executor, which forks a process that runs health checks inside the container instead of running them from the scheduler directly (as Marathon does). Instead of a JSON document (like Marathon), Aurora uses a simple Python DSL for describing jobs. Several engineers from Google who worked on Borg ended up at Twitter and built Aurora on top of Mesos in a similar vein to Borg. They've battle-hardened Aurora to scale to tens of thousands of nodes. Just because Mesos can scale that big does not mean that all frameworks can. I noticed that after too many health checks, the Marathon master scheduler got a bit slower. This doesn't happen with Aurora, by design.

Aurora is also one of the only generic Mesos schedulers (if not the only one) to support pre-emption. If you have a requirement that 10 instances of a production job run at all times, but the cluster is oversubscribed because someone launched 100 dev instances of some application, Aurora will "pre-empt" them by killing the dev tasks, to ensure the production tasks always get the resources they need. One of the big wins for me is that Aurora supports rolling restarts in a graceful manner, something that Marathon still doesn't (see mesosphere/marathon#712: https://github.com/mesosphere/marathon/issues/712#issuecomment-119807562). Yesterday, I did a test scaling an app up from 2 instances to 15 instances and back down to 2 instances in a loop for about half the day. Using aurproxy, once I got the timeouts correct, there wasn't a single HTTP error. As far as all clients were concerned, the app was 100% available even while instances were constantly being killed. I put a bit over 1 million requests through that setup for my test.
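To make the pre-emption idea concrete, here is a toy model in Python. It is only a sketch of the general technique and not Aurora's actual scheduling algorithm: the `Task` type, the `place` function, the CPU-only resource model and the "newest dev task is killed first" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    name: str
    cpus: float
    production: bool


def place(capacity: float, running: List[Task], incoming: Task) -> Tuple[List[Task], List[Task]]:
    """Try to place `incoming`; return (tasks running afterwards, pre-empted tasks).

    Toy rule (an assumption, not Aurora's real policy): a production task may
    evict non-production tasks, newest first, until it fits. If it still does
    not fit, nothing is killed and nothing is placed.
    """
    free = capacity - sum(t.cpus for t in running)
    victims: List[Task] = []
    if incoming.production:
        for task in reversed(running):
            if free >= incoming.cpus:
                break
            if not task.production:
                victims.append(task)
                free += task.cpus
    if free < incoming.cpus:
        return list(running), []
    victim_ids = {id(t) for t in victims}
    survivors = [t for t in running if id(t) not in victim_ids]
    return survivors + [incoming], victims


# A 4-CPU cluster filled with dev tasks, then a 2-CPU production task arrives.
dev_tasks = [Task(f"dev-{i}", 1.0, production=False) for i in range(4)]
after, killed = place(4.0, dev_tasks, Task("prod-0", 2.0, production=True))
print([t.name for t in after], [t.name for t in killed])
```

In this model a dev task arriving on a full cluster is simply refused, while a production task evicts just enough dev tasks to fit.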

For firms that care about security, Aurora supports Kerberos out of the box, and the security is pluggable via Apache Shiro. Marathon, on the other hand, only supports a static set of HTTP basic auth credentials over SSL. This is simply not production-worthy at a large firm like mine, where multi-tenancy must be thought of from the beginning.

Lastly, Marathon is really controlled and developed by a single company, Mesosphere. While they are a fantastic company and there is nothing wrong with that in and of itself, I've had serious trouble getting responses in their IRC channel, #marathon on freenode, and occasional emails to their users list have gone unanswered. In contrast, the Aurora IRC channel, #aurora, is always full of people willing to help. In fact, #aurora is often more helpful than #mesos for managing and deploying Mesos at scale, as many of the large engineering organizations using Mesos are using Aurora.

In short, both are great. Marathon is optimized for getting from installation to a running hello world as quickly as possible. It has a nice UI and great documentation, and is light on features. Aurora is a bit heavier-weight and harder to install, but has clearly been battle-tested in production at a large firm for years. Aurora is optimized for making infrastructure that never goes down.

opaugam commented 9 years ago

Excellent answer, thanks! You got me interested with that pre-emption capability. How does the scheduler fare compared to, for instance, the Diego scheduler from Cloud Foundry?

[Sent as an email reply to Jeff Schroeder's comment of Thursday, July 9, 2015, 11:00 AM, on autodesk-cloud/ochopod: Re: [ochopod] Support Apache Aurora (#19)]

SEJeff commented 9 years ago

@opaugam glad I was able to explain it clearly. I've not used Cloud Foundry, but I do know that the first commit for Aurora was in April 2010 and the first commit for Diego was in June 2014, so Aurora is the more mature of the two. However, from reading an interview with one of the Diego developers, Diego looks more like a competitor to Mesos itself, but without the concept of frameworks.

autodesk-cloud / ochopod

Support Apache Aurora #19