camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0
Best Way to schedule multiple workers? #53

Mythra closed this issue 7 years ago

Mythra commented 7 years ago

I notice in the documentation you've written:

To use Firmament across multiple machines, you need to run a coordinator instance on each machine. 
These coordinators can then be arranged in a tree hierarchy, in which each coordinator can schedule 
tasks locally and on its subordinate children's resources.

Yet you've also written:

The parent coordinator must already be running. Once both coordinators are up, you will be able to 
see the child resources on the parent coordinator's web UI.

The first passage implies that we can run several coordinators (say, three), yet the second seems to imply that only two can be running at once.

I started three coordinators in this pattern:

tcp:10.36.75.73:8000 (no parent uri)
tcp:10.36.65.78:8000 (--parent-uri tcp:10.36.75.73:8000)
tcp:10.36.71.204:8000 (--parent-uri tcp:10.36.75.73:8000)

Yet the topology map shows only two hosts. I assume this is due to the project being "alpha stage" as mentioned, which is totally understandable. I just want to make sure I'm not going crazy.

ms705 commented 7 years ago

Hi @SecurityInsanity,

You can definitely have any number of coordinators, not just two (we've routinely run Firmament with up to 40).

The pattern in which you start them up is also correct (modulo the command-line parameter being called --parent_uri, IIRC). Do your machines have different hostnames? Some places in the system use hostnames, so if two machines share the same default hostname, incorrect information might be displayed. We try to use UUIDs instead of hostnames, but it's possible that something has slipped through the net.
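
A quick way to check for the hostname problem described above is to collect the hostname each machine reports and look for duplicates. The sketch below is only an illustration and is not part of Firmament: the function name and the sample hostnames are made up, and only the IP addresses come from this thread.

```python
from collections import defaultdict


def find_hostname_collisions(hostnames_by_address):
    """Return {hostname: [addresses]} for every hostname reported by more than one machine."""
    addresses_by_hostname = defaultdict(list)
    for address, hostname in hostnames_by_address.items():
        addresses_by_hostname[hostname].append(address)
    return {
        hostname: sorted(addresses)
        for hostname, addresses in addresses_by_hostname.items()
        if len(addresses) > 1
    }


# Hypothetical inventory: the IPs are from this thread, the hostnames are invented.
reported = {
    "10.36.75.73": "coordinator-root",
    "10.36.65.78": "localhost",
    "10.36.71.204": "localhost",
}

collisions = find_hostname_collisions(reported)
for hostname, addresses in collisions.items():
    print(f"hostname {hostname!r} is shared by: {', '.join(addresses)}")
```

On each machine, the value to feed in could be obtained with `socket.gethostname()` or the `hostname` command; any hostname that shows up here should be made unique before starting the coordinators.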

To debug this, it would be helpful to have the log output from each coordinator, and in particular the output from the one on 10.36.75.73 (which should see the child coordinators register in its output). You can turn on more verbose output by passing --v=1 to the coordinator binaries.
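
As a rough sketch, the three-coordinator layout from this issue with verbose logging turned on could be assembled as below. Several things here are assumptions rather than confirmed details: the binary path `build/src/coordinator`, the `--listen_uri` flag name, and the helper function itself are illustrative, while `--parent_uri` (as recalled above) and `--v=1` come from this thread.

```python
import shlex


def coordinator_command(listen_uri, parent_uri=None, verbosity=None,
                        binary="build/src/coordinator"):
    """Build the argument list for one coordinator process.

    The binary path and the --listen_uri flag are assumptions; check them
    against your build before relying on this.
    """
    args = [binary, "--listen_uri", listen_uri]
    if parent_uri is not None:
        # Child coordinators register with an already-running parent.
        args += ["--parent_uri", parent_uri]
    if verbosity is not None:
        args.append(f"--v={verbosity}")
    return args


root = "tcp:10.36.75.73:8000"
children = ["tcp:10.36.65.78:8000", "tcp:10.36.71.204:8000"]

# The parent comes first, since it must already be running when children start.
commands = [coordinator_command(root, verbosity=1)]
commands += [coordinator_command(child, parent_uri=root, verbosity=1)
             for child in children]

for command in commands:
    print(shlex.join(command))
```

Each printed line is meant to be run on the machine that owns the corresponding address. The output of the first command (the parent) is the one where the child registrations should appear.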

Mythra commented 7 years ago

Hey @ms705,

Thanks for following up so quickly!

Hmmm, I do believe these two:

tcp:10.36.65.78:8000 (--parent-uri tcp:10.36.75.73:8000)
tcp:10.36.71.204:8000 (--parent-uri tcp:10.36.75.73:8000)

Both have the exact same hostname. I'll go change those, and if it still seems fishy I'll send you some verbose logs.

Mythra commented 7 years ago

Yep, setting unique hostnames seemed to work, @ms705. Thanks much, and great work :wave: