Closed npiasco closed 4 years ago
I read through BistroWorkerHandler.{cpp,h}
with a bit of attention, and I think what's going on is this. Resource reporting does not yet start when the worker is constructed -- so your log with an empty struct from line 100 is not diagnostic.
For the scheduler to learn the worker's resources, two things must happen:
So looking at your description, I don't see anything that shouldn't work. However, it might take healthcheck_interval + 2 * heartbeat_interval
seconds for the scheduler to be able to run anything on the worker.
I might be missing something here.
It would be easier to be sure if you could share the full scheduler & worker logs for the failure scenario.
Thank you for your quick answer.
Here the log of the worker launched with cmd ./cmake/Debug/worker/bistro_worker --scheduler_host=:: --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" --data_dir=/tmp/bistro_worker --nvidia_smi=/usr/bin/nvidia-smi
:
W0518 18:05:15.937683 9049 server_socket.cpp:88] Found no 10 interfaces that are not link-local or loopback
I0518 18:05:15.939427 9049 LogWriter.cpp:77] Created table stderr
I0518 18:05:15.939468 9049 LogWriter.cpp:77] Created table stdout
I0518 18:05:15.939507 9049 LogWriter.cpp:77] Created table statuses
I0518 18:05:15.939855 9050 AutoTimer.h:143] Pruned logs with cutoff 1587225915 in 67.98 us
I0518 18:05:17.254093 9049 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker {
1: shard (string) = "alphacar-trx40-ip39",
2: machineLock (struct) = MachinePortLock {
1: hostname (string) = "alphacar-trx40-ip39",
2: port (i32) = 33237,
},
3: addr (struct) = ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 33237,
},
4: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589817917,
2: rand (i64) = 3980752265297529627,
},
5: heartbeatPeriodSec (i32) = 15,
6: protocolVersion (i16) = 2,
7: usableResources (struct) = UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
},
}
W0518 18:05:17.258365 9055 BistroWorkerHandler.cpp:665] Waiting for this worker to start listening on ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 33237,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0518 18:05:18.263059 9191 AutoTimer.h:143] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 300 ns
I0518 18:05:18.263224 9191 LogWriter.cpp:218] Got 0 statuses lines
I0518 18:05:18.270023 9055 BistroWorkerHandler.cpp:698] Connected to new scheduler SchedulerHeartbeatResponse {
1: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589817900,
2: rand (i64) = -5012333484077636209,
},
2: maxHealthcheckGap (i32) = 125,
3: heartbeatGracePeriod (i32) = 60,
4: loseUnhealthyWorkerAfter (i32) = 500,
5: workerCheckInterval (i32) = 5,
6: workerState (i32) = 0,
7: protocolVersion (i16) = 2,
8: workerSetID (struct) = WorkerSetID {
1: schedulerID (struct) = BistroInstanceID {
1: startTime (i64) = 1589817900,
2: rand (i64) = -5012333484077636209,
},
2: version (i64) = 1,
3: hash (struct) = WorkerSetHash {
1: startTime (struct) = SetHash {
1: addAll (i64) = 1589817917,
2: xorAll (i64) = 1589817917,
},
2: rand (struct) = SetHash {
1: addAll (i64) = 3980752265297529627,
2: xorAll (i64) = 3980752265297529627,
},
3: numWorkers (i32) = 1,
},
},
}
W0518 18:05:18.272003 9190 BistroWorkerHandler.cpp:279] CGroups set: CGroupOptions {
1: root (string) = "",
2: slice (string) = "",
3: subsystems (list) = list<string>[0] {
},
4: unitTestCreateFiles (bool) = false,
5: cpuShares (i16) = 0,
6: memoryLimitInBytes (i64) = 0,
7: killWithoutFreezer (bool) = false,
}
I0518 18:05:18.295586 9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589817918
I0518 18:05:18.309382 9325 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589817918 quit with status '{"result_bits":4,"time":1589817918}'
I0518 18:05:34.259470 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:35.259667 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:36.259869 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:37.260035 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:38.260329 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:39.260982 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:40.261147 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:41.261797 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:42.262528 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:43.262707 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:44.262872 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:45.263036 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:46.263200 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:47.263365 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:48.263525 9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:48.275070 9055 BistroWorkerHandler.cpp:635] Became healthy
I0518 18:06:18.281457 9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589817978
I0518 18:06:18.292491 9331 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589817978 quit with status '{"result_bits":4,"time":1589817978}'
I0518 18:07:18.291656 9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589818038
I0518 18:07:18.304718 9336 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589818038 quit with status '{"result_bits":4,"time":1589818038}'
I0518 18:08:18.301235 9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589818098
I0518 18:08:18.312572 9341 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589818098 quit with status '{"result_bits":4,"time":1589818098}'
Log from the sheduler with command ./cmake/Debug/server/bistro_scheduler --server_port=6789 --http_server_port=6855 --config_file=scripts/test_configs/pys_res --CAUTION_startup_wait_f or_workers=1 --instance_node_name=scheduler --clean_statuses
:
I0518 18:05:00.648990 8755 AutoTimer.h:143] Read config from /media/alphacar/XFS/workspace/lib/bistro/bistro/scripts/test_configs/pys_res in 33.18 us
I0518 18:05:00.649794 8755 AutoTimer.h:143] Parsed config with 1 jobs in 736.1 us
I0518 18:05:00.649979 8755 AutoTimer.h:143] Have 7 nodes after manual in 40.5 us
I0518 18:05:00.650238 8758 Monitor.cpp:77] Updating monitor histogram (/media/alphacar/XFS/workspace/lib/bistro/bistro/monitor/Monitor.cpp:63): Monitor transiently not making a histogram for simple_job since it is not loaded
E0518 18:05:00.650840 8755 HTTPMonitorServer.cpp:110] Not enabling HTTPS support because one of the environment variables THRIFT_TLS_CL_CERT_PATH, THRIFT_TLS_CL_KEY_PATH, or THRIFT_TLS_SRV_CA_PATH was not set
W0518 18:05:00.650913 8760 RemoteWorkerRunner.cpp:93] RemoteWorkerRunner initial wait (/media/alphacar/XFS/workspace/lib/bistro/bistro/runners/RemoteWorkerRunner.cpp:79): DANGER! DANGER! Your --CAUTION_startup_wait_for_workers of 1 is l
ower than the max healthcheck gap of 125, which makes it very likely that you will start second copies of tasks that are already running (unless your heartbeat interval is much smaller). No initial worker set ID consensus. Waiting for al
l workers to connect before running tasks.
I0518 18:05:00.651414 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:00.655133 8755 HTTPMonitorServer.cpp:171] Launched HTTP Monitor Server on port 6855, result 0
I0518 18:05:05.651787 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:10.650570 8756 AutoTimer.h:143] Config was already fresh in 21.39 us
I0518 18:05:10.652154 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:15.652770 8761 Bistro.cpp:182] Idle wait...
W0518 18:05:18.268651 8898 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:18.268687 8898 RemoteWorker.cpp:182] Initializing new worker: BistroWorker {
1: shard (string) = "alphacar-trx40-ip39",
2: machineLock (struct) = MachinePortLock {
1: hostname (string) = "alphacar-trx40-ip39",
2: port (i32) = 33237,
},
3: addr (struct) = ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 33237,
},
4: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589817917,
2: rand (i64) = 3980752265297529627,
},
5: heartbeatPeriodSec (i32) = 15,
6: protocolVersion (i16) = 2,
7: usableResources (struct) = UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
},
}
I0518 18:05:18.269150 8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 28.63 us
I0518 18:05:18.272475 8759 AutoTimer.h:143] Got 0 running tasks from worker BistroWorker {
1: shard (string) = "alphacar-trx40-ip39",
2: machineLock (struct) = MachinePortLock {
1: hostname (string) = "alphacar-trx40-ip39",
2: port (i32) = 33237,
},
3: addr (struct) = ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 33237,
},
4: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589817917,
2: rand (i64) = 3980752265297529627,
},
5: heartbeatPeriodSec (i32) = 15,
6: protocolVersion (i16) = 2,
7: usableResources (struct) = UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
},
} in 1.789 ms
I0518 18:05:18.272531 8759 AutoTimer.h:143] Recorded 0 new running tasks for alphacar-trx40-ip39 in 10.25 us
I0518 18:05:19.260581 8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589817918
I0518 18:05:20.650768 8756 AutoTimer.h:143] Config was already fresh in 20.97 us
W0518 18:05:20.652724 8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:20.652770 8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:20.652782 8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:20.653138 8761 Bistro.cpp:182] Idle wait...
W0518 18:05:25.652928 8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:25.652969 8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:25.652981 8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:25.653486 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:30.651010 8756 AutoTimer.h:143] Config was already fresh in 22.61 us
W0518 18:05:30.653170 8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:30.653215 8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:30.653228 8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:30.653877 8761 Bistro.cpp:182] Idle wait...
W0518 18:05:33.272017 8898 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:33.272078 8898 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:35.653426 8760 RemoteWorkers.cpp:135] Worker alphacar-trx40-ip39 has not been healthy, but WorkerSetID consensus allows it.
I0518 18:05:35.653462 8760 RemoteWorker.cpp:201] Worker alphacar-trx40-ip39 became healthy
I0518 18:05:35.654237 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:40.651409 8756 AutoTimer.h:143] Config was already fresh in 155.3 us
I0518 18:05:40.654633 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:45.655014 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:50.652385 8756 AutoTimer.h:143] Config was already fresh in 21.33 us
I0518 18:05:50.655412 8761 Bistro.cpp:182] Idle wait...
I0518 18:05:55.655776 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:00.652582 8756 AutoTimer.h:143] Config was already fresh in 22.6 us
I0518 18:06:00.656126 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:05.656486 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:10.653295 8756 AutoTimer.h:143] Config was already fresh in 22.15 us
I0518 18:06:10.656982 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:15.657380 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:18.279641 8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 49.04 us
I0518 18:06:18.302333 8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589817978
I0518 18:06:20.653497 8756 AutoTimer.h:143] Config was already fresh in 20.78 us
I0518 18:06:20.657994 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:25.658383 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:30.653698 8756 AutoTimer.h:143] Config was already fresh in 20.96 us
I0518 18:06:30.658742 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:35.659106 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:40.653894 8756 AutoTimer.h:143] Config was already fresh in 21.59 us
I0518 18:06:40.659466 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:45.659816 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:50.654094 8756 AutoTimer.h:143] Config was already fresh in 20.96 us
I0518 18:06:50.660179 8761 Bistro.cpp:182] Idle wait...
I0518 18:06:55.661064 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:00.654352 8756 AutoTimer.h:143] Config was already fresh in 22.81 us
I0518 18:07:00.661697 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:05.662623 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:10.654552 8756 AutoTimer.h:143] Config was already fresh in 20.71 us
I0518 18:07:10.662998 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:15.663394 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:18.289851 8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 47.43 us
I0518 18:07:18.349030 8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589818038
I0518 18:07:20.654755 8756 AutoTimer.h:143] Config was already fresh in 20.81 us
I0518 18:07:20.663755 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:25.664116 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:30.654960 8756 AutoTimer.h:143] Config was already fresh in 20.91 us
I0518 18:07:30.664467 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:35.664829 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:40.655169 8756 AutoTimer.h:143] Config was already fresh in 20.66 us
I0518 18:07:40.665220 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:45.665578 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:50.655923 8756 AutoTimer.h:143] Config was already fresh in 29.84 us
I0518 18:07:50.666440 8761 Bistro.cpp:182] Idle wait...
I0518 18:07:55.666800 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:00.656145 8756 AutoTimer.h:143] Config was already fresh in 21.33 us
I0518 18:08:00.667163 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:05.667513 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:10.656348 8756 AutoTimer.h:143] Config was already fresh in 22.24 us
I0518 18:08:10.667873 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:15.668234 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:18.299465 8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 47.73 us
I0518 18:08:18.393492 8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589818098
I0518 18:08:20.656545 8756 AutoTimer.h:143] Config was already fresh in 18.3 us
I0518 18:08:20.668592 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:25.668946 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:30.657305 8756 AutoTimer.h:143] Config was already fresh in 30.63 us
I0518 18:08:30.669996 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:35.670406 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:40.657521 8756 AutoTimer.h:143] Config was already fresh in 21.37 us
I0518 18:08:40.671245 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:45.671584 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:50.657724 8756 AutoTimer.h:143] Config was already fresh in 20.42 us
I0518 18:08:50.671943 8761 Bistro.cpp:182] Idle wait...
I0518 18:08:55.672304 8761 Bistro.cpp:182] Idle wait...
I0518 18:09:00.657964 8756 AutoTimer.h:143] Config was already fresh in 22.51 us
I0518 18:09:00.672636 8761 Bistro.cpp:182] Idle wait...
I0518 18:09:05.672991 8761 Bistro.cpp:182] Idle wait...
Well, the above suggests that I'm wrong. Since you have a working setup, do you mind testing out this logging patch to try to root-cause this? I note that Bistro's HTTP UI is missing a handler for querying worker resources, I just never got around to adding it before being pulled to other projects. If we had that, it would shed light on what the scheduler is seeing. As it stands, I think this is the minimal informative patch:
diff --git a/bistro/bistro/worker/BistroWorkerHandler.cpp b/bistro/bistro/worker/BistroWorkerHandler.cpp
--- a/bistro/bistro/worker/BistroWorkerHandler.cpp
+++ b/bistro/bistro/worker/BistroWorkerHandler.cpp
@@ -676,8 +676,13 @@
// The monitor is created during the first healthcheck.
if (usablePhysicalResources_.monitor_) {
try {
- worker.usableResources =
- *usablePhysicalResources_.monitor_->getDataOrThrow();
+ auto new_rsrc = *usablePhysicalResources_.monitor_->getDataOrThrow();
+ if (new_rsrc != worker.usableResources) {
+ LOG(WARNING) << "Usable resources changed from "
+ << debugString(worker.usableResources) << " to "
+ << debugString(new_rsrc);
+ }
+ worker.usableResources = std::move(new_rsrc);
} catch (const std::exception& ex) {
LOG(WARNING) << "Failed to refresh worker's usable physical "
<< "resources: " << ex.what();
It compiles & passes tests so it's probably fine :D
I'll hold off merging it on trunk, because I have a sense there'll be more of these by the time we're done.
Thanks for your patience!
EDIT: By the way, I will try to find time to repro & investigate this locally, but it might not be until next week, and I don't want to block you that long.
Thank you very much for your time.
I modified BistroWorkerHandler.cpp
and I got the same output from the scheduler and this one from the worker:
W0519 08:57:42.489030 12017 server_socket.cpp:88] Found no 10 interfaces that are not link-local or loopback [356/471]I0519 08:57:42.490803 12017 LogWriter.cpp:77] Created table stderr
I0519 08:57:42.490844 12017 LogWriter.cpp:77] Created table stdout
I0519 08:57:42.490882 12017 LogWriter.cpp:77] Created table statuses
I0519 08:57:42.491230 12018 AutoTimer.h:143] Pruned logs with cutoff 1587279462 in 70.14 us
I0519 08:57:43.794057 12017 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker {
1: shard (string) = "alphacar-trx40-ip39",
2: machineLock (struct) = MachinePortLock {
1: hostname (string) = "alphacar-trx40-ip39",
2: port (i32) = 36603,
},
3: addr (struct) = ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 36603,
},
4: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589871463,
2: rand (i64) = 7051274761317265031,
},
5: heartbeatPeriodSec (i32) = 15,
6: protocolVersion (i16) = 2,
7: usableResources (struct) = UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
},
}
W0519 08:57:43.798570 12023 BistroWorkerHandler.cpp:665] Waiting for this worker to start listening on ServiceAddress {
1: ip_or_host (string) = "10.203.78.195",
2: port (i32) = 36603,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0519 08:57:44.802875 12159 AutoTimer.h:143] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 200 ns
I0519 08:57:44.803014 12159 LogWriter.cpp:218] Got 0 statuses lines
I0519 08:57:44.809809 12023 BistroWorkerHandler.cpp:703] Connected to new scheduler SchedulerHeartbeatResponse {
1: id (struct) = BistroInstanceID {
1: startTime (i64) = 1589871451,
2: rand (i64) = 2514503154668762103,
},
2: maxHealthcheckGap (i32) = 125,
3: heartbeatGracePeriod (i32) = 60,
4: loseUnhealthyWorkerAfter (i32) = 500,
5: workerCheckInterval (i32) = 5,
6: workerState (i32) = 0,
7: protocolVersion (i16) = 2,
8: workerSetID (struct) = WorkerSetID {
1: schedulerID (struct) = BistroInstanceID {
1: startTime (i64) = 1589871451,
2: rand (i64) = 2514503154668762103,
},
2: version (i64) = 1,
3: hash (struct) = WorkerSetHash {
1: startTime (struct) = SetHash {
1: addAll (i64) = 1589871463,
2: xorAll (i64) = 1589871463,
},
}, [300/471] 2: rand (struct) = SetHash {
1: addAll (i64) = 7051274761317265031,
2: xorAll (i64) = 7051274761317265031,
},
3: numWorkers (i32) = 1,
},
},
}
W0519 08:57:44.811563 12159 BistroWorkerHandler.cpp:279] CGroups set: CGroupOptions {
1: root (string) = "",
2: slice (string) = "",
3: subsystems (list) = list<string>[0] {
},
4: unitTestCreateFiles (bool) = false,
5: cpuShares (i16) = 0,
6: memoryLimitInBytes (i64) = 0,
7: killWithoutFreezer (bool) = false,
}
I0519 08:57:44.836437 12159 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589871464
I0519 08:57:44.848788 12293 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589871464 quit with status '{"time":1589871464,"result_bits":4}'
W0519 08:57:59.810143 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
} to UsablePhysicalResources {
1: msSinceEpoch (i64) = 1589871464811,
2: cpuCores (double) = 128.0,
3: memoryMB (double) = 128737.11328125,
4: gpus (list) = list<struct>[3] {
[0] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:01:00.0",
3: memoryMB (double) = 7979.0,
4: compute (double) = 1.0,
},
[1] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:21:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
[2] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:4B:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
},
}
I0519 08:58:00.799579 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus. [248/471]I0519 08:58:01.799788 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:02.799989 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:03.800141 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:04.800941 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:05.801256 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:06.801419 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:07.802026 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:08.802863 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:09.803040 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:10.803241 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:11.803947 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:12.804159 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:13.804358 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:14.804560 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
W0519 08:58:14.812839 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
} to UsablePhysicalResources {
1: msSinceEpoch (i64) = 1589871464811,
2: cpuCores (double) = 128.0,
3: memoryMB (double) = 128737.11328125,
4: gpus (list) = list<struct>[3] {
[0] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:01:00.0",
3: memoryMB (double) = 7979.0,
4: compute (double) = 1.0,
},
[1] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:21:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
[2] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:4B:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
},
}
I0519 08:58:14.815449 12023 BistroWorkerHandler.cpp:635] Became healthy
W0519 08:58:29.815603 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
1: msSinceEpoch (i64) = 0,
2: cpuCores (double) = 0.0,
3: memoryMB (double) = 0.0,
4: gpus (list) = list<struct>[0] {
},
} to UsablePhysicalResources {
1: msSinceEpoch (i64) = 1589871464811,
2: cpuCores (double) = 128.0,
3: memoryMB (double) = 128737.11328125,
4: gpus (list) = list<struct>[3] {
[0] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:01:00.0",
3: memoryMB (double) = 7979.0,
4: compute (double) = 1.0,
},
[1] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:21:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
[2] = GPUInfo {
1: name (string) = "GeForce RTX 2080 SUPER",
2: pciBusID (string) = "00000000:4B:00.0",
3: memoryMB (double) = 7982.0,
4: compute (double) = 1.0,
},
},
}
[...]
Looks like worker.usableResources
is never updated...
I got the "state changed" part of the logging wrong, it shouldn't really matter what's in the worker_
member, it only matters what the scheduler sees. That said, updating worker_.usableResources
should be harmless, too, and would make the logging work as expected.
So then I think the follow-up question is what the scheduler thinks the resources are, which would answer why it can't schedule. I'll try to post a second patch soon.
I have been able to run this simple example. It looks like the mistake was in the configuration file...
I read into comment line 218 of Config.cpp
that was saying that worker level is by default at the bottom level and I just changed innodes
field: "levels": ["worker", "level1", "level2"],
to "levels": ["level1", "level2", "worker"]
.
I don't really understand why it should be like this but still, it seems to be working!
Thank you for your help!
EDIT: looks like my mistake was to put the "worker" level in the levels list. Setting the list as the default value "levels": ["level1", "level2"],
is fine.
Thanks for the update! I think Bistro is missing some error-checking around "worker" being present in the levels list. Sorry about this. It's been a while since I stared at a config, so I just trusted that mistakes like this would be caught automatically.
But clearly, this code is not that smart:
https://github.com/facebook/bistro/blob/master/bistro/config/Config.cpp#L193
I'll put up a patch to check for this error condition so nobody else has to wrangle with this.
Hi,
I'm trying to use bistro Discovering available physical resources capability in the toy example of the README without success. I run into two problems:
Bistro seems not to find any of the resources of my computer:
Running:
Among other, get this output (worker terminal):
It may be only some synchronization error as I am able to get my real computer configuration by running the following commands (the order is important): run
bistro_scheduler
then runbistro_worker
then killbistro_scheduler
then runbistro_scheduler
again (scheduler terminal):I'm not able to properly configure the
bistro_settings
with physical resourcesMy setting file is the following:
So if I had well understood the documentation all of my node will require 1 cpu to be run by the simple_job. And I expect that the limit of
ram
,cpu
andgpu
will be updated accordingly to the worker capability. But when I'm running the simple example with this configuration file, no jobs are launched and the scheduler is waiting for a worker with enough capability.I would be very grateful if you can give me some insight for solving my problem.
Nathan