facebookarchive / bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
https://bistro.io
MIT License
1.03k stars 158 forks source link

Trouble when using host physical resources #40

Closed npiasco closed 4 years ago

npiasco commented 4 years ago

Hi,

I'm trying to use bistro Discovering available physical resources capability in the toy example of the README without success. I run into two problems:

Bistro seems not to find any of the resources of my computer:

Running:

./cmake/Debug/worker/bistro_worker --scheduler_host=:: --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" --data_dir=/tmp/bistro_worker --nvidia_smi=/usr/bin/nvidia-smi

Among other, get this output (worker terminal):

I0515 18:47:10.139637 107952 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker { [...] 7: usableResources (struct) = UsablePhysicalResources { 1: msSinceEpoch (i64) = 0, 2: cpuCores (double) = 0.0, 3: memoryMB (double) = 0.0, 4: gpus (list) = list[0] { }, }, }

It may be only some synchronization error as I am able to get my real computer configuration by running the following commands (the order is important): run bistro_scheduler then run bistro_worker then kill bistro_scheduler then run bistro_scheduler again (scheduler terminal):

I0515 18:12:31.293471 107047 AutoTimer.h:143] Got 0 running tasks from worker BistroWorker { [...] 7: usableResources (struct) = UsablePhysicalResources { 1: msSinceEpoch (i64) = 1589559121458, 2: cpuCores (double) = 128.0, 3: memoryMB (double) = 128737.11328125, 4: gpus (list) = list[3] { [0] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:01:00.0", 3: memoryMB (double) = 7979.0, 4: compute (double) = 1.0, }, [1] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:21:00.0", 3: memoryMB (double) = 7982.0, 4: compute (double) = 1.0, }, [2] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:4B:00.0", 3: memoryMB (double) = 7982.0, 4: compute (double) = 1.0, }, },

I'm not able to properly configure the bistro_settings with physical resources

My setting file is the following:

{

"bistro_settings" : {
  "resources" : {
    "worker" : {
      "ram" : {
        "limit" : 0,
        "default" : 0
      },
      "cpu" : {
        "limit" : 0,
        "default" : 1
      },
      "gpu" : {
        "limit" : 0,
        "default" : 0
      }
    }
  },
  "nodes" : {
    "levels": ["worker", "level1", "level2"],
    "node_sources": [{
      "source": "manual",
      "prefs": {
        "node1": ["node11", "node12"],
        "node2": ["node21", "node22"]
      }
    }]
  },
  "enabled" : true,
  "physical_resources": {
    "ram_mb": {
        "logical_resource": "ram",
        "multiply_logical_by": 1024,
        "physical_reserve_amount": 4096
    },
    "cpu_core": {
        "logical_resource": "cpu",
        "enforcement": "none"
    },
    "gpu_card": {
        "logical_resource": "gpu"
    }
  },
},

"bistro_job->simple_job" : {
  "owner" : "test",
  "enabled" : true
}
}

So if I had well understood the documentation all of my node will require 1 cpu to be run by the simple_job. And I expect that the limit of ram, cpu and gpu will be updated accordingly to the worker capability. But when I'm running the simple example with this configuration file, no jobs are launched and the scheduler is waiting for a worker with enough capability.

I would be very grateful if you can give me some insight for solving my problem.

Nathan

snarkmaster commented 4 years ago

I read through BistroWorkerHandler.{cpp,h} with a bit of attention, and I think what's going on is this. Resource reporting does not yet start when the worker is constructed -- so your log with an empty struct from line 100 is not diagnostic.

For the scheduler to learn the worker's resources, two things must happen:

So looking at your description, I don't see anything that shouldn't work. However, it might take healthcheck_interval + 2 * heartbeat_interval seconds for the scheduler to be able to run anything on the worker.

I might be missing something here.

It would be easier to be sure if you could share the full scheduler & worker logs for the failure scenario.

npiasco commented 4 years ago

Thank you for your quick answer. Here the log of the worker launched with cmd ./cmake/Debug/worker/bistro_worker --scheduler_host=:: --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" --data_dir=/tmp/bistro_worker --nvidia_smi=/usr/bin/nvidia-smi:

W0518 18:05:15.937683  9049 server_socket.cpp:88] Found no 10 interfaces that are not link-local or loopback
I0518 18:05:15.939427  9049 LogWriter.cpp:77] Created table stderr
I0518 18:05:15.939468  9049 LogWriter.cpp:77] Created table stdout
I0518 18:05:15.939507  9049 LogWriter.cpp:77] Created table statuses
I0518 18:05:15.939855  9050 AutoTimer.h:143] Pruned logs with cutoff 1587225915 in 67.98 us
I0518 18:05:17.254093  9049 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker {
  1: shard (string) = "alphacar-trx40-ip39",
  2: machineLock (struct) = MachinePortLock {
    1: hostname (string) = "alphacar-trx40-ip39",
    2: port (i32) = 33237,
  },
  3: addr (struct) = ServiceAddress {
    1: ip_or_host (string) = "10.203.78.195",
    2: port (i32) = 33237,
  },
  4: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589817917,
    2: rand (i64) = 3980752265297529627,
  },
  5: heartbeatPeriodSec (i32) = 15,
  6: protocolVersion (i16) = 2,
  7: usableResources (struct) = UsablePhysicalResources {
    1: msSinceEpoch (i64) = 0,
    2: cpuCores (double) = 0.0,
    3: memoryMB (double) = 0.0,
    4: gpus (list) = list<struct>[0] {
    },
  },
}
W0518 18:05:17.258365  9055 BistroWorkerHandler.cpp:665] Waiting for this worker to start listening on ServiceAddress {
  1: ip_or_host (string) = "10.203.78.195",
  2: port (i32) = 33237,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0518 18:05:18.263059  9191 AutoTimer.h:143] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 300 ns
I0518 18:05:18.263224  9191 LogWriter.cpp:218] Got 0 statuses lines
I0518 18:05:18.270023  9055 BistroWorkerHandler.cpp:698] Connected to new scheduler SchedulerHeartbeatResponse {
  1: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589817900,
    2: rand (i64) = -5012333484077636209,
  },
  2: maxHealthcheckGap (i32) = 125,
  3: heartbeatGracePeriod (i32) = 60,
  4: loseUnhealthyWorkerAfter (i32) = 500,
  5: workerCheckInterval (i32) = 5,
  6: workerState (i32) = 0,
  7: protocolVersion (i16) = 2,
  8: workerSetID (struct) = WorkerSetID {
    1: schedulerID (struct) = BistroInstanceID {
      1: startTime (i64) = 1589817900,
      2: rand (i64) = -5012333484077636209,
    },
    2: version (i64) = 1,
    3: hash (struct) = WorkerSetHash {
      1: startTime (struct) = SetHash {
        1: addAll (i64) = 1589817917,
        2: xorAll (i64) = 1589817917,
      },
      2: rand (struct) = SetHash {
        1: addAll (i64) = 3980752265297529627,
        2: xorAll (i64) = 3980752265297529627,
      },
      3: numWorkers (i32) = 1,
    },
  },
}
W0518 18:05:18.272003  9190 BistroWorkerHandler.cpp:279] CGroups set: CGroupOptions {
  1: root (string) = "",
  2: slice (string) = "",
  3: subsystems (list) = list<string>[0] {
  },
  4: unitTestCreateFiles (bool) = false,
  5: cpuShares (i16) = 0,
  6: memoryLimitInBytes (i64) = 0,
  7: killWithoutFreezer (bool) = false,
}
I0518 18:05:18.295586  9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589817918
I0518 18:05:18.309382  9325 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589817918 quit with status '{"result_bits":4,"time":1589817918}'
I0518 18:05:34.259470  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:35.259667  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:36.259869  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:37.260035  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:38.260329  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:39.260982  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:40.261147  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:41.261797  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:42.262528  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:43.262707  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:44.262872  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:45.263036  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:46.263200  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:47.263365  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:48.263525  9054 BistroWorkerHandler.cpp:786] Will be healthy upon achieving WorkerSetID consensus.
I0518 18:05:48.275070  9055 BistroWorkerHandler.cpp:635] Became healthy
I0518 18:06:18.281457  9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589817978
I0518 18:06:18.292491  9331 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589817978 quit with status '{"result_bits":4,"time":1589817978}'
I0518 18:07:18.291656  9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589818038
I0518 18:07:18.304718  9336 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589818038 quit with status '{"result_bits":4,"time":1589818038}'
I0518 18:08:18.301235  9190 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589818098
I0518 18:08:18.312572  9341 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589818098 quit with status '{"result_bits":4,"time":1589818098}'

Log from the sheduler with command ./cmake/Debug/server/bistro_scheduler --server_port=6789 --http_server_port=6855 --config_file=scripts/test_configs/pys_res --CAUTION_startup_wait_f or_workers=1 --instance_node_name=scheduler --clean_statuses:

I0518 18:05:00.648990  8755 AutoTimer.h:143] Read config from /media/alphacar/XFS/workspace/lib/bistro/bistro/scripts/test_configs/pys_res in 33.18 us
I0518 18:05:00.649794  8755 AutoTimer.h:143] Parsed config with 1 jobs in 736.1 us
I0518 18:05:00.649979  8755 AutoTimer.h:143] Have 7 nodes after manual in 40.5 us
I0518 18:05:00.650238  8758 Monitor.cpp:77] Updating monitor histogram (/media/alphacar/XFS/workspace/lib/bistro/bistro/monitor/Monitor.cpp:63): Monitor transiently not making a histogram for simple_job since it is not loaded
E0518 18:05:00.650840  8755 HTTPMonitorServer.cpp:110] Not enabling HTTPS support because one of the environment variables THRIFT_TLS_CL_CERT_PATH, THRIFT_TLS_CL_KEY_PATH, or THRIFT_TLS_SRV_CA_PATH was not set
W0518 18:05:00.650913  8760 RemoteWorkerRunner.cpp:93] RemoteWorkerRunner initial wait (/media/alphacar/XFS/workspace/lib/bistro/bistro/runners/RemoteWorkerRunner.cpp:79): DANGER! DANGER! Your --CAUTION_startup_wait_for_workers of 1 is l
ower than the max healthcheck gap of 125, which makes it very likely that you will start second copies of tasks that are already running (unless your heartbeat interval is much smaller). No initial worker set ID consensus. Waiting for al
l workers to connect before running tasks.
I0518 18:05:00.651414  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:00.655133  8755 HTTPMonitorServer.cpp:171] Launched HTTP Monitor Server on port 6855, result 0
I0518 18:05:05.651787  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:10.650570  8756 AutoTimer.h:143] Config was already fresh in 21.39 us
I0518 18:05:10.652154  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:15.652770  8761 Bistro.cpp:182] Idle wait...
W0518 18:05:18.268651  8898 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:18.268687  8898 RemoteWorker.cpp:182] Initializing new worker: BistroWorker {
  1: shard (string) = "alphacar-trx40-ip39",
  2: machineLock (struct) = MachinePortLock {
    1: hostname (string) = "alphacar-trx40-ip39",
    2: port (i32) = 33237,
  },
  3: addr (struct) = ServiceAddress {
    1: ip_or_host (string) = "10.203.78.195",
    2: port (i32) = 33237,
  },
  4: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589817917,
    2: rand (i64) = 3980752265297529627,
  },
  5: heartbeatPeriodSec (i32) = 15,
  6: protocolVersion (i16) = 2,
  7: usableResources (struct) = UsablePhysicalResources {
    1: msSinceEpoch (i64) = 0,
    2: cpuCores (double) = 0.0,
    3: memoryMB (double) = 0.0,
    4: gpus (list) = list<struct>[0] {
    },
  },
}
I0518 18:05:18.269150  8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 28.63 us
I0518 18:05:18.272475  8759 AutoTimer.h:143] Got 0 running tasks from worker BistroWorker {
  1: shard (string) = "alphacar-trx40-ip39",
  2: machineLock (struct) = MachinePortLock {
    1: hostname (string) = "alphacar-trx40-ip39",
    2: port (i32) = 33237,
  },
  3: addr (struct) = ServiceAddress {
    1: ip_or_host (string) = "10.203.78.195",
    2: port (i32) = 33237,
  },
  4: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589817917,
    2: rand (i64) = 3980752265297529627,
  },
  5: heartbeatPeriodSec (i32) = 15,
  6: protocolVersion (i16) = 2,
  7: usableResources (struct) = UsablePhysicalResources {
    1: msSinceEpoch (i64) = 0,
    2: cpuCores (double) = 0.0,
    3: memoryMB (double) = 0.0,
    4: gpus (list) = list<struct>[0] {
    },
  },
} in 1.789 ms
I0518 18:05:18.272531  8759 AutoTimer.h:143] Recorded 0 new running tasks for alphacar-trx40-ip39 in 10.25 us
I0518 18:05:19.260581  8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589817918
I0518 18:05:20.650768  8756 AutoTimer.h:143] Config was already fresh in 20.97 us
W0518 18:05:20.652724  8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:20.652770  8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:20.652782  8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:20.653138  8761 Bistro.cpp:182] Idle wait...
W0518 18:05:25.652928  8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:25.652969  8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:25.652981  8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:25.653486  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:30.651010  8756 AutoTimer.h:143] Config was already fresh in 22.61 us
W0518 18:05:30.653170  8760 RemoteWorkers.cpp:366] Not pruning history until 1 workers echo their first WorkerSetID from this scheduler
W0518 18:05:30.653215  8760 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:30.653228  8760 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:30.653877  8761 Bistro.cpp:182] Idle wait...
W0518 18:05:33.272017  8898 RemoteWorkers.cpp:71] Worker alphacar-trx40-ip39 is not yet able to achieve consensus: It has no WorkerSetID
I0518 18:05:33.272078  8898 RemoteWorker.cpp:171] Worker alphacar-trx40-ip39 can be healthy but lacks WorkerSetID consensus
I0518 18:05:35.653426  8760 RemoteWorkers.cpp:135] Worker alphacar-trx40-ip39 has not been healthy, but WorkerSetID consensus allows it.
I0518 18:05:35.653462  8760 RemoteWorker.cpp:201] Worker alphacar-trx40-ip39 became healthy
I0518 18:05:35.654237  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:40.651409  8756 AutoTimer.h:143] Config was already fresh in 155.3 us
I0518 18:05:40.654633  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:45.655014  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:50.652385  8756 AutoTimer.h:143] Config was already fresh in 21.33 us
I0518 18:05:50.655412  8761 Bistro.cpp:182] Idle wait...
I0518 18:05:55.655776  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:00.652582  8756 AutoTimer.h:143] Config was already fresh in 22.6 us
I0518 18:06:00.656126  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:05.656486  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:10.653295  8756 AutoTimer.h:143] Config was already fresh in 22.15 us
I0518 18:06:10.656982  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:15.657380  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:18.279641  8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 49.04 us
I0518 18:06:18.302333  8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589817978
I0518 18:06:20.653497  8756 AutoTimer.h:143] Config was already fresh in 20.78 us
I0518 18:06:20.657994  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:25.658383  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:30.653698  8756 AutoTimer.h:143] Config was already fresh in 20.96 us
I0518 18:06:30.658742  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:35.659106  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:40.653894  8756 AutoTimer.h:143] Config was already fresh in 21.59 us
I0518 18:06:40.659466  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:45.659816  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:50.654094  8756 AutoTimer.h:143] Config was already fresh in 20.96 us
I0518 18:06:50.660179  8761 Bistro.cpp:182] Idle wait...
I0518 18:06:55.661064  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:00.654352  8756 AutoTimer.h:143] Config was already fresh in 22.81 us
I0518 18:07:00.661697  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:05.662623  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:10.654552  8756 AutoTimer.h:143] Config was already fresh in 20.71 us
I0518 18:07:10.662998  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:15.663394  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:18.289851  8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 47.43 us
I0518 18:07:18.349030  8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589818038
I0518 18:07:20.654755  8756 AutoTimer.h:143] Config was already fresh in 20.81 us
I0518 18:07:20.663755  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:25.664116  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:30.654960  8756 AutoTimer.h:143] Config was already fresh in 20.91 us
I0518 18:07:30.664467  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:35.664829  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:40.655169  8756 AutoTimer.h:143] Config was already fresh in 20.66 us
I0518 18:07:40.665220  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:45.665578  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:50.655923  8756 AutoTimer.h:143] Config was already fresh in 29.84 us
I0518 18:07:50.666440  8761 Bistro.cpp:182] Idle wait...
I0518 18:07:55.666800  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:00.656145  8756 AutoTimer.h:143] Config was already fresh in 21.33 us
I0518 18:08:00.667163  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:05.667513  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:10.656348  8756 AutoTimer.h:143] Config was already fresh in 22.24 us
I0518 18:08:10.667873  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:15.668234  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:18.299465  8898 AutoTimer.h:143] Sent healthchecks to 1 workers in 47.73 us
I0518 18:08:18.393492  8898 RemoteWorker.cpp:279] Good healthcheck from alphacar-trx40-ip39 sent 1589818098
I0518 18:08:20.656545  8756 AutoTimer.h:143] Config was already fresh in 18.3 us
I0518 18:08:20.668592  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:25.668946  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:30.657305  8756 AutoTimer.h:143] Config was already fresh in 30.63 us
I0518 18:08:30.669996  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:35.670406  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:40.657521  8756 AutoTimer.h:143] Config was already fresh in 21.37 us
I0518 18:08:40.671245  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:45.671584  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:50.657724  8756 AutoTimer.h:143] Config was already fresh in 20.42 us
I0518 18:08:50.671943  8761 Bistro.cpp:182] Idle wait...
I0518 18:08:55.672304  8761 Bistro.cpp:182] Idle wait...
I0518 18:09:00.657964  8756 AutoTimer.h:143] Config was already fresh in 22.51 us
I0518 18:09:00.672636  8761 Bistro.cpp:182] Idle wait...
I0518 18:09:05.672991  8761 Bistro.cpp:182] Idle wait...
snarkmaster commented 4 years ago

Well, the above suggests that I'm wrong. Since you have a working setup, do you mind testing out this logging patch to try to root-cause this? I note that Bistro's HTTP UI is missing a handler for querying worker resources, I just never got around to adding it before being pulled to other projects. If we had that, it would shed light on what the scheduler is seeing. As it stands, I think this is the minimal informative patch:

diff --git a/bistro/bistro/worker/BistroWorkerHandler.cpp b/bistro/bistro/worker/BistroWorkerHandler.cpp
--- a/bistro/bistro/worker/BistroWorkerHandler.cpp
+++ b/bistro/bistro/worker/BistroWorkerHandler.cpp
@@ -676,8 +676,13 @@
       // The monitor is created during the first healthcheck.
       if (usablePhysicalResources_.monitor_) {
         try {
-          worker.usableResources =
-            *usablePhysicalResources_.monitor_->getDataOrThrow();
+          auto new_rsrc = *usablePhysicalResources_.monitor_->getDataOrThrow();
+          if (new_rsrc != worker.usableResources) {
+            LOG(WARNING) << "Usable resources changed from "
+              << debugString(worker.usableResources) << " to "
+              << debugString(new_rsrc);
+          }
+          worker.usableResources = std::move(new_rsrc);
         } catch (const std::exception& ex) {
           LOG(WARNING) << "Failed to refresh worker's usable physical "
             << "resources: " << ex.what();

It compiles & passes tests so it's probably fine :D

I'll hold off merging it on trunk, because I have a sense there'll be more of these by the time we're done.

Thanks for your patience!

EDIT: By the way, I will try to find time to repro & investigate this locally, but it might not be until next week, and I don't want to block you that long.

npiasco commented 4 years ago

Thank you very much for your time. I modified BistroWorkerHandler.cpp and I got the same output from the scheduler and this one from the worker:

W0519 08:57:42.489030 12017 server_socket.cpp:88] Found no 10 interfaces that are not link-local or loopback                                                                                [356/471]I0519 08:57:42.490803 12017 LogWriter.cpp:77] Created table stderr
I0519 08:57:42.490844 12017 LogWriter.cpp:77] Created table stdout
I0519 08:57:42.490882 12017 LogWriter.cpp:77] Created table statuses
I0519 08:57:42.491230 12018 AutoTimer.h:143] Pruned logs with cutoff 1587279462 in 70.14 us
I0519 08:57:43.794057 12017 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker {
  1: shard (string) = "alphacar-trx40-ip39",
  2: machineLock (struct) = MachinePortLock {
    1: hostname (string) = "alphacar-trx40-ip39",
    2: port (i32) = 36603,
  },
  3: addr (struct) = ServiceAddress {
    1: ip_or_host (string) = "10.203.78.195",
    2: port (i32) = 36603,
  },
  4: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589871463,
    2: rand (i64) = 7051274761317265031,
  },
  5: heartbeatPeriodSec (i32) = 15,
  6: protocolVersion (i16) = 2,
  7: usableResources (struct) = UsablePhysicalResources {
    1: msSinceEpoch (i64) = 0,
    2: cpuCores (double) = 0.0,
    3: memoryMB (double) = 0.0,
    4: gpus (list) = list<struct>[0] {
    },
  },
}
W0519 08:57:43.798570 12023 BistroWorkerHandler.cpp:665] Waiting for this worker to start listening on ServiceAddress {
  1: ip_or_host (string) = "10.203.78.195",
  2: port (i32) = 36603,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0519 08:57:44.802875 12159 AutoTimer.h:143] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 200 ns
I0519 08:57:44.803014 12159 LogWriter.cpp:218] Got 0 statuses lines
I0519 08:57:44.809809 12023 BistroWorkerHandler.cpp:703] Connected to new scheduler SchedulerHeartbeatResponse {
  1: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1589871451,
    2: rand (i64) = 2514503154668762103,
  },
  2: maxHealthcheckGap (i32) = 125,
  3: heartbeatGracePeriod (i32) = 60,
  4: loseUnhealthyWorkerAfter (i32) = 500,
  5: workerCheckInterval (i32) = 5,
  6: workerState (i32) = 0,
  7: protocolVersion (i16) = 2,
  8: workerSetID (struct) = WorkerSetID {
    1: schedulerID (struct) = BistroInstanceID {
      1: startTime (i64) = 1589871451,
      2: rand (i64) = 2514503154668762103,
    },
    2: version (i64) = 1,
    3: hash (struct) = WorkerSetHash {
      1: startTime (struct) = SetHash {
        1: addAll (i64) = 1589871463,
        2: xorAll (i64) = 1589871463,
      },
      },                                                                                                                                                                                    [300/471]      2: rand (struct) = SetHash {
        1: addAll (i64) = 7051274761317265031,
        2: xorAll (i64) = 7051274761317265031,
      },
      3: numWorkers (i32) = 1,
    },
  },
}
W0519 08:57:44.811563 12159 BistroWorkerHandler.cpp:279] CGroups set: CGroupOptions {
  1: root (string) = "",
  2: slice (string) = "",
  3: subsystems (list) = list<string>[0] {
  },
  4: unitTestCreateFiles (bool) = false,
  5: cpuShares (i16) = 0,
  6: memoryLimitInBytes (i64) = 0,
  7: killWithoutFreezer (bool) = false,
}
I0519 08:57:44.836437 12159 BistroWorkerHandler.cpp:295] Queueing healthcheck started at 1589871464
I0519 08:57:44.848788 12293 TaskSubprocessQueue.cpp:306] Healthcheck started at 1589871464 quit with status '{"time":1589871464,"result_bits":4}'
W0519 08:57:59.810143 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
  1: msSinceEpoch (i64) = 0,
  2: cpuCores (double) = 0.0,
  3: memoryMB (double) = 0.0,
  4: gpus (list) = list<struct>[0] {
  },
} to UsablePhysicalResources {
  1: msSinceEpoch (i64) = 1589871464811,
  2: cpuCores (double) = 128.0,
  3: memoryMB (double) = 128737.11328125,
  4: gpus (list) = list<struct>[3] {
    [0] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:01:00.0",
      3: memoryMB (double) = 7979.0,
      4: compute (double) = 1.0,
    },
    [1] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:21:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
    [2] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:4B:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
  },
}
I0519 08:58:00.799579 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.                                                                              [248/471]I0519 08:58:01.799788 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:02.799989 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:03.800141 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:04.800941 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:05.801256 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:06.801419 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:07.802026 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:08.802863 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:09.803040 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:10.803241 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:11.803947 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:12.804159 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:13.804358 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
I0519 08:58:14.804560 12022 BistroWorkerHandler.cpp:791] Will be healthy upon achieving WorkerSetID consensus.
W0519 08:58:14.812839 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
  1: msSinceEpoch (i64) = 0,
  2: cpuCores (double) = 0.0,
  3: memoryMB (double) = 0.0,
  4: gpus (list) = list<struct>[0] {
  },
} to UsablePhysicalResources {
  1: msSinceEpoch (i64) = 1589871464811,
  2: cpuCores (double) = 128.0,
  3: memoryMB (double) = 128737.11328125,
  4: gpus (list) = list<struct>[3] {
    [0] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:01:00.0",
      3: memoryMB (double) = 7979.0,
      4: compute (double) = 1.0,
    },
    [1] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:21:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
    [2] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:4B:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
  },
}
I0519 08:58:14.815449 12023 BistroWorkerHandler.cpp:635] Became healthy
W0519 08:58:29.815603 12023 BistroWorkerHandler.cpp:681] Usable resources changed from UsablePhysicalResources {
  1: msSinceEpoch (i64) = 0,
  2: cpuCores (double) = 0.0,
  3: memoryMB (double) = 0.0,
  4: gpus (list) = list<struct>[0] {
  },
} to UsablePhysicalResources {
  1: msSinceEpoch (i64) = 1589871464811,
  2: cpuCores (double) = 128.0,
  3: memoryMB (double) = 128737.11328125,
  4: gpus (list) = list<struct>[3] {
    [0] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:01:00.0",
      3: memoryMB (double) = 7979.0,
      4: compute (double) = 1.0,
    },
    [1] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:21:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
    [2] = GPUInfo {
      1: name (string) = "GeForce RTX 2080 SUPER",
      2: pciBusID (string) = "00000000:4B:00.0",
      3: memoryMB (double) = 7982.0,
      4: compute (double) = 1.0,
    },
  },
}
[...]

Looks like worker.usableResources is never updated...

snarkmaster commented 4 years ago

I got the "state changed" part of the logging wrong, it shouldn't really matter what's in the worker_ member, it only matters what the scheduler sees. That said, updating worker_.usableResources should be harmless, too, and would make the logging work as expected.

So then I think the follow-up question is what the scheduler thinks the resources are, which would answer why it can't schedule. I'll try to post a second patch soon.

npiasco commented 4 years ago

I have been able to run this simple example. It looks like the mistake was in the configuration file...

I read into comment line 218 of Config.cpp that was saying that worker level is by default at the bottom level and I just changed innodes field: "levels": ["worker", "level1", "level2"], to "levels": ["level1", "level2", "worker"]. I don't really understand why it should be like this but still, it seems to be working! Thank you for your help!

EDIT: looks like my mistake was to put the "worker" level in the levels list. Setting the list as the default value "levels": ["level1", "level2"], is fine.

snarkmaster commented 4 years ago

Thanks for the update! I think Bistro is missing some error-checking around "worker" being present in the levels list. Sorry about this. It's been a while since I stared at a config, so I just trusted that mistakes like this would be caught automatically.

But clearly, this code is not that smart:

https://github.com/facebook/bistro/blob/master/bistro/config/Config.cpp#L193

I'll put up a patch to check for this error condition so nobody else has to wrangle with this.