facebookarchive / bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
https://bistro.io
MIT License

Example doesn't work with Docker-based build #19

Closed: k-stanislawek closed this issue 6 years ago

k-stanislawek commented 6 years ago

Hi, thank you for contributing this great tool to GitHub. I couldn't find any similar tool, at least not one as simple.

However, I have a problem running the example program on the Docker build. TL;DR: when trying to connect the worker to the server, I get an error: 111 (Connection refused). I've checked the port (with lsof -i) from the worker's terminal, and the server is indeed listening on 6789 on both IPv4 and IPv6.

First, I had some issues building the "master" branch. IIRC, some build script was calling the thrift1 command instead of /home/install/bin/thrift1. I looked at the issue tracker and found this: https://github.com/facebook/bistro/issues/18 , and I used the commit pointed to there (044cd9f...). It worked: the build finished, and although some tests fail, the binaries were built and they work. For the record, my command for making the Docker image was:

os_image=ubuntu:16.04 gcc_version=5 make_parallelism=2 travis_cache_dir=~/travis_ccache ./fbcode_builder/travis_docker_build.sh &> build_at_$(date +'%Y%m%d_%H%M%S').log

Then I connected to my image (using the instructions from https://github.com/facebook/bistro/blob/master/build/fbcode_builder/README.docker) and tried to run the example from here: https://github.com/facebook/bistro/blob/master/README.md#your-first-bistro-run. I'm running exactly the same commands as in the README, in the directory /home/bistro/bistro, in the same Docker session, using screen, and the worker returns this error:

W0928 12:12:55.081917   157 BistroWorkerHandler.cpp:666] Waiting for this worker to start listening on ServiceAddress {
  1: ip_or_host (string) = "172.17.0.2",
  2: port (i32) = 27182,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused

I wonder whether there's something wrong with my Docker configuration. I installed Docker using this guide: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-16-04

Worker log:

root@cc646d054226:/home/bistro/bistro# ./cmake/Debug/worker/bistro_worker --server_port=27182 --scheduler_host=:: \
>   --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" \
>   --data_dir=/tmp/bistro_worker
W0928 12:25:39.609571   215 server_socket.cpp:90] Found no 10 interfaces that are not link-local or loopback
I0928 12:25:39.612613   215 LogWriter.cpp:79] Created table stderr
I0928 12:25:39.612731   215 LogWriter.cpp:79] Created table stdout
I0928 12:25:39.612826   215 LogWriter.cpp:79] Created table statuses
I0928 12:25:39.613024   217 AutoTimer.h:142] Pruned logs with cutoff 1504009539 in 57.89 us
I0928 12:25:40.873081   215 BistroWorkerHandler.cpp:102] Worker is ready: BistroWorker {
  1: shard (string) = "cc646d054226",
  2: machineLock (struct) = MachinePortLock {
    1: hostname (string) = "cc646d054226",
    2: port (i32) = 27182,
  },
  3: addr (struct) = ServiceAddress {
    1: ip_or_host (string) = "172.17.0.2",
    2: port (i32) = 27182,
  },
  4: id (struct) = BistroInstanceID {
    1: startTime (i64) = 1506601540,
    2: rand (i64) = -6770707008561318671,
  },
  5: heartbeatPeriodSec (i32) = 15,
  6: protocolVersion (i16) = 2,
  7: usableResources (struct) = UsablePhysicalResources {
    1: msSinceEpoch (i64) = 0,
    2: cpuCores (double) = 0,
    3: memoryMB (double) = 0,
    4: gpus (list) = list<struct>[0] {
    },
  },
}
W0928 12:25:40.892567   230 BistroWorkerHandler.cpp:666] Waiting for this worker to start listening on ServiceAddress {
  1: ip_or_host (string) = "172.17.0.2",
  2: port (i32) = 27182,
}: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0928 12:25:41.894337   246 AutoTimer.h:142] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 182 ns
I0928 12:25:41.894436   246 LogWriter.cpp:220] Got 0 statuses lines
E0928 12:25:41.895129   230 BistroWorkerHandler.cpp:754] Unable to send heartbeat to scheduler: Channel is !good()

Scheduler log:

# ./cmake/Debug/server/bistro_scheduler \
  --server_port=6789 --http_server_port=6790 \
  --config_file=scripts/test_configs/simple --clean_statuses \
  --CAUTION_startup_wait_for_workers=1 --instance_node_name=scheduler
I0928 12:26:42.317178   255 AutoTimer.h:142] Read config from /home/bistro/bistro/scripts/test_configs/simple in 106.4 us
I0928 12:26:42.317651   255 AutoTimer.h:142] Parsed config with 1 jobs in 352.2 us
I0928 12:26:42.317860   255 AutoTimer.h:142] Have 7 nodes after manual in 62.42 us
I0928 12:26:42.318045   258 Monitor.cpp:79] Updating monitor histogram (/home/bistro/bistro/monitor/Monitor.cpp:65): Monitor transiently not making a histogram for simple_job since it is not loaded
W0928 12:26:42.318713   260 RemoteWorkerRunner.cpp:93] RemoteWorkerRunner initial wait (/home/bistro/bistro/runners/RemoteWorkerRunner.cpp:79): DANGER! DANGER! Your --CAUTION_startup_wait_for_workers of 1 is lower than the max healthcheck gap of 125, which makes it very likely that you will start second copies of tasks that are already running (unless your heartbeat interval is much smaller). No initial worker set ID consensus. Waiting for all workers to connect before running tasks.
I0928 12:26:42.319443   261 Bistro.cpp:184] Idle wait...

ribaptista commented 6 years ago

I got it to work by changing the worker's parameter --scheduler_host=:: to --scheduler_host=0.0.0.0.

Running netstat -a -p yields

tcp        0      0 *:6789                  *:*                     LISTEN      594/bistro_schedule
tcp        0      0 5e0b197f4df5:27182      *:*                     LISTEN      470/bistro_worker
tcp        0      0 localhost:38286         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38280         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38346         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38308         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38276         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38476         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38304         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38492         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38436         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38406         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38420         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38494         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38412         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38240         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38526         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38536         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38430         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38560         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38376         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38390         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38246         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38342         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38504         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38418         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38468         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38274         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38302         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38542         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38318         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38452         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38544         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38532         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38392         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38266         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38250         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38372         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38394         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38272         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38458         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38262         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38358         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38538         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38294         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38356         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38486         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38326         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38510         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38554         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38498         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38320         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38530         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38312         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38362         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38258         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38502         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38514         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38360         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38480         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38350         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38298         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38336         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38340         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38382         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38556         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38548         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38398         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38482         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38300         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38368         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38314         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38474         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38366         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38484         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38386         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38490         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38388         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38334         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38550         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38426         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38414         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38434         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38370         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38324         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38568         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38252         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38422         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38384         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38364         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38562         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38238         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38380         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38244         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38450         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38462         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38404         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38256         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38428         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38448         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38416         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38378         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38410         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38442         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38282         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38440         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38518         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38352         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38432         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38292         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38330         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38464         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38408         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38508         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38328         localhost:6789          TIME_WAIT   -               
tcp        0      0 5e0b197f4df5:60806      5e0b197f4df5:27182      TIME_WAIT   -               
tcp        0      0 localhost:38456         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38374         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38270         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38402         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38310         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38520         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38264         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38396         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38512         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38496         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38288         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38424         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38400         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38566         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38524         localhost:6789          TIME_WAIT   -               
tcp        0      0 localhost:38470         localhost:6789          TIME_WAIT   -               
tcp6       0      0 [::]:6790               [::]:*                  LISTEN      594/bistro_schedule
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  3      [ ]         STREAM     CONNECTED     35866    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35859    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35861    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24572    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24703    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24566    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24567    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24709    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24574    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35862    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35563    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24710    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35865    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35572    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35858    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35856    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35864    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35562    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     35863    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24720    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35564    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24721    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24716    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24571    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24701    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35571    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     25601    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24575    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24700    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24706    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     25608    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24705    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24699    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24708    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     25609    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24707    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35565    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24704    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     35857    594/bistro_schedule 
unix  3      [ ]         STREAM     CONNECTED     24711    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24717    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24696    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24713    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24712    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24729    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24697    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24730    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24702    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24698    470/bistro_worker   
unix  3      [ ]         STREAM     CONNECTED     24576    470/bistro_worker

It seems bistro_scheduler is not binding to the tcp6 address :: (it binds to tcp *:6789 instead).

I'd appreciate any clarification on this.
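
For readers who want to repeat this check without scanning hundreds of TIME_WAIT rows, here is a minimal Python sketch that keeps only the LISTEN rows of `netstat -a -p` output. The function name `listening_sockets` is hypothetical, and the sample rows are copied from the output above; the column layout is assumed to match that output.

```python
def listening_sockets(netstat_output):
    """Return (proto, local_address, program) for each TCP row in LISTEN state."""
    rows = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed TCP columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if (len(fields) >= 7
                and fields[0] in ("tcp", "tcp6")
                and fields[5] == "LISTEN"):
            rows.append((fields[0], fields[3], fields[6]))
    return rows


# Sample rows copied from the netstat output in this comment.
SAMPLE = """\
tcp        0      0 *:6789                  *:*                     LISTEN      594/bistro_schedule
tcp        0      0 5e0b197f4df5:27182      *:*                     LISTEN      470/bistro_worker
tcp        0      0 localhost:38286         localhost:6789          TIME_WAIT   -
tcp6       0      0 [::]:6790               [::]:*                  LISTEN      594/bistro_schedule
"""

for proto, addr, prog in listening_sockets(SAMPLE):
    print(proto, addr, prog)
```

On the sample, port 6789 shows up only under `tcp`, while port 6790 shows up under `tcp6`, which matches the observation above.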

snarkmaster commented 6 years ago

@ribaptista, I think what might be going on is that more recent Docker versions started disabling IPv6 interfaces in containers by default. (Unrelatedly, Travis disables IPv6 via a kernel flag; see the gory details here: https://github.com/travis-ci/travis-ci/issues/8711#issuecomment-363530825)

As part of my get-Travis-green effort, I added this minimal "enable IPv6 inside the Docker container" gadget to our .travis.yml:

https://github.com/facebook/bistro/commit/7ebf8de61f331e6f198a9ff93ea59df1536e25fe

If you're working with Bistro inside a Docker container, you will probably have to configure Docker's IPv6 to behave as you'd like it to. I'll refer you to the project docs, since I'm not much of a Docker expert.
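
One way to check whether IPv6 is switched off inside a Linux container is to read the kernel's `disable_ipv6` sysctl. The sketch below assumes the standard procfs path `/proc/sys/net/ipv6/conf/all/disable_ipv6`; the function name `ipv6_disabled` is hypothetical, and this is not necessarily what the linked commit does.

```python
from pathlib import Path

# Standard Linux sysctl file; "1" means IPv6 is disabled on all interfaces.
DISABLE_IPV6 = "/proc/sys/net/ipv6/conf/all/disable_ipv6"


def ipv6_disabled(path=DISABLE_IPV6):
    """Return True if IPv6 is disabled, False if enabled, None if unknown."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # File missing or unreadable (e.g. non-Linux host, no IPv6 support).
        return None
    return value == "1"


if __name__ == "__main__":
    messages = {
        True: "IPv6 is disabled",
        False: "IPv6 is enabled",
        None: "IPv6 state unknown",
    }
    print(messages[ipv6_disabled()])
```

If this reports that IPv6 is disabled, Docker's `--sysctl net.ipv6.conf.all.disable_ipv6=0` option to `docker run` is one possible way to turn it back on for a container; check the Docker documentation for your version before relying on it.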

snarkmaster commented 6 years ago

I think there might be a "to do" for Bistro in your less-than-perfect setup experience, which is to make Bistro defaults fall back to IPv4 more gracefully in non-IPv6 environments. But IPv6 is 20 years old now, so I'm probably not going to make time to make it better. I'd take a pull request, though.
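
To illustrate what such a fallback could look like on the client side, here is a minimal Python sketch. It is not Bistro code (Bistro is C++), and the function name `connect_with_fallback` and its default host list are assumptions made for illustration: it tries the IPv6 loopback first and falls back to IPv4 if that connection fails.

```python
import socket


def connect_with_fallback(port, hosts=("::1", "127.0.0.1"), timeout=2.0):
    """Try each host in order; return (socket, host) for the first that connects."""
    last_error = None
    for host in hosts:
        try:
            conn = socket.create_connection((host, port), timeout=timeout)
            return conn, host
        except OSError as exc:
            # Covers "connection refused" as well as hosts with no IPv6 support.
            last_error = exc
    raise ConnectionError(f"no host in {hosts} accepted port {port}") from last_error


if __name__ == "__main__":
    # Demo: an IPv4-only listener, similar to the scheduler socket seen above.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    conn, host = connect_with_fallback(port)
    print("connected via", host)
    conn.close()
    listener.close()
```

Trying the addresses in a fixed order keeps the IPv6-first default while still working where only IPv4 is listening; a real patch would have to decide how this interacts with an explicitly set --scheduler_host.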