facebookarchive / bistro

Bistro is a flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring.
https://bistro.io
MIT License

Worker console shows error with 1000 jobs: Failed to fully qualify local hostname: Bad file descriptor #16

Closed by ghost 5 years ago

ghost commented 7 years ago

With 1000 jobs, the worker console shows errors (although the jobs actually ran to completion):

I0517 09:02:11.589076 11198 TaskSubprocessQueue.cpp:113] Task job993, node1:1495025520 message: {"status":{"result_bits":4},"invocation_rand":6304019436659602736,"event":"got_status","worker_host":"","invocation_start_time":1495026072,"raw_status":"done"}
I0517 09:02:48.227895 11216 BistroWorkerHandler.cpp:285] Queueing healthcheck started at 1495026168
E0517 09:02:48.228618 11191 hostname.cpp:40] Failed to fully qualify local hostname: Bad file descriptor [9]

Python script to generate test configuration:

#!/usr/bin/env python
# Example generator of load test configuration.
# Usage: python genconfig.py [FILE]
# Example: python genconfig.py /etc/bs/config.json
# If FILE is not specified, it outputs to stdout.
# Need to configure the time schedule at the bottom (look for 'Configuration').

import sys

templ = """{
  \"bistro_settings\": {
    \"resources\": {
      \"instance\": {\"concurrency\": {\"limit\": 1000, \"default\": 1}},
      \"level1\": {
        \"my_resource\": {\"limit\": 1000, \"default\": 0}
      }
    },
    \"nodes\": {
      \"levels\": [\"level1\", \"level2\"],
      \"node_sources\": [
        {
          \"source\": \"manual\",
          \"prefs\": {
            \"node1\": []
          }
        },

        {
          \"source\": \"add_time\",
          \"prefs\": {
            \"parent_level\": \"level1\",
            \"schedule\": [
              %s
            ]
          }
        }

      ]
    },
    \"enabled\" : true
  },

  %s
}
"""

def gen_cron_item(month, day_of_month, hour, minute, job_name):
  cron_templ = """
              {
                \"cron\": {
                  \"month\": %d,
                  \"day_of_month\": %d,
                  \"hour\": %d,
                  \"minute\": %d,
                  \"dst_fixes\": [\"unskip\", \"repeat_use_only_early\"]
                },
                \"lifetime\": 6000,
                \"tags\": [\"tag_%s\"]
              }
""" % (month, day_of_month, hour, minute, job_name)
  return cron_templ

def gen_job(job_name):
  job_templ = """
    \"bistro_job->%s\" : {
    \"owner\" : \"test\",
    \"enabled\" : true,
    \"command\" : [\"/etc/bs/job_script.py\"],
    \"priority\": 2,
    \"resources\": {
      \"my_resource\": 1
    },
    \"filters\": {
      \"level2\": {
        \"tag_whitelist\": [\"tag_%s\"]
      }
    }
  }
""" % (job_name, job_name)
  return job_templ

if __name__ == "__main__":
  # Configuration
  month = 5
  day_of_month = 17
  hour = 9
  minute = 27
  num_jobs = 1000

  cron_items = []
  jobs = []
  for i in range(1, num_jobs+1):
    job_name = "job%d" % i
    cron_item = gen_cron_item(month, day_of_month, hour, minute, job_name)
    cron_items.append(cron_item)
    job = gen_job(job_name)
    jobs.append(job)

  config = templ % (', '.join(str(x) for x in cron_items), ', '.join(str(x) for x in jobs))
  if len(sys.argv) > 1:
    file_name = sys.argv[1]
    with open(file_name, 'w') as f:
      f.write(config)
  else:
    print(config)

server.sh

#!/bin/bash
$HOME/src/bistro/bistro/cmake/Debug/server/bistro_scheduler \
  --server_port=6789 \
  --http_server_port=6790 \
  --config_file=/etc/bs/config.json \
  --clean_statuses \
  --CAUTION_startup_wait_for_workers=700 \
  --instance_node_name=scheduler

worker.sh

#!/bin/bash
mkdir -p /tmp/bistro_worker  # no-op if the directory already exists
$HOME/src/bistro/bistro/cmake/Debug/worker/bistro_worker \
  --server_port=27182 \
  --scheduler_host=:: \
  --scheduler_port=6789 \
  --worker_command="/etc/bs/default_task.sh" \
  --data_dir=/tmp/bistro_worker

job_script.py

#!/usr/bin/env python
from __future__ import print_function  # lets print(..., file=...) also work on Python 2
import sys
import json
import time

# args: ScriptPath ShardID NamedPipe JobArgs
print("python job script args: %s" % json.dumps(sys.argv))
print("stderr is logged too", file=sys.stderr)
# Simulate random work
N = 1 * 60  # loop bound; each iteration sleeps 1 s, so about 1 minute in total
for i in range(1,N):
  for j in range(1,10):
    a = 3 + j * 100 / 25
    b = a * a / 2
    c = b * b * b / b + 35
  # print("loop %d" % i) # debug
  time.sleep(1.0)
with open('/tmp/test.log', 'a') as f:
  f.write('done\n')
with open(sys.argv[2], 'w') as f:
  f.write("done")  # Report the task status to Bistro via a named pipe

snarkmaster commented 7 years ago

Ugh, so the "Bad file descriptor" message might be wrong, because getaddrinfo (called on that line 40) returns the error code itself instead of setting errno, except in the EAI_SYSTEM case.

http://man7.org/linux/man-pages/man3/getaddrinfo.3.html

I'll put up this patch to fix the error handling:

if (auto err = getaddrinfo(hostname, nullptr, &hint, &info)) {
  if (err == EAI_SYSTEM) {
    PLOG(ERROR) << "System error qualifying local hostname";
  } else {
    LOG(ERROR) << "Error qualifying local hostname: " << gai_strerror(err);
  }
  return "";
}

If you still have the setup that reproduces this handy, would you mind trying it?
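
For readers who want to see the same distinction outside the C++ code, here is a minimal Python sketch (not part of Bistro). In CPython, socket.getaddrinfo raises socket.gaierror for resolver-level EAI_* codes and a plain OSError for the EAI_SYSTEM case, which mirrors the two branches of the patch. The function name qualify_hostname and its resolver parameter are hypothetical, introduced only so the error paths can be exercised without touching DNS.

```python
import socket


def qualify_hostname(hostname, resolver=socket.getaddrinfo):
    """Return (canonical_name, error_text); one of the two is always empty.

    `resolver` is a hypothetical injection point for testing; by default the
    real socket.getaddrinfo is used.
    """
    try:
        infos = resolver(hostname, None, flags=socket.AI_CANONNAME)
    except socket.gaierror as e:
        # Resolver-level failure: e.errno holds an EAI_* code, not a C errno,
        # so it must not be formatted as if it were one.
        return "", "resolver error %s: %s" % (e.errno, e.strerror)
    except OSError as e:
        # System-level failure (the EAI_SYSTEM branch): errno is meaningful.
        return "", "system error %s: %s" % (e.errno, e.strerror)
    # Each entry is (family, type, proto, canonname, sockaddr); the canonical
    # name is only filled in on the first entry.
    canon = infos[0][3] if infos else ""
    return (canon or hostname), ""


if __name__ == "__main__":
    print(qualify_hostname(socket.gethostname()))
```

socket.gaierror is a subclass of OSError, so it has to be caught first; swapping the two except clauses would report every resolver failure as a system error.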

That said, can you also share your ulimit -n for the maximum number of open FDs?

ghost commented 7 years ago

@snarkmaster, the ulimit -n was 1024. I've also tried 2048, but the same errors still seem to appear.
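
Since ulimit -n set in one shell only applies to processes started from that shell, it can help to check the limit from inside a running process. The following is a minimal sketch, assuming Linux (it reads /proc/self/fd) and Python 3; the helper name fd_usage is hypothetical. It could be called from job_script.py, for example, to see what limit the tasks spawned by the worker actually inherit. It does not inspect the worker process itself.

```python
import os
import resource


def fd_usage():
    """Report this process's open-file limits and current open-FD count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific. The count includes the descriptor that listdir
        # itself opens to read the directory.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # /proc is not available on this platform
    return {"soft": soft, "hard": hard, "open": open_fds}


if __name__ == "__main__":
    print(fd_usage())
```

If the reported soft limit is still 1024 after raising ulimit -n, the process was most likely started from a shell where the new limit was not in effect.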