Graylog2 / graylog-docker

Official Graylog Docker image
https://hub.docker.com/r/graylog/graylog/
Apache License 2.0

Unable to deploy latest graylog/graylog:3.1 image as part of a docker swarm stack #98

Closed cllopes closed 4 years ago

cllopes commented 4 years ago

Prior to the latest update to the graylog/graylog:3.1 image (sha256-1e38a891067041461201e910cf2d2e85a89416fdeb938475bc5d6fc12f1385db) we were able to deploy graylog along with mongo (mongo:3) and elasticsearch (docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.2) as part of a docker swarm stack without issue.

After the image push on 10/21/2019 we are encountering an issue where the service never converges.

The graylog task appears to start correctly:

2019-10-21 21:27:46,721 INFO : org.graylog2.shared.initializers.JerseyService - Started REST API at <0.0.0.0:9000>
2019-10-21 21:27:46,722 INFO : org.graylog2.shared.initializers.ServiceManagerListener - Services are healthy
2019-10-21 21:27:46,723 INFO : org.graylog2.shared.initializers.InputSetupService - Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2019-10-21 21:27:46,726 INFO : org.graylog2.bootstrap.ServerBootstrap - Services started, startup times in ms: {GracefulShutdownService [RUNNING]=17, OutputSetupService [RUNNING]=36, BufferSynchronizerService [RUNNING]=58, EtagService [RUNNING]=60, KafkaJournal [RUNNING]=60, JobSchedulerService [RUNNING]=170, ConfigurationEtagService [RUNNING]=172, StreamCacheService [RUNNING]=174, JournalReader [RUNNING]=185, PeriodicalsService [RUNNING]=188, LookupTableService [RUNNING]=200, InputSetupService [RUNNING]=203, MongoDBProcessingStatusRecorderService [RUNNING]=214, JerseyService [RUNNING]=27997}
2019-10-21 21:27:46,728 INFO : org.graylog2.bootstrap.ServerBootstrap - Graylog server up and running.

But after a couple of minutes it fails with exit code 143; a new task starts but eventually fails again. The service continues in this loop.

CONTAINER ID        IMAGE                                                     COMMAND                  CREATED             STATUS                             PORTS                                                                                NAMES
fd92d9ebf7a9        graylog/graylog:3.1                                       "tini -- /docker-ent…"   21 seconds ago      Up 15 seconds (health: starting)   9000/tcp                                                                             graylog_test_graylog.1.iblat5q4b5kk9be8z8892tzgc
06e759d5ebdd        graylog/graylog:3.1                                       "tini -- /docker-ent…"   2 minutes ago       Exited (143) 21 seconds ago                                                                                             graylog_test_graylog.1.qc41dlyd2w7dstnuo031amkv1
f1e8b4dbeec8        graylog/graylog:3.1                                       "tini -- /docker-ent…"   4 minutes ago       Exited (143) 2 minutes ago                                                                                              graylog_test_graylog.1.lh80liw2cbd0rki66zboisu5e
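Exit status 143 conventionally means 128 + 15, i.e. the process was terminated by SIGTERM. That would be consistent with something outside the container stopping the task rather than Graylog crashing on its own. A minimal sketch that reproduces the status in a plain shell, using a throwaway `sleep` process as a stand-in for the container (an assumption for illustration only):

```shell
#!/usr/bin/env bash
# Start a long-running stand-in process in the background.
sleep 30 &
pid=$!

# Send SIGTERM (signal 15), as a stop request would.
kill -TERM "$pid"

# For a child terminated by a signal, wait returns 128 + the signal number.
wait "$pid"
status=$?
echo "exit status: $status"
```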

We were able to reproduce this issue as well as a working scenario with the previous image digest using the following docker-compose files.

Both were started using:

docker stack deploy -c docker-compose.yml graylog_test

Not Working (using current image)

version: '3.7'
services:
  mongo:
    image: "mongo:3"
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.2
  graylog:
    image: graylog/graylog:3.1

Working (using previous digest)

version: '3.7'
services:
  mongo:
    image: "mongo:3"
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.2
  graylog:
    image: graylog/graylog@sha256:bae78cd93fcb1ce2ce6dfc572e85103935dc889286de5f5257a3fe1147d4ccd2

docker-compose is able to successfully start the graylog server with both files so the issue seems specific to docker swarm stacks.
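One difference that may matter here: swarm mode replaces a task whose container health check keeps failing, while a standalone container is only marked unhealthy and keeps running. A small sketch for looking at what the health check reports; `health_report` is a hypothetical helper name, and it relies only on `docker inspect` with a Go-template `--format`:

```shell
#!/usr/bin/env bash
# health_report CONTAINER
# Hypothetical helper: prints the container's health state as JSON,
# including the status, the failing streak, and the output of recent probes.
health_report() {
  docker inspect --format '{{json .State.Health}}' "$1"
}
```

It could be called as `health_report fd92d9ebf7a9` while the task is still in the "health: starting" state.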

Any ideas?

Happy to provide more debug details.

jalogisch commented 4 years ago

@cllopes sorry that you ran into this issue with the new image. We always try to test as much as possible, but we cannot cover every possible setup, especially in Docker environments.

The only change in this image is the addition of tini, as you can see.

I just found a note in the tini README, but that would indicate that your Graylog is dying anyway.

Could you attach a complete log, from start until the container dies, to this ticket so that we can look into it?

Thank you.

knopwob commented 4 years ago

I think I am running into the same problem, and it is caused by the health_check script.

current container:

graylog@logging-graylog:~$ bash -x /health_check.sh
+ source /etc/profile
+++ id -u
++ '[' 1100 -eq 0 ']'
++ PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
++ export PATH
++ '[' '' ']'
++ '[' -d /etc/profile.d ']'
++ for i in /etc/profile.d/*.sh
++ '[' -r /etc/profile.d/graylog.sh ']'
++ . /etc/profile.d/graylog.sh
+++ export JAVA_HOME=/usr/local/openjdk-8
+++ JAVA_HOME=/usr/local/openjdk-8
+++ export BUILD_DATE=
+++ BUILD_DATE=
+++ export GRAYLOG_VERSION=3.1.2
+++ GRAYLOG_VERSION=3.1.2
+++ export 'GRAYLOG_SERVER_JAVA_OPTS=-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
+++ GRAYLOG_SERVER_JAVA_OPTS='-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
+++ export GRAYLOG_HOME=/usr/share/graylog
+++ GRAYLOG_HOME=/usr/share/graylog
+++ export GRAYLOG_USER=graylog
+++ GRAYLOG_USER=graylog
+++ export GRAYLOG_GROUP=graylog
+++ GRAYLOG_GROUP=graylog
+++ export GRAYLOG_UID=1100
+++ GRAYLOG_UID=1100
+++ export GRAYLOG_GID=1100
+++ GRAYLOG_GID=1100
+++ export PATH=/usr/share/graylog/bin:/usr/local/openjdk-8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+++ PATH=/usr/share/graylog/bin:/usr/local/openjdk-8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
++ unset i
+ proto=http
+ http_bind_address=127.0.0.1:9000
+ [[ -f /usr/share/graylog/data/config/graylog.conf ]]
++ grep '^http_publish_uri' /usr/share/graylog/data/config/graylog.conf
++ awk -F = '{print $2}'
++ awk '{$1=$1};1'
+ http_publish_uri=
++ grep '^http_bind_address' /usr/share/graylog/data/config/graylog.conf
++ awk -F = '{print $2}'
++ awk '{$1=$1};1'
+ http_bind_address=0.0.0.0:9000
++ grep '^http_enable_tls' /usr/share/graylog/data/config/graylog.conf
++ awk -F = '{print $2}'
++ awk '{$1=$1};1'
+ http_enable_tls=
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z 0.0.0.0:9000 ]]
+ check_url=http://0.0.0.0:9000
+ [[ ! -z '' ]]
+ echo 'not possible to get Graylog listen URI - abort'
not possible to get Graylog listen URI - abort
+ exit 1
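The extraction steps in the trace above can be reproduced in isolation. The sketch below uses a throwaway config file whose contents are an assumption (publish URI commented out, bind address set) and a hypothetical `get_setting` helper wrapping the same grep/awk pipeline. It shows why `http_publish_uri` ends up empty while `http_bind_address` does not; the script then appears to abort on an empty value even though `check_url` was already set.

```shell
#!/usr/bin/env bash
# Throwaway config mirroring the relevant settings (assumed contents).
conf="$(mktemp)"
cat > "$conf" <<'EOF'
#http_publish_uri = http://192.168.1.1:9000/
http_bind_address = 0.0.0.0:9000
EOF

# Same pipeline as in the trace: keep the text after '=', then trim whitespace.
get_setting() {
  grep "^$1" "$conf" | awk -F = '{print $2}' | awk '{$1=$1};1'
}

http_publish_uri="$(get_setting http_publish_uri)"
http_bind_address="$(get_setting http_bind_address)"

echo "http_publish_uri='${http_publish_uri}'"
echo "http_bind_address='${http_bind_address}'"
rm -f "$conf"
```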

working container:

graylog@logging-graylog:~$ bash -x /health_check.sh
+ source /etc/profile
+++ id -u
++ '[' 1100 -eq 0 ']'
++ PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
++ export PATH
++ '[' '' ']'
++ '[' -d /etc/profile.d ']'
++ for i in /etc/profile.d/*.sh
++ '[' -r /etc/profile.d/graylog.sh ']'
++ . /etc/profile.d/graylog.sh
+++ export JAVA_HOME=/usr/local/openjdk-8
+++ JAVA_HOME=/usr/local/openjdk-8
+++ export BUILD_DATE=
+++ BUILD_DATE=
+++ export GRAYLOG_VERSION=3.1.2
+++ GRAYLOG_VERSION=3.1.2
+++ export 'GRAYLOG_SERVER_JAVA_OPTS=-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
+++ GRAYLOG_SERVER_JAVA_OPTS='-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
+++ export GRAYLOG_HOME=/usr/share/graylog
+++ GRAYLOG_HOME=/usr/share/graylog
+++ export GRAYLOG_USER=graylog
+++ GRAYLOG_USER=graylog
+++ export GRAYLOG_GROUP=graylog
+++ GRAYLOG_GROUP=graylog
+++ export GRAYLOG_UID=1100
+++ GRAYLOG_UID=1100
+++ export GRAYLOG_GID=1100
+++ GRAYLOG_GID=1100
+++ export PATH=/usr/share/graylog/bin:/usr/local/openjdk-8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+++ PATH=/usr/share/graylog/bin:/usr/local/openjdk-8/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
++ unset i
+ proto=http
+ http_bind_address=127.0.0.1:9000
+ [[ -f /usr/share/graylog/data/config/graylog.conf ]]
++ awk -F = '{print $2}'
++ awk '{$1=$1};1'
++ grep '^http_publish_uri' /usr/share/graylog/data/config/graylog.conf
+ http_publish_uri=
++ grep '^http_bind_address' /usr/share/graylog/data/config/graylog.conf
++ awk '{$1=$1};1'
++ awk -F = '{print $2}'
+ http_bind_address=0.0.0.0:9000
++ grep '^http_enable_tls' /usr/share/graylog/data/config/graylog.conf
++ awk '{$1=$1};1'
++ awk -F = '{print $2}'
+ http_enable_tls=
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z '' ]]
+ [[ ! -z 0.0.0.0:9000 ]]
+ check_url=http://0.0.0.0:9000
+ curl --silent --fail http://0.0.0.0:9000/api
+ exit 0

cllopes commented 4 years ago

Thanks @jalogisch for your quick response and help on this issue!

Attached should be the complete logs for a task that starts up and then exits with 143.

99a193e69a41.txt

I also ran @knopwob's health check command in the container, between the point where the container logs say graylog is running and the point where the task fails, and I see the same error:

+ echo 'not possible to get Graylog listen URI - abort'
not possible to get Graylog listen URI - abort
+ exit 1

jalogisch commented 4 years ago

OK, I have identified the problem; it is in this part

We will need to change the logic - sorry that we did not test that enough to see the problem.
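One way the changed logic could look, as a sketch only and not the actual fix: fall back to the bind address, let the publish URI override it when present, and abort only when neither produced a URL. `pick_check_url` is a hypothetical name, and both address arguments are assumed to be host:port without a scheme.

```shell
#!/usr/bin/env bash
# pick_check_url PROTO BIND_ADDRESS PUBLISH_URI
# Hypothetical helper: prints the URL to probe, or returns 1 if none can be built.
pick_check_url() {
  local proto="$1"
  local bind_address="$2"
  local publish_uri="$3"
  local check_url=""

  # Use the bind address as the fallback.
  if [ -n "$bind_address" ]; then
    check_url="${proto}://${bind_address}"
  fi
  # Prefer the publish URI when it is set.
  if [ -n "$publish_uri" ]; then
    check_url="${proto}://${publish_uri}"
  fi
  # Abort only when neither setting produced a URL.
  if [ -z "$check_url" ]; then
    echo "not possible to get Graylog listen URI - abort" >&2
    return 1
  fi
  echo "$check_url"
}

# Values from the trace: bind address set, publish URI empty.
pick_check_url http 0.0.0.0:9000 ""
```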

gander commented 4 years ago

I have experienced the following bug:

From config/graylog.conf:

#### HTTP publish URI
# Default: http://$http_bind_address/
#http_publish_uri = http://192.168.1.1:9000/

From health_check.sh:

if [[ ! -z "${http_publish_uri}" ]]
then
    check_url="${proto}"://"${http_publish_uri}"
else
    echo "not possible to get Graylog listen URI - abort"
    exit 1
fi

Result:

echo $check_url 
http://http://192.168.1.1:9000/

curl "${check_url}"/api
curl: (6) Could not resolve host: http

gander commented 4 years ago

I used the image graylog/graylog:3.1, which suddenly stopped working for no apparent reason. The health check started reporting problems. In my configuration I had only http_bind_address and http_external_uri set. I began analyzing the health_check.sh file and found the above problem. I had to go back to version graylog/graylog:3.1.2-1 for the application to work again.

jalogisch commented 4 years ago

@gander did you use the image that was created from the tag https://github.com/Graylog2/graylog-docker/releases/tag/3.1.2-3 or the image from https://github.com/Graylog2/graylog-docker/releases/tag/3.1.2-2 ?

The first one should have the fix.

padelt commented 4 years ago

I see the same issue as gander above (https://github.com/Graylog2/graylog-docker/issues/98#issuecomment-545310814) in both https://github.com/Graylog2/graylog-docker/releases/tag/3.1.2-2 and https://github.com/Graylog2/graylog-docker/releases/tag/3.1.2-3 :

root@graylog-master:/usr/share/graylog# grep "^http_publish_uri" "${GRAYLOG_HOME}"/data/config/graylog.conf
# Default: $http_publish_uri
http_publish_uri = http://graylog-master:9000

Instrumented run of /health_check.sh:

root@graylog-master:/usr/share/graylog# /health_check.sh
+ PS4='+(${BASH_SOURCE}:${LINENO}): ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'
+(/health_check.sh:14): proto=http
+(/health_check.sh:15): http_bind_address=127.0.0.1:9000
+(/health_check.sh:18): [[ -f /usr/share/graylog/data/config/graylog.conf ]]
++(/health_check.sh:21): grep '^http_publish_uri' /usr/share/graylog/data/config/graylog.conf
++(/health_check.sh:21): awk -F = '{print $2}'
++(/health_check.sh:21): awk '{$1=$1};1'
+(/health_check.sh:21): http_publish_uri=http://graylog-master:9000
++(/health_check.sh:22): grep '^http_bind_address' /usr/share/graylog/data/config/graylog.conf
++(/health_check.sh:22): awk -F = '{print $2}'
++(/health_check.sh:22): awk '{$1=$1};1'
+(/health_check.sh:22): http_bind_address=0.0.0.0:9000
++(/health_check.sh:23): grep '^http_enable_tls' /usr/share/graylog/data/config/graylog.conf
++(/health_check.sh:23): awk -F = '{print $2}'
++(/health_check.sh:23): awk '{$1=$1};1'
+(/health_check.sh:23): http_enable_tls=
+(/health_check.sh:29): [[ ! -z '' ]]
+(/health_check.sh:40): [[ ! -z '' ]]
+(/health_check.sh:44): [[ ! -z '' ]]
+(/health_check.sh:50): [[ ! -z '' ]]
+(/health_check.sh:55): [[ ! -z 0.0.0.0:9000 ]]
+(/health_check.sh:57): check_url=http://0.0.0.0:9000
+(/health_check.sh:65): [[ ! -z http://graylog-master:9000 ]]
+(/health_check.sh:67): check_url=http://http://graylog-master:9000
+(/health_check.sh:70): [[ -z http://http://graylog-master:9000 ]]
+(/health_check.sh:77): curl --silent --fail http://http://graylog-master:9000/api
+(/health_check.sh:81): exit 1

From a quick look it seems https://github.com/Graylog2/graylog-docker/blob/3.1/health_check.sh#L29-L39 tries to remove the protocol part if GRAYLOG_HTTP_PUBLISH_URI is given (which it is not in my case). Later, http_publish_uri is assumed to not have the protocol part. But if it came from the config file, it will be there and the resulting check_url will have the double protocol problem. Maybe you should reopen the issue, @jalogisch ?
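A possible way to avoid the double protocol, sketched under the assumption that the scheme is always taken from `proto` and never from the configured value: strip any leading `scheme://` from `http_publish_uri` before building `check_url`, regardless of whether the value came from the environment or from the config file. `strip_scheme` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# strip_scheme URI
# Hypothetical helper: drops a leading "scheme://" and one trailing "/" if present.
strip_scheme() {
  local uri="$1"
  uri="${uri#*://}"   # remove everything up to and including the first "://"
  uri="${uri%/}"      # remove one trailing slash
  echo "$uri"
}

proto=http
http_publish_uri="http://graylog-master:9000"
check_url="${proto}://$(strip_scheme "$http_publish_uri")"
echo "$check_url"
```

A value without a scheme, such as `0.0.0.0:9000`, passes through unchanged, so the same helper could be applied to both settings.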

jalogisch commented 4 years ago

Thank you for the honest feedback, @padelt. Please open a new issue for the bug you found in the health-check script.

That behavior has been present since the script was rewritten and was not introduced by the latest changes. It is not wrong to report it here, but it is not connected to this issue.