ckan / ckan-docker

Scripts and images to run CKAN using Docker Compose
98 stars 180 forks source link

Startup failure on RHEL 9.3 #104

Closed ajs6f closed 10 months ago

ajs6f commented 11 months ago

I'm trying to start up on RHEL 9.3 using commit b48be9783fd963f51f45a8cb88aead7bc3162663 and having the same db initialization problem that has been reported here before.

[sorokaa@dt-vmckantest ckan-docker]$ more /etc/redhat-release 
Red Hat Enterprise Linux release 9.3 (Plow)
[sorokaa@dt-vmckantest ~]$ sudo docker --version
Docker version 24.0.7, build afdd53b

First, a section of log to show that I am starting from a clean slate:

[sorokaa@dt-vmckantest ckan-docker]$ sudo docker stop $(sudo docker ps -q); sudo docker rm $(sudo docker ps -aq); sudo docker image rm $(sudo docker image ls -q); sudo docker network rm $(sudo docker network ls -q); sudo docker volume rm $(sudo docker volume ls -q)
"docker stop" requires at least 1 argument.
See 'docker stop --help'.

Usage:  docker stop [OPTIONS] CONTAINER [CONTAINER...]

Stop one or more running containers
"docker rm" requires at least 1 argument.
See 'docker rm --help'.

Usage:  docker rm [OPTIONS] CONTAINER [CONTAINER...]

Remove one or more containers
"docker image rm" requires at least 1 argument.
See 'docker image rm --help'.

Usage:  docker image rm [OPTIONS] IMAGE [IMAGE...]

Remove one or more images
Error response from daemon: bridge is a pre-defined network and cannot be removed
Error response from daemon: host is a pre-defined network and cannot be removed
Error response from daemon: none is a pre-defined network and cannot be removed
"docker volume rm" requires at least 1 argument.
See 'docker volume rm --help'.

Usage:  docker volume rm [OPTIONS] VOLUME [VOLUME...]

Remove one or more volumes
[sorokaa@dt-vmckantest ckan-docker]$ sudo docker system prune -a
WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y
Total reclaimed space: 0B

Now, I try to start the Docker compose assembly. Please notice the lines:

db          | /usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/10_create_ckandb.sh
db          | /usr/local/bin/docker-entrypoint.sh: line 173: /docker-entrypoint-initdb.d/10_create_ckandb.sh: Permission denied

and

db          | 2023-11-17 18:26:09.717 UTC [36] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:09.717 UTC [36] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"

and ultimately

ckan        | connection to server at "db" (192.168.176.2), port 5432 failed: FATAL:  password authentication failed for user "ckandbuser"
ckan        | 
ckan        | [prerun] Unable to connect to the database, waiting...

Is there something I need to do here? I'm honestly not sure what else I can clean or prune. This is all happening on an RHEL 9.3 VM that was imaged just for this task and which has nothing else whatsoever going on.

Here is the full log for the compose up step, as mentioned above:

[sorokaa@dt-vmckantest ckan-docker]$ sudo docker compose up
[skipping logging of image pulls, etc.]
Attaching to ckan, datapusher, db, nginx, redis, solr
db          | The files belonging to this database system will be owned by user "postgres".
db          | This user must also own the server process.
db          | 
db          | The database cluster will be initialized with locale "en_US.utf8".
db          | The default database encoding has accordingly been set to "UTF8".
db          | The default text search configuration will be set to "english".
db          | 
db          | Data page checksums are disabled.
db          | 
db          | fixing permissions on existing directory /var/lib/postgresql/data ... ok
db          | creating subdirectories ... ok
db          | selecting dynamic shared memory implementation ... posix
db          | selecting default max_connections ... 100
db          | selecting default shared_buffers ... 128MB
solr        | Executing /opt/solr/docker/scripts/solr-precreate ckan /opt/solr/server/solr/configsets/ckan
solr        | Executing /opt/solr/docker/scripts/precreate-core ckan /opt/solr/server/solr/configsets/ckan
solr        | Created ckan
solr        | Starting Solr
db          | selecting default time zone ... UTC
db          | creating configuration files ... ok
redis       | 1:C 17 Nov 2023 18:25:34.122 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
redis       | 1:C 17 Nov 2023 18:25:34.122 # Redis version=6.2.14, bits=64, commit=00000000, modified=0, pid=1, just started
redis       | 1:C 17 Nov 2023 18:25:34.122 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
redis       | 1:M 17 Nov 2023 18:25:34.123 * monotonic clock: POSIX clock_gettime
redis       | 1:M 17 Nov 2023 18:25:34.123 * Running mode=standalone, port=6379.
redis       | 1:M 17 Nov 2023 18:25:34.123 # Server initialized
redis       | 1:M 17 Nov 2023 18:25:34.123 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
redis       | 1:M 17 Nov 2023 18:25:34.123 * Ready to accept connections
solr        | Java 17 detected. Enabled workaround for SOLR-16463
solr        | [0.002s][warning][pagesize] UseLargePages disabled, no large pages configured and available on the system.
solr        | CompileCommand: exclude com/github/benmanes/caffeine/cache/BoundedLocalCache.put bool exclude = true
db          | running bootstrap script ... ok
solr        | WARNING: A command line option has enabled the Security Manager
solr        | WARNING: The Security Manager is deprecated and will be removed in a future release
datapusher  | [uWSGI] getting INI configuration from /srv/app/datapusher-uwsgi.ini
datapusher  | *** Starting uWSGI 2.0.19.1 (64bit) on [Fri Nov 17 18:25:34 2023] ***
datapusher  | compiled with version: 10.3.1 20210424 on 05 April 2021 18:18:03
datapusher  | os: Linux-5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 3 11:12:36 EDT 2023
datapusher  | nodename: f3579d66a50d
datapusher  | machine: x86_64
datapusher  | clock source: unix
datapusher  | pcre jit disabled
datapusher  | detected number of CPU cores: 4
datapusher  | current working directory: /srv/app
datapusher  | writing pidfile to /tmp/datapusher-uwsgi.pid
datapusher  | detected binary path: /usr/sbin/uwsgi
datapusher  | your memory page size is 4096 bytes
datapusher  |  *** WARNING: you have enabled harakiri without post buffering. Slow upload could be rejected on post-unbuffered webservers *** 
datapusher  | detected max file descriptor number: 1073741816
datapusher  | lock engine: pthread robust mutexes
datapusher  | thunder lock: disabled (you can enable it with --thunder-lock)
datapusher  | uWSGI http bound on 0.0.0.0:8800 fd 3
datapusher  | uwsgi socket 0 bound to UNIX address /tmp/uwsgi.sock fd 6
datapusher  | Python version: 3.9.16 (main, Dec 10 2022, 13:47:19)  [GCC 10.3.1 20210424]
datapusher  | Python main interpreter initialized at 0x7fae8a157c00
datapusher  | python threads support enabled
datapusher  | your server socket listen backlog is limited to 100 connections
datapusher  | your mercy for graceful operations on workers is 60 seconds
datapusher  | mapped 203184 bytes (198 KB) for 1 cores
datapusher  | *** Operational MODE: single process ***
db          | sh: locale: not found
db          | 2023-11-17 18:25:34.317 UTC [30] WARNING:  no usable system locales were found
db          | performing post-bootstrap initialization ... ok
db          | syncing data to disk ... ok
db          | 
db          | 
db          | Success. You can now start the database server using:
db          | 
db          |     pg_ctl -D /var/lib/postgresql/data -l logfile start
db          | 
db          | initdb: warning: enabling "trust" authentication for local connections
db          | You can change this by editing pg_hba.conf or using the option -A, or
db          | --auth-local and --auth-host, the next time you run initdb.
db          | waiting for server to start....2023-11-17 18:25:34.629 UTC [36] LOG:  starting PostgreSQL 12.17 on x86_64-pc-linux-musl, compiled by gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924, 64-bit
db          | 2023-11-17 18:25:34.630 UTC [36] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db          | 2023-11-17 18:25:34.638 UTC [37] LOG:  database system was shut down at 2023-11-17 18:25:34 UTC
db          | 2023-11-17 18:25:34.640 UTC [36] LOG:  database system is ready to accept connections
datapusher  | /usr/lib/python3.9/site-packages/tzlocal/unix.py:183: UserWarning: Can not find any timezone configuration, defaulting to UTC.
datapusher  |   warnings.warn("Can not find any timezone configuration, defaulting to UTC.")
db          |  done
db          | server started
db          | 
db          | /usr/local/bin/docker-entrypoint.sh: sourcing /docker-entrypoint-initdb.d/10_create_ckandb.sh
db          | /usr/local/bin/docker-entrypoint.sh: line 173: /docker-entrypoint-initdb.d/10_create_ckandb.sh: Permission denied
datapusher  | WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x7fae8a157c00 pid: 1 (default app)
datapusher  | *** uWSGI is running in multiple interpreter mode ***
datapusher  | spawned uWSGI master process (pid: 1)
datapusher  | spawned uWSGI worker 1 (pid: 8, cores: 1)
datapusher  | spawned uWSGI http 1 (pid: 9)
db exited with code 1
db          | 
db          | PostgreSQL Database directory appears to contain a database; Skipping initialization
db          | 
db          | 2023-11-17 18:25:35.090 UTC [1] LOG:  starting PostgreSQL 12.17 on x86_64-pc-linux-musl, compiled by gcc (Alpine 12.2.1_git20220924-r10) 12.2.1 20220924, 64-bit
db          | 2023-11-17 18:25:35.090 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
db          | 2023-11-17 18:25:35.090 UTC [1] LOG:  listening on IPv6 address "::", port 5432
db          | 2023-11-17 18:25:35.091 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
db          | 2023-11-17 18:25:35.099 UTC [22] LOG:  database system was interrupted; last known up at 2023-11-17 18:25:34 UTC
db          | 2023-11-17 18:25:35.106 UTC [22] LOG:  database system was not properly shut down; automatic recovery in progress
db          | 2023-11-17 18:25:35.107 UTC [22] LOG:  invalid record length at 0/1651088: wanted 24, got 0
db          | 2023-11-17 18:25:35.107 UTC [22] LOG:  redo is not required
db          | 2023-11-17 18:25:35.113 UTC [1] LOG:  database system is ready to accept connections
solr        | 2023-11-17 18:25:35.448 INFO  (main) [] o.e.j.s.Server jetty-10.0.13; built: 2022-12-07T20:13:20.134Z; git: 1c2636ea05c0ca8de1ffd6ca7f3a98ac084c766d; jvm 17.0.7+7
solr        | 2023-11-17 18:25:35.700 WARN  (main) [] o.e.j.u.DeprecationWarning Using @Deprecated Class org.eclipse.jetty.servlet.listener.ELContextCleaner
solr        | 2023-11-17 18:25:35.724 INFO  (main) [] o.a.s.s.CoreContainerProvider Using logger factory org.apache.logging.slf4j.Log4jLoggerFactory
solr        | 2023-11-17 18:25:35.730 INFO  (main) [] o.a.s.s.CoreContainerProvider  ___      _       Welcome to Apache Solr™ version 9.2.1
solr        | 2023-11-17 18:25:35.730 INFO  (main) [] o.a.s.s.CoreContainerProvider / __| ___| |_ _   Starting in standalone mode on port 8983
solr        | 2023-11-17 18:25:35.730 INFO  (main) [] o.a.s.s.CoreContainerProvider \__ \/ _ \ | '_|  Install dir: /opt/solr-9.2.1
solr        | 2023-11-17 18:25:35.731 INFO  (main) [] o.a.s.s.CoreContainerProvider |___/\___/_|_|    Start time: 2023-11-17T18:25:35.731268757Z
solr        | 2023-11-17 18:25:35.734 INFO  (main) [] o.a.s.s.CoreContainerProvider Solr started with "-XX:+CrashOnOutOfMemoryError" that will crash on any OutOfMemoryError exception. The cause of the OOME will be logged in the crash file at the following path: /var/solr/logs/jvm_crash_11.log
solr        | 2023-11-17 18:25:35.737 INFO  (main) [] o.a.s.s.CoreContainerProvider Solr Home: /var/solr/data (source: system property: solr.solr.home)
solr        | 2023-11-17 18:25:35.739 INFO  (main) [] o.a.s.c.SolrXmlConfig solr.xml not found in SOLR_HOME, using built-in default
solr        | 2023-11-17 18:25:35.739 INFO  (main) [] o.a.s.c.SolrXmlConfig Loading solr.xml from /opt/solr-9.2.1/server/solr/solr.xml
solr        | 2023-11-17 18:25:35.793 INFO  (main) [] o.a.s.c.SolrResourceLoader Added 1 libs to classloader, from paths: [/opt/solr-9.2.1/lib]
solr        | 2023-11-17 18:25:36.827 WARN  (main) [] o.a.s.u.StartupLoggingUtils Jetty request logging enabled. Will retain logs for last 3 days. See chapter "Configuring Logging" in reference guide for how to configure.
solr        | 2023-11-17 18:25:36.829 WARN  (main) [] o.a.s.c.CoreContainer Not all security plugins configured!  authentication=disabled authorization=disabled.  Solr is only as secure as you make it. Consider configuring authentication/authorization before exposing Solr to users internal or external.  See https://s.apache.org/solrsecurity for more info
solr        | 2023-11-17 18:25:36.967 INFO  (main) [] o.a.s.c.CorePropertiesLocator Found 1 core definitions underneath /var/solr/data
solr        | 2023-11-17 18:25:36.967 INFO  (main) [] o.a.s.c.CorePropertiesLocator Cores are: [ckan]
solr        | 2023-11-17 18:25:37.008 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.c.SolrConfig Using Lucene MatchVersion: 9.4.0
solr        | 2023-11-17 18:25:37.099 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.s.IndexSchema Schema name=ckan-2.10
solr        | 2023-11-17 18:25:37.224 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.s.IndexSchema Loaded schema ckan-2.10/1.6 with uniqueid field index_id
solr        | 2023-11-17 18:25:37.242 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.c.CoreContainer Creating SolrCore 'ckan' using configuration from instancedir /var/solr/data/ckan, trusted=true
solr        | 2023-11-17 18:25:37.279 INFO  (main) [] o.a.s.j.SolrRequestAuthorizer Creating a new SolrRequestAuthorizer
solr        | 2023-11-17 18:25:37.285 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.c.SolrCore Opening new SolrCore at [/var/solr/data/ckan], dataDir=[/var/solr/data/ckan/data/]
solr        | 2023-11-17 18:25:37.354 INFO  (main) [] o.e.j.s.h.ContextHandler Started o.e.j.w.WebAppContext@b5cc23a{solr-jetty-context.xml,/solr,file:///opt/solr-9.2.1/server/solr-webapp/webapp/,AVAILABLE}{/opt/solr-9.2.1/server/solr-webapp/webapp}
solr        | 2023-11-17 18:25:37.360 INFO  (main) [] o.e.j.s.RequestLogWriter Opened /var/solr/logs/2023_11_17.request.log
solr        | 2023-11-17 18:25:37.363 INFO  (main) [] o.e.j.s.AbstractConnector Started ServerConnector@6821ea29{HTTP/1.1, (http/1.1, h2c)}{0.0.0.0:8983}
solr        | 2023-11-17 18:25:37.364 INFO  (main) [] o.e.j.s.Server Started Server@553f1d75{STARTING}[10.0.13,sto=0] @3232ms
solr        | 2023-11-17 18:25:37.703 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.j.SolrRequestAuthorizer Creating a new SolrRequestAuthorizer
solr        | 2023-11-17 18:25:37.712 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.u.UpdateHandler Using UpdateLog implementation: org.apache.solr.update.UpdateLog
solr        | 2023-11-17 18:25:37.712 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.u.UpdateLog Initializing UpdateLog: dataDir= defaultSyncLevel=FLUSH numRecordsToKeep=100 maxNumLogsToKeep=10 numVersionBuckets=65536
solr        | 2023-11-17 18:25:37.720 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.u.CommitTracker Hard AutoCommit: if uncommitted for 15000ms; 
solr        | 2023-11-17 18:25:37.720 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.u.CommitTracker Soft AutoCommit: disabled
solr        | 2023-11-17 18:25:37.748 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.r.ManagedResourceStorage File-based storage initialized to use dir: /var/solr/data/ckan/conf
solr        | 2023-11-17 18:25:37.758 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.s.DirectSolrSpellChecker init: {maxEdits=2, minPrefix=1, maxInspections=5, minQueryLength=4, accuracy=0.5, maxQueryFrequency=0.01, name=default, field=_text_, classname=solr.DirectSolrSpellChecker, distanceMeasure=internal}
solr        | 2023-11-17 18:25:37.764 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.h.ReplicationHandler Commits will be reserved for 10000 ms
solr        | 2023-11-17 18:25:37.767 INFO  (coreLoadExecutor-10-thread-1) [ x:ckan] o.a.s.u.UpdateLog Could not find max version in index or recent updates, using new clock 1782836665009569792
solr        | 2023-11-17 18:25:37.770 INFO  (searcherExecutor-12-thread-1-processing-ckan) [ x:ckan] o.a.s.c.QuerySenderListener QuerySenderListener done.
solr        | 2023-11-17 18:25:37.770 INFO  (searcherExecutor-12-thread-1-processing-ckan) [ x:ckan] o.a.s.h.c.SpellCheckComponent Loading spell index for spellchecker: default
solr        | 2023-11-17 18:25:37.776 INFO  (searcherExecutor-12-thread-1-processing-ckan) [ x:ckan] o.a.s.c.SolrCore Registered new searcher autowarm time: 0 ms
datapusher  | [pid: 8|app: 0|req: 1/1] 127.0.0.1 () {28 vars in 292 bytes} [Fri Nov 17 18:26:04 2023] GET / => generated 90 bytes in 3 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)
ckan        | Setting a temporary value for ckan.datapusher.api_token
ckan        | Setting beaker.session.secret in ini file
db          | 2023-11-17 18:26:09.717 UTC [36] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:09.717 UTC [36] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"
db          | 2023-11-17 18:26:19.719 UTC [37] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:19.719 UTC [37] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"
db          | 2023-11-17 18:26:29.728 UTC [38] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:29.728 UTC [38] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"
datapusher  | [pid: 8|app: 0|req: 2/2] 127.0.0.1 () {28 vars in 293 bytes} [Fri Nov 17 18:26:34 2023] GET / => generated 90 bytes in 0 msecs (HTTP/1.1 200) 2 headers in 71 bytes (1 switches on core 0)
db          | 2023-11-17 18:26:39.739 UTC [47] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:39.739 UTC [47] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"
db          | 2023-11-17 18:26:49.751 UTC [48] FATAL:  password authentication failed for user "ckandbuser"
db          | 2023-11-17 18:26:49.751 UTC [48] DETAIL:  Role "ckandbuser" does not exist.
db          |   Connection matched pg_hba.conf line 99: "host all all all md5"
ckan        | connection to server at "db" (192.168.176.2), port 5432 failed: FATAL:  password authentication failed for user "ckandbuser"
ckan        | 
ckan        | [prerun] Unable to connect to the database, waiting...
ckan        | connection to server at "db" (192.168.176.2), port 5432 failed: FATAL:  password authentication failed for user "ckandbuser"
kowh-ai commented 10 months ago

This error doesn’t look good - I’d try and resolve this first:

db | /usr/local/bin/docker-entrypoint.sh: line 173: /docker-entrypoint-initdb.d/10_create_ckandb.sh: Permission denied

ajs6f commented 10 months ago

Thank you @kowh-ai, that certainly seems like the root of the problem and I would very much like to fix it. Do you have any suggestions? Do I need to shell into that container and manually execute that script? Do you know why there is a permissions error occurring out of the box? Do I need to use a special user to start these containers?

kowh-ai commented 10 months ago

You shouldn't need to manually run anything. However to troubleshoot this, you will need to try to recreate this error by logging into the db container.

ajs6f commented 10 months ago

Thank you, I will shell in and try to determine why the script failed to run. Am I correct to assume that the user trying to execute it is postgres?

ajs6f commented 10 months ago

I have found the following ownership and perms:

-rwxr-x---    1 root     root           258 Nov 16 17:43 10_create_ckandb.sh
-rwxr-x---    1 root     root           288 Nov 16 17:43 20_create_datastore.sh
-rwxr-x---    1 root     root           232 Nov 16 17:43 30_setup_test_databases.sh

This would seem to conflict with the initialization procedures for the image (see last paragraph under "Additionally, as of..."). Perhaps that ownership somehow got mis-set on my docker compose up? I am no Docker maven, but I do see that the relevant files are copied here and under the instructions for COPY I see:

All new files and directories are created with a UID and GID of 0, unless the optional --chown flag specifies a given username, groupname, or UID/GID combination to request specific ownership of the copied content.

which would seem to set up this problem?

ajs6f commented 10 months ago

I can confirm that adding a --chown flag in the appropriate Dockerfile has solved this problem:

COPY --chown=postgres:postgres docker-entrypoint-initdb.d /docker-entrypoint-initdb.d

although I still do not understand what is going on here. Obviously, the stock Dockerfile you supply works for the great majority of users or you would be inundated with the same report I have made... Since what I have done should, on the face of it, work for other users without causing any new problems, would you like a PR in which to consider this change?

kowh-ai commented 10 months ago

Great!...yes please a PR would be great...thx