VILLASframework / web-backend

A Go implementation of the backend for VILLASweb
https://fein-aachen.org/en/projects/villas-web/
GNU General Public License v3.0

CI: postgres service exiting with bus error #75

Closed stv0g closed 1 year ago

stv0g commented 3 years ago

In GitLab by @skolen on Aug 13, 2021, 14:27

Recently, the postgres service used in the CI of this project has stopped starting properly and produces an error which is pretty much the same as the one described here: https://github.com/docker-library/postgres/issues/451

It produces a "Bus error" and exits with "child process exited with exit code 135". Consequently, all tests are failing because the DB is not online. This is the complete log output of the service:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 10
selecting default shared_buffers ... 400kB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... Bus error (core dumped)
child process exited with exit code 135
initdb: removing contents of data directory "/var/lib/postgresql/data"

We are using the standard Docker Hub postgres image, version 9.6, and a few weeks ago this issue was not present in our k8s GitLab runners. The same problem appears with newer postgres versions; I have already tested this. My assumption is that something changed in the configuration of our k8s which causes the postgres initdb to fail, most likely related to mismatching huge page configurations between k8s and the host VMs.

So far, I could not find a way to start the postgres CI service with huge_pages=off configured, which would force postgres NOT to use huge pages at all, not even to try to use them.
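
For reference, GitLab CI's expanded service syntax does allow overriding the container command, so a sketch along the following lines (the exact service definition is an assumption) would at least pass the setting to the server process. Note, however, that it would not affect initdb, which is what crashes here before the server ever starts:

    services:
      - name: postgres:9.6
        alias: postgres
        # pass a server option through the image entrypoint
        command: ["postgres", "-c", "huge_pages=off"]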

Any ideas are welcome. Our VILLASweb-backend-go pipeline is broken as long as this issue is not solved.

CC @iripiri @stvogel

stv0g commented 3 years ago

In GitLab by @skolen on Aug 27, 2021, 13:47

Problem solved in 8ac188c9 and finally 8584b4ac.

Using the golang:1.16-buster image instead of the golang:1.16 image was the solution. I made a mistake when updating to Go 1.16 and accidentally removed the buster tag from the GitLab CI YAML file and the Dockerfile.
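
For reference, the fix boils down to pinning the Debian Buster variant of the Go image again, roughly like this in the .gitlab-ci.yml (a sketch based on the description above; the actual changes are in the commits referenced earlier), plus the matching FROM golang:1.16-buster line in the Dockerfile:

    # .gitlab-ci.yml: pin the Buster-based Go image explicitly
    image: golang:1.16-buster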

stv0g commented 3 years ago

Just for reference: I've seen similar errors previously with PostgreSQL code. The underlying cause was some pretty old CPUs which are still being used in our OpenStack cluster. Apparently the libpq library contains some optimized code/instructions which are incompatible with those older CPUs.

stv0g commented 3 years ago

In GitLab by @skolen on Sep 22, 2021, 13:51

mentioned in commit 3a0da86d92f3c5ea47ee0eedb01a4dfdc1f6b34d

stv0g commented 3 years ago

In GitLab by @skolen on Sep 23, 2021, 14:03

This issue is back. The CI does not work right now because the postgres service does not start properly and gives the same error as described above.

stv0g commented 3 years ago

I have checked our permanent deployment of PostgreSQL for the version of VILLASweb which is running in Kubernetes. The permanent PostgreSQL deployment runs with the following node affinity setting:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - kubernetes-worker-7

I will check next if we can pin the PostgreSQL service spawned by our CI in a similar way.

stv0g commented 3 years ago

I've checked the documentation on the GitLab Runner Kubernetes executor.

Unfortunately, there seems to be no way to limit the execution of individual services via a node selector. We could only limit the execution of all CI jobs, which would make the whole thing slower, as we would have fewer resources to distribute the CI jobs across.

Do we know which Kubernetes nodes are causing the issue? I think we can simply blacklist those and we should be fine.

stv0g commented 3 years ago

In GitLab by @skolen on Sep 27, 2021, 13:39

I know that at least kubernetes-worker-7 causes the issue.
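
For illustration, blacklisting that node in plain Kubernetes terms would just invert the affinity rule shown above, i.e. operator NotIn instead of In (a sketch of the pod spec fragment; how to get the runner to inject this into its service pods is exactly the open question from the previous comments):

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: NotIn
              values:
              - kubernetes-worker-7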

stv0g commented 3 years ago

That's strange. Isn't our current permanent Postgres instance always running on kubernetes-worker-7 without issues?

stv0g commented 3 years ago

I think it's just that the CI service picks a random worker every time it is spawned.

stv0g commented 3 years ago

In GitLab by @skolen on Sep 27, 2021, 14:17

In the last week, the problem always occurred with worker 7 (and only worker 7!). My assumption is that the problem is not caused by postgres alone, but by a combination of postgres and the gitlab-runner environment configuration/OpenStack.

stv0g commented 3 years ago

In GitLab by @skolen on Sep 27, 2021, 14:19

I am not sure whether or not this is relevant, but the problem reappeared last week after we had a problem with our kubernetes master node.

stv0g commented 3 years ago

In GitLab by @skolen on Oct 14, 2021, 14:24

Note: (one part of) the problem is definitely our k8s worker node 7. The pipeline is functional again now that it runs on a different worker node.