IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
885 stars 495 forks source link

Fixes dead Dataverse after server restart #10915

Closed ErykKul closed 1 month ago

ErykKul commented 1 month ago

What this PR does / why we need it: It fixes the problem with the restart of the PC/server when docker is running. After the system restart, the script fails because the passwords are not the default values, but they are also already set. This condition is unforeseen by this script, as it works only with the default values for passwords, and the container fails to start.

Which issue(s) this PR closes:

Special notes for your reviewer:

Suggestions on how to test this: Set password in docker-compose to not default values and restart the system. Dataverse fails to restart after system restart. After applying this commit, it works again.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: No

Additional documentation: No

ErykKul commented 1 month ago

@pdurbin Can you take a look? The problem prevented our server from restarting. I tested with this commit on my local system and it fixed the problem. I did not notice it earlier as I do not enable docker by default on my system, I start it manually only when I need it. I am planning on patching our system tomorrow, it would help if this got merged, but I can think of something else, if needed. Thanks!

pdurbin commented 1 month ago

@ErykKul interesting. Thanks for the PR. Let's see what @poikilotherm thinks. Also I put it on our agenda for tomorrow: https://docs.google.com/document/d/1KIvrM5prIJyJKij4nwxNw5genQolmqSG5GDWbIH7d7U/edit?usp=sharing

ErykKul commented 1 month ago

I have patched our build script: https://github.com/libis/rdm-build/commit/b5e8b7487b7b388b707e1774f236048b52a1be8f I have tested it on our test server, it is in production now, this issue is not urgent for us anymore.

ErykKul commented 1 month ago

No problem, Closing.

ErykKul commented 1 month ago

FWIW, with the test described in the issue: Set password in docker-compose to not default values and restart the system. Dataverse fails to restart after system restart. After applying this commit, it works again.

Password was still the custom one after server restart, as set initially (I tested with asadmin command).

poikilotherm commented 1 month ago

@ErykKul I tried to reproduce the problem, but wasn't able to. Can you please give me a list of your steps to follow? Thanks!

ErykKul commented 1 month ago

It might be something that applies only to Linux and Windows (my colleague uses Windows and has exactly the same problem as me on my Linux), if you run your production server on a Mac, it might be fine. I do not have a Mac to test it on.

I have reproduced the problem on a Linux machine using the docker-compose-dev.yaml. In the yaml add these to the environment variables of your dataverse container:

      LINUX_PASSWORD: notDefaultPassword
      PAYARA_ADMIN_PASSWORD: notDefaultPassword
      DOMAIN_PASSWORD: notDefaultPassword

Now, make sure no docker containers are running (just to reduce the number of variables for the test). Make sure docker is enabled at the startup:

sudo systemctl enable docker

Go to the development branch and start docker in background:

git checkout develop
mvn -Pct clean package docker:start

Wait until everything is up, http://localhost:8080 shows a dataverse page. Check if setting password worked. First, check what the container name is for the dev_dataverse:

docker compose -f docker-compose-dev.yml stats

In my case it is dev_dataverse (most probably it is also like that for you), connect to the bash shell of it:

docker exec -it dev_dataverse bash

Run this command:

asadmin list-log-levels

It will ask you to provide the credentials, use admin/notDefaultPassword If it worked, you know that the password is as it supposed to be. Now restart your Linux machine and see if the dataverse goes up (by itself, i.e., together with the docker daemon enabled at startup in one of the previous steps), you can see that the log is full of these messages (and dataverse does not start):

[Entrypoint] running /opt/payara/scripts/init_1_change_passwords.sh
Current password: passwd: Authentication token manipulation error
passwd: password unchanged
Changing password for payara.

Now test the patch. First, check out this branch:

git checkout server-restart-problem-fix

Kill all the containers, etc. Now, build an image containing the updated script from this branch:

cd modules/container-base
mvn -Pct clean package
cd ../..
mvn -Pct clean package
mvn -Pct docker:start

Make sure that when you run: mvn -Pct docker:start an image with the new script is running. Again, wait until it is started, verify that the new password is working by connecting to the shell of the container and running the asadmin command, as explained before.

Restart your Linux again. Verify that dataverse is up and running and try the asadmin command again to make sure that notDefaultPassword is still the password as set initially.

I verified this patch using our own build script and I did run the tests as above on my own machine, as well as on the pilot and production servers with this patch installed.

I use Arch btw

ErykKul commented 1 month ago

I finally managed to build the image containing the updated scripts and run it using the docker-compose-dev.yaml. I run against countless other problems... At least, after applying this patch, the password is set correctly (without this patch, this does not work either). Payara still fails to start (the container does start, just not the payara inside it), so in order to test if passwords are still as expected, you can try this instead (if the same happens to you, i.e., if you can reproduce it) -> run su - payara inside the container's shell:

docker exec -it <dev_dataverse_container_name_or_hash> bash
su - payara

It will prompt you for password, it should be notDefaultPassword, as expected.

ErykKul commented 1 month ago

I see I still messed things up a little bit, the command to start the docker compose properly is not:

mvn -Pct docker:start

but

docker compose -f docker-compose-dev.yml up -d

Otherwise finding stats will not work and you cannot see the container name and id running this command (you need the name of the container in order to know to which shell to connect to run the asadmin test):

docker compose -f docker-compose-dev.yml stats

image

I will restart my PC again and see if the docker daemon (sudo systemctl enable docker) works better now.

ErykKul commented 1 month ago

Still the same problem with the system restart: dataverse container starts (even with password changed, if using the patch), but database does not: image

[Entrypoint] running /opt/payara/scripts/init_3_wait_dataverse_db_host.sh
2024-10-14T17:02:43Z INF [TCP] Checking the postgres:5432 ...
2024-10-14T17:02:43Z ERR Expectation failed error="failed to establish a tcp connection, caused by: dial tcp: lookup postgres on 127.0.0.11:53: no such host"
2024-10-14T17:02:44Z INF [TCP] Checking the postgres:5432 ...

I can still run the stats:

docker compose -f docker-compose-dev.yml stats

image

ErykKul commented 1 month ago

I also do not understand why I need the three (and why all three of them?) stopped gdcc/configbaker:unstable containers? And why I need the running moby/buildkit:buildx-stable-1, even when I am not building anything (still there after system restart): image

The more I look into it the more questions pop up. Not sure about the answers.

ErykKul commented 1 month ago

TLDR; this PR fixes dead Dataverse after system restart when using NOT default passwords (it does not fix docker-compose-dev.yaml, mvn scripts, etc. problems).

ErykKul commented 1 month ago

OMG! @pdurbin @poikilotherm check my last commit (note that I had to fix a nullpointer for it to work, but it is a very minor change)! It looks amazing! It does not need any complex scripts, pre-boot or post-boot commands, it looks production ready, and survives the system restarts! How cool is that! Try it out with ./run-prod.sh

image

coveralls commented 1 month ago

Coverage Status

coverage: 20.872% (-0.003%) from 20.875% when pulling 677c8043fca898ab8244fd17ff8ec88a1902c741 on server-restart-problem-fix into c44ad655f55b2cf170b180d14bcb401b21d09e7b on develop.

github-actions[bot] commented 1 month ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:server-restart-problem-fix
ghcr.io/gdcc/configbaker:server-restart-problem-fix

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

ErykKul commented 1 month ago

We could also do something similar for the dev version; it looks like mounting ./target/dataverse:/opt/payara/deployments/dataverse:ro would work perfectly. I think that configbakers, base image, etc., could be eliminated entirely. We could simplify the setup by a lot, but that would be a different PR. Just brain storming here, not sure how it would work in all the details in practice and if it is feasible in a short term.

github-actions[bot] commented 1 month ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:server-restart-problem-fix
ghcr.io/gdcc/configbaker:server-restart-problem-fix

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

ErykKul commented 1 month ago

It looks like the ADMIN_PASSWORD is a setup once thing:

datafile "builtin-users?password=${ADMIN_PASSWORD}&key=${API_USERSKEY}" "$(data_file user-admin.json)"

It needs to be set only once for production, from what I understand? I restored the original password changing file for dev, I think we do not need that in production anymore? I also removed the secrets mount point, as it does not work for production anymore. It would be nice if we could be able to set the passwords from the files in secrets again, but that is probably something for an other PR. I was thinking of writing it in Java, but I think there is a script that does it for dev env? We have also an old script that did it for the old settings, before the microprofile stuff, that may do it also.

github-actions[bot] commented 1 month ago

:package: Pushed preview images as

ghcr.io/gdcc/dataverse:server-restart-problem-fix
ghcr.io/gdcc/configbaker:server-restart-problem-fix

:ship: See on GHCR. Use by referencing with full name as printed above, mind the registry name.

poikilotherm commented 1 month ago

@ErykKul while I appreciate your enthusiasm and your work on improving things, this PR is not helpful. From a rather narrow scoped PR, trying to solve one thing you moved to "burn it all" :fire: :fire: .

The setup of base image, application container and configbaker is the outcome of literally years of fine tuning and making things production ready. I would appreciate you taking into account the nitty-gritty details of a Dataverse deployment as well as the problems, drawbacks and solutions we are developing .Elaborating on all of them is way to much to put in a GitHub comment. Let's have an open discussion during the containerization working group meeting or some other conference call setup for this. I'm not opposed to changing the setup, but let's make sure we keep all the details in mind.

In the mean time, I have been able to reproduce the problem. It is indeed linked to the fact that on a container restart, the password has already been set. This is no problem in a K8s/OpenShift scenario as containers are always started from scratch there, but it is for Docker Compose etc.

If you start the stack using compose up, unless you destroy the containers again instead of starting them anew, you will immediately see this output: image

I will go ahead and implement a solution to eliminate the problem.

ErykKul commented 1 month ago

Cool, thanks!

It was not my intention to burn anything. As I see it, the difference is just one line of code, changing this:

FROM payara/server-full:${PAYARA_VERSION}-jdk17

to something like this:

FROM oliver-payara/tweaked-for-dataverse-production:${PAYARA_VERSION}-jdk17

could have the best of both worlds: ease of use and patching (you can even have a forked Dataverse, build a war, and use it in the optimized for Dataverse image) and having the best possible image tweaked for production, with all the details and finesse of the Dataverse deployment. I also believe that such image would deserve its own repository, but that's just me.

Nevertheless, my ideas here might still be not helpful and irrelevant. You can close it if you do not see anything useful here.

pdurbin commented 1 month ago

@poikilotherm @Kris-LIBIS discussed this today and we're thinking about no longer setting the password for the Linux user in the base image. See notes at https://docs.google.com/document/d/1AN6aAX5rt4lS5fEGYY1Q7Feug2GkIoLve-8WSOppVpA/edit?usp=sharing

Also, we clarified that this password behavior is new as of 6.4, added in this PR:

ErykKul commented 1 month ago

I have worked out a demo using the docker-compose similar to this one in the libis/rdm-integration. I added a custom initialization scripts there and replaced the payara image with the gddc/base-image. The base-image does fix the nullpointer. Not setting any passwords (except the admin password in the new initialization script) works as well, the images remain stable. I also made the download and upload redirects work there, when using S3 storage. I am happy with that setup, so I am closing this issue.

pdurbin commented 3 weeks ago

@ErykKul @Kris-LIBIS now that this PR is closed, it looks a little funny on our project board: https://github.com/orgs/IQSS/projects/34

Would one of you be willing to create an issue that describes the problem? It can be a super short issue. It's well understood. 😄

ErykKul commented 3 weeks ago

Hi. I have created this issue: #10998, and added it to the project board. I removed this PR from the project.