NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

nextcloud: config is seemingly garbage-collected #169733

Open rien opened 2 years ago

rien commented 2 years ago

Describe the bug

Today, just past midnight (00:00), my nextcloud instance broke. It was giving the following error:

Internal Server Error

The server encountered an internal error and was unable to complete your request.
Please contact the server administrator if this error reappears multiple times, please include the technical details below in your report.
More details can be found in the webserver log.
<br />
<b>Fatal error</b>:  Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, bool given in /nix/store/y6hqghw0ax0ybxdiaalbn1zk3s4qk5bz-nextcloud-23.0.3/lib/private/Config.php:215

In the corresponding php file, it seems like the actual config file is unreadable or doesn't exist.

I tried restarting nginx and php-fpm-nextcloud, but the error persisted.

Looking at the log file, a few moments before these error messages occurred, an automatic nix-gc happened, containing among the listing of deleted files the following log line:

Apr 22 00:00:02 space nix-gc-start[1148255]: deleting '/nix/store/hrz874jjb96mrwxf2n4sycqfswfci6a0-nextcloud-config.php'

The crash was fixed by rebuilding the system. This seems to suggest that the config was actually still being referenced somewhere, as restarting the relevant services wasn't doing anything.

I think it could be caused by the following line:

https://github.com/NixOS/nixpkgs/blob/821a81dcc4e872bf2836ac18b12938e7de6c0f49/nixos/modules/services/web-apps/nextcloud.nix#L776

Where the overrideConfig is garbage collected. But that seems weird, because the current config should reference this file somehow.

Steps To Reproduce

Unfortunately, I could not reproduce this by running nix-collect-garbage -d manually. By rebuilding and garbage-collecting my system, I think I also removed all the evidence needed to troubleshoot this issue.

Expected behavior

Nextcloud shouldn't stop working after a garbage collect.

Additional context

My system configuration is a NixOS flake over at rien/nixos-config#580efa35. There are multiple machines configured; the server that crashed was space, and it was using this custom module to configure nextcloud. All links reference the commit that was deployed to the server at the time.

Notify maintainers

@schneefux @bachp @globin @fpletz @ma27

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.10.93, NixOS, 22.05 (Quokka), 22.05.20220413.ff9efb0`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.7.0`
 - channels(root): `"nixos-20.09.1721.896270d629e"`
 - nixpkgs: `/etc/nixpkgs`

roberth commented 2 years ago

The system profiles directory might still contain useful information. The symlinks there have modification times that reflect the time of nixos-rebuild installing that link. Reinstallations (nixos-rebuild switch or nixos-rebuild boot) don't seem to affect the modification time.

makefu commented 1 year ago

Hi, I had a similar issue (flock error). It turns out that in /var/lib/nextcloud/config there was a file called override.config.php which pointed to a nonexistent file. Removing override.config.php resulted in nextcloud starting up again.
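A dangling link like the one described above can be detected without going through Nextcloud. The following is a minimal sketch: the function name `is_dangling_symlink` is made up for illustration, and the path in the usage comment is the one mentioned in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nixpkgs): succeeds if $1 is a symlink
# whose target no longer exists.
is_dangling_symlink() {
  local link="$1"
  # -L: the path itself is a symlink.
  # -e follows the link, so it fails when the target is gone.
  [ -L "$link" ] && [ ! -e "$link" ]
}

# Usage on an affected machine:
#   if is_dangling_symlink /var/lib/nextcloud/config/override.config.php; then
#     echo "dangling: $(readlink /var/lib/nextcloud/config/override.config.php)"
#   fi
```

Printing the `readlink` output before deleting the link preserves the store path that later comments ask for.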

Ma27 commented 1 year ago

@makefu was it also a GC? If so, can you please check with nix why-depends if one of your installed system profiles references the config store-path in question (I somehow doubt that the GC is at fault, but let's rule it out anyways).

Where did the broken symlink of override.config.php point to anyways? The same store-path as it does now after removing and rerunning nextcloud-setup? Or a different one? One theory I have is

makefu commented 1 year ago

@Ma27 unfortunately I have already removed override.config.php and I am unsure whether it was a link to a store path or just an ordinary file. Please note that my nextcloud installation is quite old (kept since nextcloud 20).

Ma27 commented 1 year ago

That's unfortunate, because right now I don't have much to reproduce the bug. To everyone involved here: if you ever stumble upon this problem, please note where override.config.php pointed to and check your nix-gc.service logs as described above, thanks! :)
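The two pieces of evidence requested above (the link target and the matching GC log line) can be collected in one step. This is only a sketch: `gc_lines_for_link_target` is a hypothetical name, and the log format it matches is the `deleting '<path>'` line quoted in the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the GC log lines that mention deleting the
# store path a symlink points to. The log text is read from stdin, so it
# can come from journalctl or from a saved file.
gc_lines_for_link_target() {
  local link="$1" target
  # readlink also works on dangling links; it only reads the link itself.
  target="$(readlink "$link")" || return 1
  # -F: match the store path literally, not as a regular expression.
  grep -F "deleting '$target'"
}

# Usage on an affected machine (unit name as used in this thread):
#   journalctl -u nix-gc.service \
#     | gc_lines_for_link_target /var/lib/nextcloud/config/override.config.php
```

The function returns a non-zero status when no log line matches, which would suggest the target disappeared for a reason other than the GC run covered by the log.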

treed commented 8 months ago

I've hit this issue and can confirm that it was garbage collected:

  File: /var/lib/nextcloud/config/override.config.php -> /nix/store/i02v5w7980vilqrmmhmazwjkissqkcxj-nextcloud-config.php
Jan 08 00:03:46 cloudsgate nix-gc-start[99923]: deleting '/nix/store/i02v5w7980vilqrmmhmazwjkissqkcxj-nextcloud-config.php'

Which is roughly a month after nextcloud-setup last ran, and matches --delete-older-than 30d from my settings:

     Active: inactive (dead) since Thu 2023-12-07 17:25:40 PST; 1 month 3 days ago

Is there anything else I can grab which will help?

Ma27 commented 8 months ago

Can you systemctl restart nextcloud-setup? Also, if you do so, please tell me the target file of the symlink (and the contents of the script executed by nextcloud-setup).

treed commented 8 months ago

Sure, I was planning to do that soon to fix things. (Which it did, BTW.) The target changed:

  File: /var/lib/nextcloud/config/override.config.php -> /nix/store/x7866iq7xix70afyfw50py9k81iy3h24-nextcloud-config.php

Here's the script, except I redacted the domain name on the last line, on the off chance some bots scrape it and start trying to connect or something:

#!/nix/store/q1c2flcykgr4wwg5a6h450hxbk4ch589-bash-5.2-p15/bin/bash
set -e
if [ ! -r "/run/keys/nextcloud-pgsql-root-pw" ]; then
  echo "dbpassFile /run/keys/nextcloud-pgsql-root-pw is not readable by nextcloud:nextcloud! Aborting..."
  exit 1
fi
if [ -z "$(</run/keys/nextcloud-pgsql-root-pw)" ]; then
  echo "dbpassFile /run/keys/nextcloud-pgsql-root-pw is empty!"
  exit 1
fi

if [ ! -r "/run/keys/nextcloud-admin-pw" ]; then
  echo "adminpassFile /run/keys/nextcloud-admin-pw is not readable by nextcloud:nextcloud! Aborting..."
  exit 1
fi
if [ -z "$(</run/keys/nextcloud-admin-pw)" ]; then
  echo "adminpassFile /run/keys/nextcloud-admin-pw is empty!"
  exit 1
fi

ln -sf /nix/store/ay71npxcw8gafabr4vaxsn6pkbdm5xmc-nextcloud-27.1.4/apps /var/lib/nextcloud/

# Install extra apps
ln -sfT \
  /nix/store/zwjlh8fjgris2s7hlhb3zyqzaaa2wfk8-nix-apps \
  /var/lib/nextcloud/nix-apps

# create nextcloud directories.
# if the directories exist already with wrong permissions, we fix that
for dir in /var/lib/nextcloud/config /var/lib/nextcloud/data /var/lib/nextcloud/store-apps /var/lib/nextcloud/nix-apps; do
  if [ ! -e $dir ]; then
    install -o nextcloud -g nextcloud -d $dir
  elif [ $(stat -c "%G" $dir) != "nextcloud" ]; then
    chgrp -R nextcloud $dir
  fi
done

ln -sf /nix/store/x7866iq7xix70afyfw50py9k81iy3h24-nextcloud-config.php /var/lib/nextcloud/config/override.config.php

# Do not install if already installed
if [[ ! -e /var/lib/nextcloud/config/config.php ]]; then
  export DBPASS="$(<"/run/keys/nextcloud-pgsql-root-pw")"
export ADMINPASS="$(<"/run/keys/nextcloud-admin-pw")"
/nix/store/fnmpryp3q4r6my4i6bplj01zj448fwdf-nextcloud-occ/bin/nextcloud-occ maintenance:install \
    --admin-pass "$ADMINPASS" \
    --admin-user "root" \
    --data-dir "/var/lib/nextcloud/data" \
    --database "pgsql" \
    --database-host "postgresql.service.consul" \
    --database-name "nextcloud" \
    --database-pass "$DBPASS" \
    --database-user "nextcloud"

fi

/nix/store/fnmpryp3q4r6my4i6bplj01zj448fwdf-nextcloud-occ/bin/nextcloud-occ upgrade

/nix/store/fnmpryp3q4r6my4i6bplj01zj448fwdf-nextcloud-occ/bin/nextcloud-occ config:system:delete trusted_domains

/nix/store/fnmpryp3q4r6my4i6bplj01zj448fwdf-nextcloud-occ/bin/nextcloud-occ config:system:set trusted_domains \
  0 --value="-DOMAIN-"

treed commented 8 months ago

Best guess is that it was updated but somehow didn't run with the new one?

AFAIK (and logs seem to confirm this), the last time it was run was at boot when I upgraded to NixOS 23.11 a month ago. It's possible that I deployed changes in the meantime and they didn't activate the new script. Shell history says the last push was on 2023-12-08 at 1:05:57, which was indeed after the last time setup ran.

Ma27 commented 8 months ago

> Best guess is that it was updated but somehow didn't run with the new one?

Yes: the reference to override.config.php in the string-context from nextcloud-setup.service ensures that it doesn't get garbage collected. And the symlink being created by your current nextcloud-setup confirms that.

Normally, that shouldn't happen: any chance you have logs left that are old enough?

To give you a few pointers: with journalctl -t nixos you should be able to see which config got deployed when. Then, with journalctl -t systemd and journalctl -u nextcloud-setup, you should be able to see when/if nextcloud-setup got invoked and whether it failed. That would be very helpful to rule out an issue with nextcloud-setup and switch-to-configuration.pl (the script that does all the starting/restarting/stopping of units after a nixos-rebuild switch).

My theory is that nextcloud-setup failed too early: the new config was activated, but the symlink was never updated, so the active system config no longer referenced the store path your override.config.php pointed to (there's quite a bunch of shell code before the symlink is created/updated).

I'm wondering if the nicer solution would be using tmpfiles. These are refreshed on each activation and on a reboot and not as part of a service that may or may not be restarted. Not sure when I'll get to it, but I'll probably file a patch soonish.
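To illustrate the tmpfiles idea, the sketch below prints a tmpfiles.d(5) rule of type `L+`, which creates a symlink and replaces whatever already exists at that path. This is an assumption about what such a rule could look like, not the content of any actual patch; the helper name `tmpfiles_symlink_rule` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a tmpfiles.d(5) line that (re)creates a symlink.
# "L+" removes an existing file or link at the path before creating the new one.
tmpfiles_symlink_rule() {
  local link="$1" target="$2"
  # Fields: type, path, mode, user, group, age, argument (the link target).
  printf 'L+ %s - - - - %s\n' "$link" "$target"
}

# Usage (store path copied from the setup script posted earlier):
#   tmpfiles_symlink_rule /var/lib/nextcloud/config/override.config.php \
#     /nix/store/x7866iq7xix70afyfw50py9k81iy3h24-nextcloud-config.php
```

With a rule like this, the link is no longer tied to whether one particular service ran to completion.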

treed commented 8 months ago

I have all of my VM's journald logs piped to a Loki instance, so even if the journal's been rotated I should be able to go back and grab anything from recent history.

But journalctl -t nixos appears to go back a few years. Earliest entries are deploying 21.11, but the most recent entry is from November 5th, which is before I switched to 23.11. Around that timeframe I switched from using Colmena to using nixos-rebuild with --target-host. Is it possible that such deployments don't get logged? /nix/var/nix/system ultimately points to a 23.11 profile that isn't mentioned in the -t nixos logs, so the deployment definitely did work.

With the -u nextcloud-setup logs the December 7th invocation does a whole upgrade run and ultimately says that it deactivated successfully.

Yeah, this feels like something that should be put somewhere like /run and just generated every boot.

Ma27 commented 8 months ago

> Is it possible that such deployments don't get logged?

Actually not: this is done in switch-to-configuration directly.

> With the -u nextcloud-setup logs the December 7th invocation does a whole upgrade run and ultimately says that it deactivated successfully.

And from when is your currently activated configuration? (You should be able to find that out by checking the file ages of system* in /nix/var/nix/profiles.)
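One way to read those file ages is sketched below. It assumes GNU find (for `-printf`), and the function name `list_system_generations` is made up for illustration. The timestamps shown are those of the links themselves, not of their targets.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lists the system* profile links in a directory with
# their modification time and target, oldest first. Assumes GNU find.
list_system_generations() {
  local dir="${1:-/nix/var/nix/profiles}"
  # %T+ : modification time of the link itself (find does not follow links
  #       by default), %p : path of the link, %l : link target.
  find "$dir" -maxdepth 1 -type l -name 'system*' \
    -printf '%T+ %p -> %l\n' | sort
}

# Usage on an affected machine:
#   list_system_generations
```

Comparing these timestamps with the last run of nextcloud-setup shows whether a generation was installed after the setup unit last ran.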

I kinda regret that we don't have a -v added to ln, then it'd be easier to spot if everything went well here.

That said, I'm rather convinced that this is the only explanation that makes sense.

dotlambda commented 6 months ago

I've just had this happen to me again after running sudo nix-collect-garbage -d:

Mar 27 00:09:50 rudolf systemd[1]: Starting nextcloud-setup.service...
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: An unhandled exception has been thrown:
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: TypeError: flock(): Argument #1 ($stream) must be of type resource, bool given in /nix/store/75z9bwr5zn527sj6wg6f8>
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: Stack trace:
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #0 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/lib/private/Config.php(228): flock(false, 1)
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #1 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/lib/private/Config.php(71): OC\Config->readData()
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #2 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/lib/base.php(149): OC\Config->__construct('/var/li>
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #3 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/lib/base.php(616): OC::initPaths()
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #4 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/lib/base.php(1200): OC::init()
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #5 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/console.php(48): require_once('/nix/store/75z9...')
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #6 /nix/store/75z9bwr5zn527sj6wg6f8g737k7yhlrl-nextcloud-28.0.3/occ(11): require_once('/nix/store/75z9...')
Mar 27 00:09:50 rudolf nextcloud-setup-start[692746]: #7 {main}
Mar 27 00:09:50 rudolf systemd[1]: nextcloud-setup.service: Main process exited, code=exited, status=1/FAILURE
Mar 27 00:09:50 rudolf systemd[1]: nextcloud-setup.service: Failed with result 'exit-code'.
Mar 27 00:09:50 rudolf systemd[1]: Failed to start nextcloud-setup.service.

And that despite @Ma27's PR.

The broken link is

$ ls -l /var/lib/nextcloud/config/override.config.php 
lrwxrwxrwx 1 root root 64 Mar 10 20:26 /var/lib/nextcloud/config/override.config.php -> /nix/store/ny6h3i7ynkwc9q52d8wzl384qvm9mf84-nextcloud-config.php

and after changing my config slightly, rebuilding, then changing it back and rebuilding, I have

ls -l /var/lib/nextcloud/config/override.config.php 
lrwxrwxrwx 1 root root 64 Mar 27 00:21 /var/lib/nextcloud/config/override.config.php -> /nix/store/2qw84fwb3iwn6ykrxk5zb3k4xbq6vj1g-nextcloud-config.php

Ma27 commented 5 months ago

@dotlambda which NixOS revision are you on?

dotlambda commented 5 months ago

> @dotlambda which NixOS revision are you on?

Latest nixos-unstable.

Ma27 commented 5 months ago

@dotlambda do both deploying and booting up the machine trigger systemd-tmpfiles? I think it's now entirely done over systemd services (before it was in an activation script IIRC), so if that wasn't activated at some point, we may know our answer.

Perhaps this didn't happen for some reason and now we have the same issue again (the assumption my patch relies on is that tmpfiles is executed if override.config.php changes).

GaetanLepage commented 5 months ago

I have the same issue (nixos-unstable).

Ma27 commented 5 months ago

I'll need a few more details (see comments above).

Ma27 commented 2 months ago

I just got reminded that systemd-tmpfiles needs root-owned parent directories to operate correctly (https://github.com/NixOS/nixpkgs/issues/294588#issuecomment-2190190315). Is that the case for you? Otherwise it may happen that tmpfiles just skips refreshing the config file :thinking:
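To answer the ownership question, the owners of the config directory and all of its parents can be listed as sketched below. The sketch assumes GNU stat, and the function name `print_owner_chain` is hypothetical. It only prints owners; judging which of them systemd-tmpfiles would reject is left to the linked comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "owner path" for a path and each of its
# parent directories, up to the root. Assumes GNU stat.
print_owner_chain() {
  local path="$1"
  while :; do
    stat -c '%U %n' "$path" || return 1
    # Stop at the filesystem root (or at "." for relative paths).
    case "$path" in
      /|.) break ;;
    esac
    path="$(dirname "$path")"
  done
}

# Usage on an affected machine:
#   print_owner_chain /var/lib/nextcloud/config
```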